CyclicDist goes slower on multiple locales

Question    Votes: 2    Answers: 2

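I tried to implement matrix multiplication using the CyclicDist module.

When I test with one locale versus two locales, the single-locale run is much faster. Is it because the communication time between the two Jetson Nano boards is really large, or is my implementation not taking advantage of the way CyclicDist works?

Here is my code: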

use Random, Time, CyclicDist;
var t : Timer;
t.start();

config const size = 10;
const Space = {1..size, 1..size};

const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid: [gridSpace] real;
fillRandom(grid);
const gridSpace2 = Space dmapped Cyclic(startIdx=Space.low);
var grid2: [gridSpace2] real;
fillRandom(grid2);
const gridSpace3 = Space dmapped Cyclic(startIdx=Space.low);
var grid3: [gridSpace] real;
forall i in 1..size do {
    forall j in 1..size do {
        forall k in 1..size do {
            grid3[i,j] += grid[i,k] * grid2[k,j];
        }
    }
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(),"seconds");
writeln("Size of matrix was:", size);
t.clear();

I know that my implementation is not optimal for distributed-memory systems.

performance parallel-processing low-latency chapel parallelism-amdahl
2 Answers
0
votes

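Q: Is it because (1) the communication time between the two Jetson Nano boards is really large, or (2) my implementation is not taking advantage of the way CyclicDist works?

The second option is a sure bet: ~ 100 x worse performance was measured on small-sized CyclicDist data.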

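The documentation explicitly warns about this, saying:

The Cyclic distribution maps indices to locales in a round-robin pattern starting at a given index. ... Limitations: This distribution has not been tuned for performance.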
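As a small illustration of the round-robin mapping that the documentation describes, the sketch below records which locale owns each element of a one-dimensional Cyclic-distributed array. It assumes the same Chapel vintage as the question's code, i.e. the `Cyclic(startIdx=...)` spelling (newer releases rename it to `cyclicDist`); the names `D` and `owner` are made up for this example.

```chapel
use CyclicDist;

// Eight indices dealt out round-robin, with index 1 placed on locale 0.
// (Assumes the older `Cyclic` spelling used in the question.)
const D = {1..8} dmapped Cyclic(startIdx=1);
var owner: [D] int;

// A forall over a distributed array runs each iteration on the locale
// that owns the element, so here.id is that element's owner.
forall o in owner do
  o = here.id;

writeln(owner);
```

When launched with `-nl 2`, the entries should alternate between 0 and 1; on a single locale every entry is 0.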

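The adverse effect on processing efficiency can be demonstrated on a single-locale platform, where all data reside in locale-local memory, so no NUMA inter-board communication costs are ever added. Even there, performance was still ~ 100 x worse than with Vass' single-forall{} sum-product.

(I had not noticed until now that performance considerations led Vass to change the original forall-in-D3-do-{} into a differently configured forall-in-D2-do-for{} tandem-iterated revision. So far, small-size tests built with --fast --ccflags -O3 show the forall-in-D2-do-for{} results to be WORSE by almost half, even worse than the O/P's original triple-forall{} proposal, for sizes below 512x512 after -O3 optimisation, the smallest size, 128x128, being the exception. The top performance, ~ 850 [ns] per cell, was achieved by the original Vass-D3 single-iterator forall{}, surprisingly without --ccflags -O3. This may well change for larger --size={ 1024 | 2048 | 4096 | 8192 } data layouts, all the more so if wider NUMA multi-locale and higher-parallelism devices are put into the race.)

The TiO.run platform uses 1 numLocales, having 2 physical CPU-cores accessible (numPU-s) with 2 maxTaskPar parallelism limit.

The use of CyclicDist affects the DATA-to-memory layout, doesn't it? This was validated by measurements at small sizes, --size={128 | 256 | 512 | 640}, both with and without the minor effects of --ccflags -O3.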

In any case, the Chapel team's insights (on both the design and the testing side) are important here. @Brad was asked to kindly provide similar test coverage and comparisons for larger sizes, --size={1024 | 2048 | 4096 | 8192 | ...}, and for "wider" NUMA platforms with multi-locale and many-locale setups, which are available at Cray for the Chapel team's R&D and which do not suffer from the hardware and ~ 60 [s] limits of the public, sponsored, shared TiO.RUN platform.

// --------------------------------------------------------------------------------------------------------------------------------
// --fast
// ------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took   255818 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product                            took     3075 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product                            took     3040 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product                            took     2198 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product                             took     1974 [us] excl. fillRandom()-ops <-- 127x SLOWER with CyclicDist dmapped DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product                            took     2122 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took   252439 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took  2141444 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product                            took    27095 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product                            took    25339 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product                            took    23493 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product                             took    21631 [us] excl. fillRandom()-ops <-- 98x SLOWER than w/o CyclicDist dmapped data
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product                            took    21971 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took  2122417 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16988685 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17448207 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product                            took   268111 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product                            took   270289 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product                            took   250896 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product                             took   239898 [us] excl. fillRandom()-ops <-- 71x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product                            took   257479 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17391049 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16932503 [us] excl. fillRandom()-ops <~~ ~2e5 [us] faster without --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35136377 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product                            took   362205 [us] incl. fillRandom()-ops <-- 97x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product                            took   367651 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product                            took   345865 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product                             took   337896 [us] excl. fillRandom()-ops <-- 103x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product                            took   351101 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35052849 [us] excl. fillRandom()-ops <~~ ~3e4 [us] faster without --ccflags -O3
//
// --------------------------------------------------------------------------------------------------------------------------------
// --fast --ccflags -O3
// --------------------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took   250372 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product                            took     3189 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product                            took     2966 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product                            took     2284 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product                             took     1949 [us] excl. fillRandom()-ops <-- 126x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product                            took     2072 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took   246965 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took  2114615 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product                            took    37775 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product                            took    38866 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product                            took    32384 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product                             took    29264 [us] excl. fillRandom()-ops <-- 71x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product                            took    33973 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took  2098344 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17136826 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081273 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product                            took   251786 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product                            took   266766 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product                            took   239301 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product                             took   233003 [us] excl. fillRandom()-ops <~~ ~6e3 [us] faster with --ccflags -O3
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product                            took   253642 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17025339 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081352 [us] excl. fillRandom()-ops <~~ ~2e5 [us] slower with --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35164630 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product                            took   363060 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product                            took   489529 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product                            took   345742 [us] excl. fillRandom()-ops <-- 104x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product                             took   353353 [us] excl. fillRandom()-ops <-- 102x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product                            took   471213 [us] excl. fillRandom()-ops <~~~12e5 [us] slower with --ccflags -O3
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35075435 [us] excl. fillRandom()-ops
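The "incl." and "excl. fillRandom()-ops" figures above separate setup time from compute time. The answer does not show its harness, so the following is only a sketch of one way to take both readings in a single run, for a non-distributed sum-product. It assumes the `Timer` type used in the question (later Chapel releases rename it to `stopwatch`); the names `A`, `B`, `C`, `tAll` and `tCompute` are hypothetical.

```chapel
use Random, Time;

config const size = 128;
const Space = {1..size, 1..size};
var A, B, C: [Space] real;

// Timer as in the question's code; an assumption about the Chapel version.
var tAll, tCompute: Timer;

tAll.start();                 // covers fillRandom() plus the compute
fillRandom(A);
fillRandom(B);

tCompute.start();             // covers the compute only
forall (i, j) in Space do
  for k in 1..size do         // serial inner loop, one task per C[i, j]
    C[i, j] += A[i, k] * B[k, j];
tCompute.stop();
tAll.stop();

writeln("incl. fillRandom(): ", tAll.elapsed(TimeUnits.microseconds), " [us]");
writeln("excl. fillRandom(): ", tCompute.elapsed(TimeUnits.microseconds), " [us]");
```

Compiling with `--fast` and passing `--size=256` (and so on) at run time would give readings comparable in kind, though not necessarily in value, to the table above.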

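0
votes

The main reason this program does not scale is probably that the computation never uses any locale other than the initial one. Specifically, forall loops over ranges, like the ones in your code:

forall i in 1..size do

always run all of their iterations using tasks executing on the current locale. This is because ranges are not distributed values in Chapel, so their parallel iterators do not distribute work across locales. As a result, all size**3 executions of the loop body:

grid3[i,j] += grid[i,k] * grid2[k,j];

will run on locale 0, and none of them will run on locale 1. You can see that this is the case by putting the following into the body of the innermost loop:

writeln("locale ", here.id, " running ", (i,j,k));

(where here.id gives the ID of the locale on which the current task is running). This will show that locale 0 is running all of the iterations:

0 running (9, 1, 1)
0 running (1, 1, 1)
0 running (1, 1, 2)
0 running (9, 1, 2)
0 running (1, 1, 3)
0 running (9, 1, 3)
0 running (1, 1, 4)
0 running (1, 1, 5)
0 running (1, 1, 6)
0 running (1, 1, 7)
0 running (1, 1, 8)
0 running (1, 1, 9)
0 running (6, 1, 1)
...

Contrast this with running a forall loop over a distributed domain such as gridSpace, where the iterations will be distributed between the locales.

Since all of the computation runs on locale 0 but half of the data is located on locale 1 (because the arrays are distributed), a lot of communication is generated to fetch remote values from locale 1's memory to locale 0 in order to compute on them.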
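To make the contrast concrete, here is a minimal sketch of the multiplication restructured so that the parallel loop runs over the distributed domain rather than over ranges. It is not the answerer's code: the array `ranOn`, the constant fill values and the per-locale tally are additions for illustration, and the `Cyclic(startIdx=...)` spelling assumes the same Chapel vintage as the question.

```chapel
use CyclicDist;

config const size = 10;
const Space = {1..size, 1..size};
const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid, grid2, grid3: [gridSpace] real;
var ranOn: [gridSpace] int;   // hypothetical: records where each (i, j) ran

grid = 1.0;                   // constant inputs keep the result easy to check
grid2 = 2.0;

// Iterating over the distributed domain runs each (i, j) on the locale
// that owns grid3[i, j], so the final write is local.
forall (i, j) in gridSpace {
  var sum = 0.0;
  // Serial inner loop: one task owns the accumulator, so the += does not race.
  for k in 1..size do
    sum += grid[i, k] * grid2[k, j];   // these reads may still be remote
  grid3[i, j] = sum;
  ranOn[i, j] = here.id;
}

// Tally how many iterations each locale executed.
var perLocale: [0..#numLocales] int;
for r in ranOn do
  perLocale[r] += 1;
writeln("iterations per locale: ", perLocale);
```

With more than one locale the tally should no longer be concentrated on locale 0. The reads of `grid[i, k]` and `grid2[k, j]` can still cross locales under a cyclic layout, so this sketch spreads the work but does not by itself remove the communication.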

