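For any simple operation, a `readonly struct` containing a single primitive should be more or less as fast as the primitive itself.
All of the tests below were run on .NET Core 2.2 on Windows 7 x64, with code optimization enabled. I get similar results when testing on .NET 4.7.2.
Testing this premise with the `long` type, it appears to hold: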
// =============== SETUP ===================
public readonly struct LongStruct
{
public readonly long Primitive;
public LongStruct(long value) => Primitive = value;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static LongStruct Add(in LongStruct lhs, in LongStruct rhs)
=> new LongStruct(lhs.Primitive + rhs.Primitive);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static long LongAdd(long lhs, long rhs) => lhs + rhs;
// =============== TESTS ===================
public static long TestLong(long a, long b, out long result)
{
var sw = Stopwatch.StartNew();
for (var i = 1000000000; i > 0; --i)
{
a = LongAdd(a, b);
}
sw.Stop();
result = a;
return sw.ElapsedMilliseconds;
}
public static long TestLongStruct(LongStruct a, LongStruct b, out LongStruct result)
{
var sw = Stopwatch.StartNew();
for (var i = 1000000000; i > 0; --i)
{
a = LongStruct.Add(a, b);
}
sw.Stop();
result = a;
return sw.ElapsedMilliseconds;
}
// ============= TEST LOOP =================
public static void RunTests()
{
var longStruct = new LongStruct(1);
var count = 0;
var longTime = 0L;
var longStructTime = 0L;
while (true)
{
count++;
Console.WriteLine("Test #" + count);
longTime += TestLong(1, 1, out var longResult);
var longMean = longTime / count;
Console.WriteLine($"Long: value={longResult}, Mean Time elapsed: {longMean} ms");
longStructTime += TestLongStruct(longStruct, longStruct, out var longStructResult);
var longStructMean = longStructTime / count;
Console.WriteLine($"LongStruct: value={longStructResult.Primitive}, Mean Time elapsed: {longStructMean} ms");
Console.WriteLine();
}
}
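`LongAdd` is used so that the test loops match: each loop calls a method to do the addition, rather than having the primitive case written inline.
On my machine, the two times settle to within 2% of each other, close enough that I'm convinced they have been optimized to pretty much identical code.
The difference in IL is fairly small (`LongAdd` vs `LongStruct.Add`). `LongStruct.Add` has a few extra instructions:
- a pair of `ldfld` instructions to load `Primitive` from the struct
- a `newobj` instruction to pack the new `long` back into a `LongStruct`
So either the jitter is optimizing these instructions away, or they're basically free.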
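If I take the code above and replace every `long` with `double`, I would expect the same kind of result (slower in absolute terms, since the add instruction is slightly slower, but both by the same margin).
What I actually see is that the `DoubleStruct` version is about 4.8 times slower (i.e. 480%) than the `double` version.
The IL is identical to the `long` case (other than swapping `int64` and `LongStruct` for `float64` and `DoubleStruct`), but somehow the runtime is doing extra work for the `DoubleStruct` case that isn't present in the `LongStruct` case or the `double` case.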
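For reference, the `double` variant described above might look like the following minimal sketch. `DoubleStruct` mirrors `LongStruct`; the `DoubleVariant` class, the `SumPrimitive` and `SumStruct` methods, and the `iterations` parameter are hypothetical names introduced here so the two loops can be compared on a small input. They are not part of the original test harness.

```csharp
using System;
using System.Runtime.CompilerServices;

public readonly struct DoubleStruct
{
    public readonly double Primitive;
    public DoubleStruct(double value) => Primitive = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static DoubleStruct Add(in DoubleStruct lhs, in DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
}

// Hypothetical container for the two loops; not part of the original code.
public static class DoubleVariant
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double DoubleAdd(double lhs, double rhs) => lhs + rhs;

    // Same loop shape as TestLong, but with a configurable iteration count
    // and returning the accumulated value instead of the elapsed time.
    public static double SumPrimitive(double a, double b, int iterations)
    {
        for (var i = iterations; i > 0; --i)
        {
            a = DoubleAdd(a, b);
        }
        return a;
    }

    // Same loop shape as TestLongStruct, using the wrapper struct.
    public static DoubleStruct SumStruct(DoubleStruct a, DoubleStruct b, int iterations)
    {
        for (var i = iterations; i > 0; --i)
        {
            a = DoubleStruct.Add(a, b);
        }
        return a;
    }
}
```

Both loops compute the same value; only the generated machine code, and therefore the timing, is expected to differ.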
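Testing a few other primitive types, I see that `float` (465%) behaves the same way as `double`, while `short` and `int` behave the same way as `long`, so it seems that something about floating point prevents some optimization from being applied.
Why are `DoubleStruct` and `FloatStruct` so much slower than `double` and `float`, when the `long`, `int` and `short` equivalents suffer no such slowdown?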
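This isn't an answer in itself, but it is a somewhat more rigorous benchmark on both x86 and x64, so hopefully it gives more information to someone else who can explain this.
I tried to replicate this with BenchmarkDotNet. I also wanted to see what difference removing `in` would make. I ran it separately as x86 and x64.
x86 (LegacyJIT)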
| Method | Mean | Error | StdDev |
|----------------------- |---------:|---------:|---------:|
| TestLong | 257.9 ms | 2.099 ms | 1.964 ms |
| TestLongStruct | 529.3 ms | 4.977 ms | 4.412 ms |
| TestLongStructWithIn | 526.2 ms | 6.722 ms | 6.288 ms |
| TestDouble | 256.7 ms | 1.466 ms | 1.300 ms |
| TestDoubleStruct | 342.5 ms | 5.189 ms | 4.600 ms |
| TestDoubleStructWithIn | 338.7 ms | 3.808 ms | 3.376 ms |
x64 (RyuJIT)
| Method | Mean | Error | StdDev |
|----------------------- |-----------:|----------:|----------:|
| TestLong | 269.8 ms | 5.359 ms | 9.099 ms |
| TestLongStruct | 266.2 ms | 6.706 ms | 8.236 ms |
| TestLongStructWithIn | 270.4 ms | 4.150 ms | 3.465 ms |
| TestDouble | 270.4 ms | 5.336 ms | 6.748 ms |
| TestDoubleStruct | 1,250.9 ms | 24.702 ms | 25.367 ms |
| TestDoubleStructWithIn | 577.1 ms | 12.159 ms | 16.644 ms |
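I can replicate this on x64 using RyuJIT, but not on x86 using LegacyJIT. This looks like an artifact of RyuJIT managing to optimize the `long` case but not the `double` case; LegacyJIT doesn't manage to optimize either.
I have no idea why TestDoubleStruct is such an outlier on RyuJIT.
Code: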
public readonly struct LongStruct
{
public readonly long Primitive;
public LongStruct(long value) => Primitive = value;
public static LongStruct Add(LongStruct lhs, LongStruct rhs)
=> new LongStruct(lhs.Primitive + rhs.Primitive);
public static LongStruct AddWithIn(in LongStruct lhs, in LongStruct rhs)
=> new LongStruct(lhs.Primitive + rhs.Primitive);
}
public readonly struct DoubleStruct
{
public readonly double Primitive;
public DoubleStruct(double value) => Primitive = value;
public static DoubleStruct Add(DoubleStruct lhs, DoubleStruct rhs)
=> new DoubleStruct(lhs.Primitive + rhs.Primitive);
public static DoubleStruct AddWithIn(in DoubleStruct lhs, in DoubleStruct rhs)
=> new DoubleStruct(lhs.Primitive + rhs.Primitive);
}
public class Benchmark
{
[Benchmark]
public void TestLong()
{
for (var i = 1000000000; i > 0; --i)
{
LongAdd(1, 2);
}
}
[Benchmark]
public void TestLongStruct()
{
var a = new LongStruct(1);
var b = new LongStruct(2);
for (var i = 1000000000; i > 0; --i)
{
LongStruct.Add(a, b);
}
}
[Benchmark]
public void TestLongStructWithIn()
{
var a = new LongStruct(1);
var b = new LongStruct(2);
for (var i = 1000000000; i > 0; --i)
{
LongStruct.AddWithIn(a, b);
}
}
[Benchmark]
public void TestDouble()
{
for (var i = 1000000000; i > 0; --i)
{
DoubleAdd(1, 2);
}
}
[Benchmark]
public void TestDoubleStruct()
{
var a = new DoubleStruct(1);
var b = new DoubleStruct(2);
for (var i = 1000000000; i > 0; --i)
{
DoubleStruct.Add(a, b);
}
}
[Benchmark]
public void TestDoubleStructWithIn()
{
var a = new DoubleStruct(1);
var b = new DoubleStruct(2);
for (var i = 1000000000; i > 0; --i)
{
DoubleStruct.AddWithIn(a, b);
}
}
public static long LongAdd(long lhs, long rhs) => lhs + rhs;
public static double DoubleAdd(double lhs, double rhs) => lhs + rhs;
}
class Program
{
static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<Benchmark>();
Console.ReadLine();
}
}
Just for fun, here is the assembly for both cases (the x86 and x64 listings follow the code):
Code:
using System;
public class C {
public long AddLongs(long a, long b) {
return a + b;
}
public LongStruct AddLongStructs(LongStruct a, LongStruct b) {
return LongStruct.Add(a, b);
}
public LongStruct AddLongStructsWithIn(LongStruct a, LongStruct b) {
return LongStruct.AddWithIn(a, b);
}
public double AddDoubles(double a, double b) {
return a + b;
}
public DoubleStruct AddDoubleStructs(DoubleStruct a, DoubleStruct b) {
return DoubleStruct.Add(a, b);
}
public DoubleStruct AddDoubleStructsWithIn(DoubleStruct a, DoubleStruct b) {
return DoubleStruct.AddWithIn(a, b);
}
}
public readonly struct LongStruct
{
public readonly long Primitive;
public LongStruct(long value) => Primitive = value;
public static LongStruct Add(LongStruct lhs, LongStruct rhs)
=> new LongStruct(lhs.Primitive + rhs.Primitive);
public static LongStruct AddWithIn(in LongStruct lhs, in LongStruct rhs)
=> new LongStruct(lhs.Primitive + rhs.Primitive);
}
public readonly struct DoubleStruct
{
public readonly double Primitive;
public DoubleStruct(double value) => Primitive = value;
public static DoubleStruct Add(DoubleStruct lhs, DoubleStruct rhs)
=> new DoubleStruct(lhs.Primitive + rhs.Primitive);
public static DoubleStruct AddWithIn(in DoubleStruct lhs, in DoubleStruct rhs)
=> new DoubleStruct(lhs.Primitive + rhs.Primitive);
}
x86 assembly
C.AddLongs(Int64, Int64)
L0000: mov eax, [esp+0xc]
L0004: mov edx, [esp+0x10]
L0008: add eax, [esp+0x4]
L000c: adc edx, [esp+0x8]
L0010: ret 0x10
C.AddLongStructs(LongStruct, LongStruct)
L0000: push esi
L0001: mov eax, [esp+0x10]
L0005: mov esi, [esp+0x14]
L0009: add eax, [esp+0x8]
L000d: adc esi, [esp+0xc]
L0011: mov [edx], eax
L0013: mov [edx+0x4], esi
L0016: pop esi
L0017: ret 0x10
C.AddLongStructsWithIn(LongStruct, LongStruct)
L0000: push esi
L0001: mov eax, [esp+0x10]
L0005: mov esi, [esp+0x14]
L0009: add eax, [esp+0x8]
L000d: adc esi, [esp+0xc]
L0011: mov [edx], eax
L0013: mov [edx+0x4], esi
L0016: pop esi
L0017: ret 0x10
C.AddDoubles(Double, Double)
L0000: fld qword [esp+0xc]
L0004: fadd qword [esp+0x4]
L0008: ret 0x10
C.AddDoubleStructs(DoubleStruct, DoubleStruct)
L0000: fld qword [esp+0xc]
L0004: fld qword [esp+0x4]
L0008: faddp st1, st0
L000a: fstp qword [edx]
L000c: ret 0x10
C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
L0000: fld qword [esp+0xc]
L0004: fadd qword [esp+0x4]
L0008: fstp qword [edx]
L000a: ret 0x10
x64 assembly
C..ctor()
L0000: ret
C.AddLongs(Int64, Int64)
L0000: lea rax, [rdx+r8]
L0004: ret
C.AddLongStructs(LongStruct, LongStruct)
L0000: lea rax, [rdx+r8]
L0004: ret
C.AddLongStructsWithIn(LongStruct, LongStruct)
L0000: lea rax, [rdx+r8]
L0004: ret
C.AddDoubles(Double, Double)
L0000: vzeroupper
L0003: vmovaps xmm0, xmm1
L0008: vaddsd xmm0, xmm0, xmm2
L000d: ret
C.AddDoubleStructs(DoubleStruct, DoubleStruct)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: mov [rsp+0x28], rdx
L000c: mov [rsp+0x30], r8
L0011: mov rax, [rsp+0x28]
L0016: mov [rsp+0x10], rax
L001b: mov rax, [rsp+0x30]
L0020: mov [rsp+0x8], rax
L0025: vmovsd xmm0, qword [rsp+0x10]
L002c: vaddsd xmm0, xmm0, [rsp+0x8]
L0033: vmovsd [rsp], xmm0
L0039: mov rax, [rsp]
L003d: add rsp, 0x18
L0041: ret
C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
L0000: push rax
L0001: vzeroupper
L0004: mov [rsp+0x18], rdx
L0009: mov [rsp+0x20], r8
L000e: vmovsd xmm0, qword [rsp+0x18]
L0015: vaddsd xmm0, xmm0, [rsp+0x20]
L001c: vmovsd [rsp], xmm0
L0022: mov rax, [rsp]
L0026: add rsp, 0x8
L002a: ret
If you add the loops:
Code:
public class C {
public void AddLongs(long a, long b) {
for (var i = 1000000000; i > 0; --i) {
long c = a + b;
}
}
public void AddLongStructs(LongStruct a, LongStruct b) {
for (var i = 1000000000; i > 0; --i) {
a = LongStruct.Add(a, b);
}
}
public void AddLongStructsWithIn(LongStruct a, LongStruct b) {
for (var i = 1000000000; i > 0; --i) {
a = LongStruct.AddWithIn(a, b);
}
}
public void AddDoubles(double a, double b) {
for (var i = 1000000000; i > 0; --i) {
a = a + b;
}
}
public void AddDoubleStructs(DoubleStruct a, DoubleStruct b) {
for (var i = 1000000000; i > 0; --i) {
a = DoubleStruct.Add(a, b);
}
}
public void AddDoubleStructsWithIn(DoubleStruct a, DoubleStruct b) {
for (var i = 1000000000; i > 0; --i) {
a = DoubleStruct.AddWithIn(a, b);
}
}
}
public readonly struct LongStruct
{
public readonly long Primitive;
public LongStruct(long value) => Primitive = value;
public static LongStruct Add(LongStruct lhs, LongStruct rhs)
=> new LongStruct(lhs.Primitive + rhs.Primitive);
public static LongStruct AddWithIn(in LongStruct lhs, in LongStruct rhs)
=> new LongStruct(lhs.Primitive + rhs.Primitive);
}
public readonly struct DoubleStruct
{
public readonly double Primitive;
public DoubleStruct(double value) => Primitive = value;
public static DoubleStruct Add(DoubleStruct lhs, DoubleStruct rhs)
=> new DoubleStruct(lhs.Primitive + rhs.Primitive);
public static DoubleStruct AddWithIn(in DoubleStruct lhs, in DoubleStruct rhs)
=> new DoubleStruct(lhs.Primitive + rhs.Primitive);
}
x86
C.AddLongs(Int64, Int64)
L0000: push ebp
L0001: mov ebp, esp
L0003: mov eax, 0x3b9aca00
L0008: dec eax
L0009: test eax, eax
L000b: jg L0008
L000d: pop ebp
L000e: ret 0x10
C.AddLongStructs(LongStruct, LongStruct)
L0000: push ebp
L0001: mov ebp, esp
L0003: push esi
L0004: mov esi, 0x3b9aca00
L0009: mov eax, [ebp+0x10]
L000c: mov edx, [ebp+0x14]
L000f: add eax, [ebp+0x8]
L0012: adc edx, [ebp+0xc]
L0015: mov [ebp+0x10], eax
L0018: mov [ebp+0x14], edx
L001b: dec esi
L001c: test esi, esi
L001e: jg L0009
L0020: pop esi
L0021: pop ebp
L0022: ret 0x10
C.AddLongStructsWithIn(LongStruct, LongStruct)
L0000: push ebp
L0001: mov ebp, esp
L0003: push esi
L0004: mov esi, 0x3b9aca00
L0009: mov eax, [ebp+0x10]
L000c: mov edx, [ebp+0x14]
L000f: add eax, [ebp+0x8]
L0012: adc edx, [ebp+0xc]
L0015: mov [ebp+0x10], eax
L0018: mov [ebp+0x14], edx
L001b: dec esi
L001c: test esi, esi
L001e: jg L0009
L0020: pop esi
L0021: pop ebp
L0022: ret 0x10
C.AddDoubles(Double, Double)
L0000: push ebp
L0001: mov ebp, esp
L0003: mov eax, 0x3b9aca00
L0008: dec eax
L0009: test eax, eax
L000b: jg L0008
L000d: pop ebp
L000e: ret 0x10
C.AddDoubleStructs(DoubleStruct, DoubleStruct)
L0000: push ebp
L0001: mov ebp, esp
L0003: mov eax, 0x3b9aca00
L0008: fld qword [ebp+0x10]
L000b: fld qword [ebp+0x8]
L000e: faddp st1, st0
L0010: fstp qword [ebp+0x10]
L0013: dec eax
L0014: test eax, eax
L0016: jg L0008
L0018: pop ebp
L0019: ret 0x10
C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
L0000: push ebp
L0001: mov ebp, esp
L0003: mov eax, 0x3b9aca00
L0008: fld qword [ebp+0x10]
L000b: fadd qword [ebp+0x8]
L000e: fstp qword [ebp+0x10]
L0011: dec eax
L0012: test eax, eax
L0014: jg L0008
L0016: pop ebp
L0017: ret 0x10
x64
C.AddLongs(Int64, Int64)
L0000: mov eax, 0x3b9aca00
L0005: dec eax
L0007: test eax, eax
L0009: jg L0005
L000b: ret
C.AddLongStructs(LongStruct, LongStruct)
L0000: mov eax, 0x3b9aca00
L0005: add rdx, r8
L0008: dec eax
L000a: test eax, eax
L000c: jg L0005
L000e: ret
C.AddLongStructsWithIn(LongStruct, LongStruct)
L0000: mov eax, 0x3b9aca00
L0005: add rdx, r8
L0008: dec eax
L000a: test eax, eax
L000c: jg L0005
L000e: ret
C.AddDoubles(Double, Double)
L0000: vzeroupper
L0003: mov eax, 0x3b9aca00
L0008: vaddsd xmm1, xmm1, xmm2
L000d: dec eax
L000f: test eax, eax
L0011: jg L0008
L0013: ret
C.AddDoubleStructs(DoubleStruct, DoubleStruct)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: mov [rsp+0x28], rdx
L000c: mov [rsp+0x30], r8
L0011: mov eax, 0x3b9aca00
L0016: mov rdx, [rsp+0x28]
L001b: mov [rsp+0x10], rdx
L0020: mov rdx, [rsp+0x30]
L0025: mov [rsp+0x8], rdx
L002a: vmovsd xmm0, qword [rsp+0x10]
L0031: vaddsd xmm0, xmm0, [rsp+0x8]
L0038: vmovsd [rsp], xmm0
L003e: mov rdx, [rsp]
L0042: mov [rsp+0x28], rdx
L0047: dec eax
L0049: test eax, eax
L004b: jg L0016
L004d: add rsp, 0x18
L0051: ret
C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
L0000: push rax
L0001: vzeroupper
L0004: mov [rsp+0x18], rdx
L0009: mov [rsp+0x20], r8
L000e: mov eax, 0x3b9aca00
L0013: vmovsd xmm0, qword [rsp+0x20]
L001a: vmovaps xmm1, xmm0
L001f: vaddsd xmm1, xmm1, [rsp+0x18]
L0026: vmovsd [rsp], xmm1
L002c: mov rdx, [rsp]
L0030: mov [rsp+0x18], rdx
L0035: dec eax
L0037: test eax, eax
L0039: jg L001a
L003b: add rsp, 0x8
L003f: ret
I'm not familiar enough with assembly to explain exactly what it's doing, but it's clear that `AddDoubleStructs` does more work than `AddLongStructs`.
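See @canton7's answer for some timing results and the x86 asm output that I based my conclusions on. (I don't have Windows or a C# compiler.)

Anomalies: the "release" asm for the loops on SharpLab doesn't match @canton7's BenchmarkDotNet performance numbers for any Intel or AMD CPU. The asm shows that `TestDouble` really does do `a+=b` inside the loop, but the timing shows it running as fast as the 1/clock integer loop. (FP add latency is 3 to 5 cycles on all AMD K8/K10/Bulldozer-family/Ryzen, and on Intel P6 through Skylake.)

Maybe that is only a first-pass optimization, and after running for longer the JIT will optimize away the FP add entirely (because the value isn't returned). So I think unfortunately we still don't have the asm that is really running, but we can see the kind of mess the JIT optimizer makes.

I don't understand how `TestDoubleStructWithIn` can be slower than the integer loop but only twice as slow (not 3x), unless the `long` loops don't run at 1 iteration per clock. With counts this high, startup overhead should be negligible. A loop counter kept in memory could explain it (imposing a bottleneck of ~6 cycles per iteration on everything, hiding the latency of anything except the very slow FP versions). But @canton7 says they tested with a Release build. Their i7-8650U might not be able to sustain max-turbo = 4.20 GHz for all loops because of power / thermal limits (all-core minimum sustained frequency = 1.90 GHz), so looking at time in seconds rather than in cycles could be throwing us off for the loops without a bottleneck. That still doesn't explain primitive `double` running at the same speed as `long`; those must have been optimized away.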
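It's reasonable to expect this class to inline and optimize away the way you're using it. A good compiler would do that. But a JIT has to compile quickly, so it isn't always good, and clearly in this case it isn't for `double`.

For the integer loops, 64-bit integer add on x86-64 has 1 cycle latency, and modern superscalar CPUs have enough throughput to run a loop containing an add at the same speed as an otherwise empty loop that just counts down a counter. So we can't tell from the timing whether the compiler did `a + b * 1000000000` outside the loop (but still ran an empty loop), or something else.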
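To make the `a + b * 1000000000` remark concrete, here is a minimal sketch of why that transformation would be legal for integers. `ClosedFormAdd`, `LoopAdd`, `ClosedForm` and the `iterations` parameter are hypothetical names used only for illustration; nothing here claims that RyuJIT actually performs this rewrite.

```csharp
using System;

// Hypothetical illustration; not code from the question or the answers.
public static class ClosedFormAdd
{
    // The benchmark loop, with a configurable iteration count.
    public static long LoopAdd(long a, long b, int iterations)
    {
        for (var i = iterations; i > 0; --i)
        {
            a = unchecked(a + b); // wrapping 64-bit add, as in the JIT output
        }
        return a;
    }

    // What an optimizer would be allowed to compute instead of looping.
    // Wrapping integer arithmetic is modular, so this matches LoopAdd
    // even when the intermediate sums overflow.
    public static long ClosedForm(long a, long b, int iterations)
        => unchecked(a + b * iterations);
}
```

The same rewrite is not valid for `double` under strict floating-point rules, because each individual addition rounds its result.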
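@canton7 used SharpLab to look at the JIT x86-64 asm for a stand-alone version of `AddDoubleStructs`, and for a loop that calls it: standalone and loops, x86-64, release mode.

We can see that for primitive `long c = a + b` it completely optimized away the add (but kept an empty countdown loop)! If we use `a = a+b;` we get an actual `add` instruction, even though `a` isn't returned from the function.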
loops.AddLongs(Int64, Int64)
L0000: mov eax, 0x3b9aca00 # i = init
# do {
# long c = a+b optimized out
L0005: dec eax # --i;
L0007: test eax, eax
L0009: jg L0005 # }while(i>0);
L000b: ret
But the struct version has an actual `add` instruction, from `a = LongStruct.Add(a, b);`. (We do get the same asm with `a = a+b;` and a primitive `long`.)
loops.AddLongStructs(LongStruct a, LongStruct b)
L0000: mov eax, 0x3b9aca00
L0005: add rdx, r8 # a += b; other insns are identical
L0008: dec eax
L000a: test eax, eax
L000c: jg L0005
L000e: ret
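But if we change it to `LongStruct.Add(a, b);` (without assigning the result anywhere), we get `L0006: add rdx, r8` outside the loop (hoisting `a + b`), and then `L0009: mov rcx, rdx` / `L000c: mov [rsp], rcx` inside the loop. (A register copy followed by a store to dead scratch space, which is totally insane.) In C# (unlike C/C++), writing `a+b;` on its own as a statement is an error, so we can't see whether the primitive equivalent would still produce silly wasted instructions. The compiler reports: Only assignment, call, increment, decrement, await, and new object expressions can be used as a statement.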
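As a small illustration of that language rule, the sketch below shows the rejected statement (commented out) next to a form the compiler does accept, a discard assignment. `DiscardSketch` and `AddAndDiscard` are hypothetical names, and no claim is made here about what asm the JIT emits for the discard.

```csharp
using System;

// Hypothetical illustration of the statement rule; not from the original posts.
public static class DiscardSketch
{
    public static long AddAndDiscard(long a, long b)
    {
        // a + b;    // does not compile: error CS0201 (not a valid statement)
        _ = a + b;   // a discard is an assignment, so this is a valid statement
        return a;    // the sum is never observed by the caller
    }
}
```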
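I don't think we can blame any of these missed optimizations on the structs per se. But even if you benchmark this with and without the `add` in the loop, it wouldn't cause an actual slowdown of this loop on modern x86. The empty loop already hits the 1/clock loop throughput bottleneck with only 2 uops in the loop (`dec` and macro-fused `test/jg`), leaving room for 2 more uops without a slowdown, as long as they don't introduce any bottleneck worse than 1/clock. (https://agner.org/optimize/) For example, `imul edx, r8d` with its 3 cycle latency would slow the loop down by a factor of 3. The "4 uops" front-end throughput assumes a recent Intel CPU. Bulldozer-family is narrower, and Ryzen is 5-wide.

These are non-static member functions of a class (for no reason, but I didn't notice right away, so I'm not changing it now). In the asm calling convention, the first arg (RCX) is a `this` pointer, and args 2 and 3 are the explicit args to the member function (RDX and R8).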
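The JIT code-gen puts an extra `test eax,eax` after `dec eax`, which already sets FLAGS (except for CF, which we're not testing) according to `i - 1`. The starting point is a positive compile-time constant; any C compiler would optimize this to `dec eax` / `jnz`. I think `dec eax` / `jg` would also work, falling through when `dec` produced zero, because `1 > 1` is false.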
The calling convention used by C# on x86-64 passes 8-byte structs in integer registers, which is bad for a struct containing a `double` (because it has to be bounced to XMM registers for `vaddsd` or other FP operations). So there is an unavoidable disadvantage for your struct in non-inlined function calls.

### stand-alone versions of functions: not inlined into a loop
# with primitive double, args are passed in XMM regs
standalone.AddDoubles(Double, Double)
L0000: vzeroupper
L0003: vmovaps xmm0, xmm1 # stupid missed optimization defeating the purpose of AVX 3-operand instructions
L0008: vaddsd xmm0, xmm0, xmm2 # vaddsd xmm0, xmm1, xmm2 would do retval = a + b
L000d: ret
# without `in`. Significantly less bad with `in`, see the link.
standalone.AddDoubleStructs(DoubleStruct a, DoubleStruct b)
L0000: sub rsp, 0x18 # reserve 24 bytes of stack space
L0004: vzeroupper # Weird to use this in a function that doesn't have any YMM vectors...
L0007: mov [rsp+0x28], rdx # spill args 2 (rdx=double a) and 3 (r8=double b) to the stack.
L000c: mov [rsp+0x30], r8 # (first arg = rcx = unused this pointer)
L0011: mov rax, [rsp+0x28]
L0016: mov [rsp+0x10], rax # copy a to another place on the stack!
L001b: mov rax, [rsp+0x30]
L0020: mov [rsp+0x8], rax # copy b to another place on the stack!
L0025: vmovsd xmm0, qword [rsp+0x10]
L002c: vaddsd xmm0, xmm0, [rsp+0x8] # add a and b in the SSE/AVX FPU
L0033: vmovsd [rsp], xmm0 # store the result to yet another stack location
L0039: mov rax, [rsp] # reload it into RAX, the return value
L003d: add rsp, 0x18
L0041: ret

This is totally insane. This is release-mode code-gen, but the compiler stores the structs to memory, then reloads and stores them again before actually loading them into the FPU. (I'm guessing the int->int copy might be a constructor, but I have no idea. I normally look at C/C++ compiler output, which is usually not this dumb in optimized builds.)

Using `in` on the function arg avoids the extra copy of each input to a second stack location, but it still moves them from integer to XMM registers with a store/reload.

That's what gcc does for int->xmm with its default tuning, but it's a missed optimization. Agner Fog says (in his microarch guide) that AMD's optimization manual suggests store/reload when tuning for Bulldozer, but he found that it isn't faster even on AMD. (There, ALU int->xmm has ~10 cycle latency, vs. 2 to 3 cycles on Intel or Ryzen, with 1/clock throughput, the same as stores.)

A good implementation of this function (if we're stuck with the calling convention) would be `vmovq xmm0, rdx` / `vmovq xmm1, r8`, then `vaddsd`, then `vmovq rax, xmm0` / `ret`.
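Primitive `double` optimizes similarly to `long`:
- `double c = a + b;` optimizes away completely.
- `a = a + b` (as @canton7 used) still doesn't, even though the result is still unused. It will bottleneck on `vaddsd` latency (3 to 5 cycles depending on Bulldozer vs. Ryzen vs. Intel pre-Skylake vs. Skylake). But it does stay in registers.

loops.AddDoubles(Double, Double)
L0000: vzeroupper
L0003: mov eax, 0x3b9aca00
# do {
L0008: vaddsd xmm1, xmm1, xmm2 # a += b
L000d: dec eax # --i
L000f: test eax, eax
L0011: jg L0008 # }while(i>0);
L0013: ret

All the store/reload overhead should go away after inlining the function into the loop; that's a big part of the point of inlining. Surprisingly, it doesn't optimize away. The 2x store/reload is on the critical path of the loop-carried data dependency chain (the FP add)! This is a huge missed optimization.

Store/reload latency is about 5 or 6 cycles on modern Intel, slower than an FP add.

loops.AddDoubleStructs(DoubleStruct, DoubleStruct)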
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: mov [rsp+0x28], rdx # spill function args: a
L000c: mov [rsp+0x30], r8 # and b
L0011: mov eax, 0x3b9aca00 # i= init
# do {
L0016: mov rdx, [rsp+0x28]
L001b: mov [rsp+0x10], rdx # tmp_a = copy a to another local
L0020: mov rdx, [rsp+0x30]
L0025: mov [rsp+0x8], rdx # tmp_b = copy b
L002a: vmovsd xmm0, qword [rsp+0x10] # tmp_a
L0031: vaddsd xmm0, xmm0, [rsp+0x8] # + tmp_b
L0038: vmovsd [rsp], xmm0 # tmp_a = sum
L003e: mov rdx, [rsp]
L0042: mov [rsp+0x28], rdx # a = copy tmp_a
L0047: dec eax # --i;
L0049: test eax, eax
L004b: jg L0016 # }while(i>0)
L004d: add rsp, 0x18
L0051: ret
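In this loop, `a` is loaded and stored on the way to XMM0, and then again on the way back.

The primitive `double` loop optimizes to a simple loop that keeps everything in registers, with no clever optimizations that would violate strict FP, i.e. not turning it into a multiply, or using multiple accumulators to hide FP add latency. (But we know from the `long` version that the compiler wouldn't do anything better either way.) It does all the adds as one long dependency chain, so one `addsd` per 3 cycles (Broadwell or earlier, Ryzen) or 4 cycles (Skylake).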
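As a rough illustration of the "multiple accumulators" idea mentioned above, here is a hand-written sketch. `AccumulatorSketch`, `SingleChain`, `TwoChains` and the `iterations` parameter are hypothetical names; this is something a programmer could write by hand, not something the JIT is claimed to generate, and whether it is actually faster depends on the code the JIT produces for it.

```csharp
using System;

// Hypothetical illustration; not code from the question or the answers.
public static class AccumulatorSketch
{
    // One loop-carried dependency chain: every add waits for the previous one.
    public static double SingleChain(double a, double b, int iterations)
    {
        for (var i = iterations; i > 0; --i)
        {
            a += b;
        }
        return a;
    }

    // Two independent chains, so two FP adds can be in flight at once.
    // This reassociates the additions, which can change rounding for values
    // that are not exactly representable; that is why a compiler following
    // strict FP rules will not do it automatically.
    public static double TwoChains(double a, double b, int iterations)
    {
        double acc0 = a, acc1 = 0.0;
        for (var i = iterations / 2; i > 0; --i)
        {
            acc0 += b;
            acc1 += b;
        }
        if ((iterations & 1) != 0)
        {
            acc0 += b; // leftover iteration when the count is odd
        }
        return acc0 + acc1;
    }
}
```

With small integer-valued inputs both methods return exactly the same result; with inputs such as 0.1 the last bits may differ.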