Do x86-64 mov instructions with address calculation, i.e. mov i(r, r, i), r, execute on port 1? Or are they still p0156?

Question

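I am asking whether mov instructions that need to compute an address, i.e. (in AT&T syntax)

mov i(r, r, i), reg

or

mov reg, i(r, reg, i)

have to execute on port 1, because they are effectively a 3-operand LEA plus a MOV, or whether they are free to execute on p0156.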

If they do execute the LEA portion on port 1, is port 1 freed as soon as the address computation completes, or does the entire memory load have to complete first?

On ICL, it seems p7 can handle indexed addressing modes?

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>


#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))


/* TERMS == 3: indexed store (base + index*4); any other value: simple store (base only). */
#define TERMS 3

void BENCH_ATTR
test_store_port() {
    const uint32_t N = (1 << 29);

    uint64_t dst, loop_cnt;
    uint64_t src[16] __attribute__((aligned(64)));

    asm volatile(
        "movl %[N], %k[loop_cnt]\n\t"
        ".p2align 5\n\t"
        "1:\n\t"

        "movl %k[loop_cnt], %k[dst]\n\t"
        "andl $15, %k[dst]\n\t"
#if TERMS == 3
        "movl %k[dst], (%[src], %[dst], 4)\n\t"
#else
        "movl %k[dst], (%[src])\n\t"
#endif


        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
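        /* dst and loop_cnt are written before they are read, so they are
           early-clobber outputs. The asm stores into src, so the array is a
           read-write memory operand rather than an input. */
        : [ dst ] "=&r"(dst), [ loop_cnt ] "=&r"(loop_cnt),
          "+m"(*((uint32_t(*)[16])src))
        : [ N ] "i"(N), [ src ] "r"(src)
        : "cc");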
}

int
main(int argc, char ** argv) {
    test_store_port();
}

Results with #define TERMS 3:

perf stat -e uops_dispatched.port_2_3 -e uops_dispatched.port_7_8 -e uops_issued.any -e cpu-cycles ./bsf_dep

 Performance counter stats for './bsf_dep':

           297,191      uops_dispatched.port_2_3                                    
       537,039,830      uops_dispatched.port_7_8                                    
     2,149,098,661      uops_issued.any                                             
       761,661,276      cpu-cycles                                                  

       0.210463841 seconds time elapsed

       0.210366000 seconds user
       0.000000000 seconds sys

Results with #define TERMS 1:

perf stat -e uops_dispatched.port_2_3 -e uops_dispatched.port_7_8 -e uops_issued.any -e cpu-cycles ./bsf_dep

 Performance counter stats for './bsf_dep':

           291,370      uops_dispatched.port_2_3                                    
       537,040,822      uops_dispatched.port_7_8                                    
     2,148,947,408      uops_issued.any                                             
       761,476,510      cpu-cycles                                                  

       0.202235307 seconds time elapsed

       0.202209000 seconds user
       0.000000000 seconds sys
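
To make the two runs easier to compare, the counters can be normalized by the loop trip count N = 1 << 29 from the test function. The helper below is a minimal sketch of that arithmetic; the names `per_iteration` and `N_ITERS` are made up for illustration and are not part of the benchmark.

```c
#include <assert.h>
#include <stdint.h>

/* Loop trip count used by test_store_port(): N = 1 << 29. */
#define N_ITERS (UINT64_C(1) << 29)

/* Convert a raw perf counter value into an average per loop iteration.
   Hypothetical helper, not part of the original benchmark. */
double per_iteration(uint64_t count, uint64_t iterations)
{
    assert(iterations > 0);
    return (double)count / (double)iterations;
}
```

With the TERMS 3 numbers above, `per_iteration(537039830, N_ITERS)` is about 1.0003 for uops_dispatched.port_7_8, `per_iteration(297191, N_ITERS)` is below 0.001 for uops_dispatched.port_2_3, `per_iteration(2149098661, N_ITERS)` is about 4.003 for uops_issued.any, and `per_iteration(761661276, N_ITERS)` is about 1.42 cycles. The TERMS 1 run gives essentially the same values, so on this machine the indexed store does not appear to move its address generation to p2/p3.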

Answer 1

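All CPUs do address generation for load/store uops on the AGUs in the load or store-address ports, not on the ALU ports. Only LEA uses the ALU execution ports for that shift-and-add math.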
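
The distinction can be seen by using the same addressing mode in both kinds of instruction. The following is a minimal sketch, assuming GCC or Clang on x86-64 (with a plain C fallback elsewhere); the function names `lea_addr` and `load_indexed` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LEA: only computes base + idx*4 + 4 and never touches memory.
   This is the shift-and-add work that runs on an ALU port. */
uintptr_t lea_addr(const uint32_t *base, size_t idx)
{
    uintptr_t addr;
#if defined(__x86_64__) && defined(__LP64__) && defined(__GNUC__)
    __asm__("leaq 4(%[b], %[i], 4), %[out]"
            : [out] "=r"(addr)
            : [b] "r"(base), [i] "r"(idx));
#else
    addr = (uintptr_t)base + idx * 4 + 4;   /* portable fallback */
#endif
    return addr;
}

/* MOV load with the identical addressing mode: a single load uop whose
   address is generated by the AGU of the load port it runs on. */
uint32_t load_indexed(const uint32_t *base, size_t idx, size_t len)
{
    uint32_t val;
    assert(idx + 1 < len);                  /* element idx + 1 must exist */
#if defined(__x86_64__) && defined(__LP64__) && defined(__GNUC__)
    __asm__("movl 4(%[b], %[i], 4), %[out]"
            : [out] "=r"(val)
            : [b] "r"(base), [i] "r"(idx), "m"(base[idx + 1]));
#else
    val = base[idx + 1];                    /* portable fallback */
#endif
    return val;
}
```

For an array `a`, `lea_addr(a, 2)` returns the address of `a[3]`, while `load_indexed(a, 2, 4)` returns the value stored there. The addressing mode is the same in both, but only the LEA needs an ALU port.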

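If complex addressing modes needed port 1, https://uops.info/ and/or https://agner.org/optimize/ would say so in their instruction tables. But they don't: loads need only p23, and stores need only p237 for the store-address uop plus p4 for the store-data uop.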

https://www.realworldtech.com/haswell-cpu/5/ shows the load/AGU execution units on ports 2 and 3, and the store AGU on port 7. That exact layout applies to Haswell through Skylake.

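Actually, it is only p23 for indexed stores: the simple store-address AGU on port 7 (Haswell through Skylake) can handle only reg+constant. That means address generation can become a bottleneck if you use indexed addressing modes in code that could otherwise sustain 2 loads + 1 store per clock.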

(Early Sandybridge-family CPUs, SnB and IvB, would even un-laminate indexed stores, so there was a front-end cost as well.)

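Ice Lake changed that, with 2 dedicated store AGUs on ports 7 and 8. Store-address uops can no longer borrow the load AGUs, so the store AGUs have to be fully featured. https://uops.info/html-tp/ICL/MOV_M32_R32-Measurements.html confirms that stores with indexed addressing modes do run at 2/clock on ICL, so both store AGUs are fully featured. For example:

mov  [r14+r13*1+0x4],r8d

(uops.info doesn't test scale factors > 1, but I'd assume the two store AGUs are identical, in which case both would handle it.)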
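
As a rough way to put numbers on the address-generation bottleneck, the sketch below turns the port assignments stated in this answer into an idealized lower bound on cycles per loop iteration. The names `enum uarch` and `agu_bound_cycles` are hypothetical. The model assumes perfect scheduling across ports and ignores every other limit (front end, caches, ALU ports), so it is arithmetic on port counts rather than a performance prediction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two port layouts discussed in the answer. */
enum uarch { UARCH_HSW_SKL, UARCH_ICL };

static double max2(double a, double b) { return a > b ? a : b; }

/* Idealized lower bound on cycles per iteration, counting only load,
   store-address and store-data ports as described in the answer. */
double agu_bound_cycles(enum uarch u, unsigned loads, unsigned stores,
                        bool indexed_stores)
{
    double bound;

    assert(loads + stores > 0);

    if (u == UARCH_ICL) {
        /* Loads use p2/p3; store-address uops use p7/p8 for any
           addressing mode, so the two groups do not compete. */
        return max2(loads / 2.0, stores / 2.0);
    }

    /* Haswell through Skylake: loads on p2/p3, and a single store-data
       port (p4) limits stores to 1 per clock. */
    bound = max2(loads / 2.0, (double)stores);
    if (indexed_stores) {
        /* Indexed store-address uops cannot use p7, so they share
           p2/p3 with the loads. */
        bound = max2(bound, (loads + stores) / 2.0);
    } else {
        /* Simple store-address uops can also use p7: three AGUs total. */
        bound = max2(bound, (loads + stores) / 3.0);
    }
    return bound;
}
```

Under these assumptions, a loop body with 2 loads and 1 indexed store is bounded at 1.5 cycles on Haswell through Skylake (`agu_bound_cycles(UARCH_HSW_SKL, 2, 1, true)`), versus 1.0 with a simple store address, and 1.0 on Ice Lake regardless of the store's addressing mode.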

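Unfortunately, it will be many years before HSW/SKL stop mattering for tuning: Intel is still selling Skylake-derived microarchitectures, so they will remain a large part of the installed base for desktop software for years to come.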
