Why do aligned and unaligned accesses have the same performance?

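The Intel CPU manual (Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, section 8.1.1) says that "nonaligned data accesses will seriously impact the performance of the processor". I wrote a test to confirm this, but the result is that aligned and nonaligned data accesses have the same performance. Why? Can anyone help? My code is shown below: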

#include <iostream>
#include <stdint.h>
#include <time.h>
#include <chrono>
#include <string.h>
#include <cstdlib> // atoi
using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}
int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage:./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    cout << "offset = " << offset << endl;
    const uint64_t BUFFER_SIZE = 800000000;
    uint8_t* data_ptr = new uint8_t[BUFFER_SIZE];
    if (data_ptr == nullptr) {
        cout << "apply for memory failed" << endl;
        return 0;
    }
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 300;
    cout << "start" << endl;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        for (uint64_t j = offset; j <= BUFFER_SIZE - 8; j+= 8) { // align:offset = 0 nonalign: offset=1-7
            volatile auto tmp = *(uint64_t*)&data_ptr[j]; // read from memory
            //mov rax,QWORD PTR [rbx+rdx*1] // rbx+rdx*1 = 0x7fffc76fe019 
            //mov QWORD PTR [rsp+0x8],rax 
            ++tmp;
            //mov rcx,QWORD PTR [rsp+0x8] 
            //add rcx,0x1 
            //mov QWORD PTR [rsp+0x8],rcx
            *(uint64_t*)&data_ptr[j] = tmp; // write to memory
            //mov QWORD PTR [rbx+rdx*1],rcx
        }
    }
    auto end = get_time_ns();
    cout << "time elapse " << end - start << "ns" << endl;
    return 0;
}

Results:

offset = 0
start
time elapse 32991486013ns
offset = 1
start
time elapse 34089866539ns
offset = 2
start
time elapse 34011790606ns
offset = 3
start
time elapse 34097518021ns
offset = 4
start
time elapse 34166815472ns
offset = 5
start
time elapse 34081477780ns
offset = 6
start
time elapse 34158804869ns
offset = 7
start
time elapse 34163037004ns
x86-64 intel memory-alignment
1 answer (3 votes)

On most modern x86 cores, aligned and misaligned accesses perform the same only if the access does not cross a specific internal boundary.

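The exact size of that internal boundary depends on the core architecture of the CPU in question, but on Intel CPUs from the last decade the relevant boundary is the 64-byte cache line. That is, accesses that fall entirely within one 64-byte cache line perform the same whether or not they are aligned.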

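If a (necessarily misaligned) access crosses a cache line boundary on an Intel chip, however, it pays a penalty of about 2x in both latency and throughput. The bottom-line impact of this penalty depends on the surrounding code; it is often much less than 2x and sometimes close to zero. This modest penalty can become much larger if the access also crosses a 4K page boundary.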

Aligned accesses never cross these boundaries, so they cannot suffer this penalty.

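The broad picture is similar for AMD chips, although on some recent chips the relevant boundary has been smaller than 64 bytes, and the boundary differs for loads and stores.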

I have included more details in the load throughput and store throughput sections of a blog post I wrote.

Testing it

Your test failed to show the effect for several reasons:

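  • The test does not allocate aligned memory, and you cannot reliably cross a cache line by applying an offset to a region whose alignment is unknown.
  • You iterate 8 bytes at a time, so most of the writes (7 out of 8) fall within a single cache line and incur no penalty. That leaves a small signal, detectable only if the rest of the benchmark is very clean.
  • You use a large buffer that does not fit in any level of the cache. The split-line effect is only fairly obvious in L1, or when splitting lines means you bring in twice as many lines (for example, with random access). Since you access every line linearly in either scenario, you are limited by the throughput from DRAM to the core whether or not the accesses split: the split writes have plenty of time to complete while waiting for main memory.
  • You use a local volatile auto tmp and tmp++, which creates a volatile on the stack and a lot of loads and stores to preserve the volatile semantics. These are all aligned, and they wash out the effect you are trying to measure.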
Here is my modification of your test. It operates only within an L1-sized region and advances 64 bytes at a time, so every store will be split (if any are):

#include <iostream>
#include <stdint.h>
#include <time.h>
#include <chrono>
#include <string.h>
#include <cstdlib> // atoi, rand
#include <iomanip>
using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage:./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    const uint64_t BUFFER_SIZE = 10000; // small enough to stay in L1
    alignas(64) uint8_t data_ptr[BUFFER_SIZE]; // cache-line-aligned base
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 1000000;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        uint64_t src = rand();
        for (uint64_t j = offset; j + 64 <= BUFFER_SIZE; j += 64) { // align:offset = 0 nonalign: offset=1-7
            memcpy(data_ptr + j, &src, 8); // one 8-byte store per cache line
        }
    }
    auto end = get_time_ns();
    // Reading a random byte keeps the compiler from discarding the stores.
    cout << "time elapsed " << std::setprecision(2)
         << (end - start) / ((double)LOOP_CNT * BUFFER_SIZE / 64)
         << "ns per write (rand:" << (int)data_ptr[rand() % BUFFER_SIZE] << ")" << endl;
    return 0;
}
Running this for every offset from 0 to 64, I get:

$ g++ test.cpp -O2 && for off in {0..64}; do printf "%2d :" $off && ./a.out $off; done
 0 :time elapsed 0.56ns per write (rand:0)
 1 :time elapsed 0.57ns per write (rand:0)
 2 :time elapsed 0.57ns per write (rand:0)
 3 :time elapsed 0.56ns per write (rand:0)
 4 :time elapsed 0.56ns per write (rand:0)
 5 :time elapsed 0.56ns per write (rand:0)
 6 :time elapsed 0.57ns per write (rand:0)
 7 :time elapsed 0.56ns per write (rand:0)
 8 :time elapsed 0.57ns per write (rand:0)
 9 :time elapsed 0.57ns per write (rand:0)
10 :time elapsed 0.57ns per write (rand:0)
11 :time elapsed 0.56ns per write (rand:0)
12 :time elapsed 0.56ns per write (rand:0)
13 :time elapsed 0.56ns per write (rand:0)
14 :time elapsed 0.56ns per write (rand:0)
15 :time elapsed 0.57ns per write (rand:0)
16 :time elapsed 0.56ns per write (rand:0)
17 :time elapsed 0.56ns per write (rand:0)
18 :time elapsed 0.56ns per write (rand:0)
19 :time elapsed 0.56ns per write (rand:0)
20 :time elapsed 0.56ns per write (rand:0)
21 :time elapsed 0.56ns per write (rand:0)
22 :time elapsed 0.56ns per write (rand:0)
23 :time elapsed 0.56ns per write (rand:0)
24 :time elapsed 0.56ns per write (rand:0)
25 :time elapsed 0.56ns per write (rand:0)
26 :time elapsed 0.56ns per write (rand:0)
27 :time elapsed 0.56ns per write (rand:0)
28 :time elapsed 0.57ns per write (rand:0)
29 :time elapsed 0.56ns per write (rand:0)
30 :time elapsed 0.57ns per write (rand:25)
31 :time elapsed 0.56ns per write (rand:151)
32 :time elapsed 0.56ns per write (rand:123)
33 :time elapsed 0.56ns per write (rand:29)
34 :time elapsed 0.55ns per write (rand:0)
35 :time elapsed 0.56ns per write (rand:0)
36 :time elapsed 0.57ns per write (rand:0)
37 :time elapsed 0.56ns per write (rand:0)
38 :time elapsed 0.56ns per write (rand:0)
39 :time elapsed 0.56ns per write (rand:0)
40 :time elapsed 0.56ns per write (rand:0)
41 :time elapsed 0.56ns per write (rand:0)
42 :time elapsed 0.57ns per write (rand:0)
43 :time elapsed 0.56ns per write (rand:0)
44 :time elapsed 0.56ns per write (rand:0)
45 :time elapsed 0.56ns per write (rand:0)
46 :time elapsed 0.57ns per write (rand:0)
47 :time elapsed 0.57ns per write (rand:0)
48 :time elapsed 0.56ns per write (rand:0)
49 :time elapsed 0.56ns per write (rand:0)
50 :time elapsed 0.57ns per write (rand:0)
51 :time elapsed 0.56ns per write (rand:0)
52 :time elapsed 0.56ns per write (rand:0)
53 :time elapsed 0.56ns per write (rand:0)
54 :time elapsed 0.55ns per write (rand:0)
55 :time elapsed 0.56ns per write (rand:0)
56 :time elapsed 0.56ns per write (rand:0)
57 :time elapsed 1.1ns per write (rand:0)
58 :time elapsed 1.1ns per write (rand:0)
59 :time elapsed 1.1ns per write (rand:0)
60 :time elapsed 1.1ns per write (rand:0)
61 :time elapsed 1.1ns per write (rand:0)
62 :time elapsed 1.1ns per write (rand:0)
63 :time elapsed 1ns per write (rand:0)
64 :time elapsed 0.56ns per write (rand:0)
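Note that offsets 57 through 63 all take about 2x as long per write, and those are exactly the offsets at which an 8-byte write crosses a 64-byte (cache line) boundary.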
