The Intel CPU manual (Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, section 8.1.1) says that "misaligned data accesses can seriously impact the performance of the processor." So I wrote a test to demonstrate this, but the result is that aligned and unaligned data accesses show the same performance. Why? Can anyone help? My code is shown below:
#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <string.h>

using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage:./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    cout << "offset = " << offset << endl;
    const uint64_t BUFFER_SIZE = 800000000;
    uint8_t* data_ptr = new uint8_t[BUFFER_SIZE];
    if (data_ptr == nullptr) {
        cout << "apply for memory failed" << endl;
        return 0;
    }
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 300;
    cout << "start" << endl;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        for (uint64_t j = offset; j <= BUFFER_SIZE - 8; j += 8) { // align: offset = 0, nonalign: offset = 1-7
            volatile auto tmp = *(uint64_t*)&data_ptr[j]; // read from memory
            // mov rax,QWORD PTR [rbx+rdx*1]   // rbx+rdx*1 = 0x7fffc76fe019
            // mov QWORD PTR [rsp+0x8],rax
            ++tmp;
            // mov rcx,QWORD PTR [rsp+0x8]
            // add rcx,0x1
            // mov QWORD PTR [rsp+0x8],rcx
            *(uint64_t*)&data_ptr[j] = tmp; // write to memory
            // mov QWORD PTR [rbx+rdx*1],rcx
        }
    }
    auto end = get_time_ns();
    cout << "time elapse " << end - start << "ns" << endl;
    return 0;
}
Results:
offset = 0
start
time elapse 32991486013ns
offset = 1
start
time elapse 34089866539ns
offset = 2
start
time elapse 34011790606ns
offset = 3
start
time elapse 34097518021ns
offset = 4
start
time elapse 34166815472ns
offset = 5
start
time elapse 34081477780ns
offset = 6
start
time elapse 34158804869ns
offset = 7
start
time elapse 34163037004ns
On most modern x86 cores, aligned and unaligned performance is the same only if the access does not cross a particular internal boundary.
The exact size of that internal boundary varies with the core architecture of the CPU in question, but on Intel CPUs from the last decade the relevant boundary is the 64-byte cache line. That is, accesses that fall entirely within a 64-byte cache line perform the same whether they are aligned or not. If a (necessarily unaligned) access does cross a cache-line boundary on an Intel chip, however, it pays a penalty of roughly 2x in both latency and throughput. The bottom-line impact of that penalty depends on the surrounding code; it is usually much less than 2x and can be close to zero. The modest penalty can become much larger if a 4K page boundary is also crossed.
Aligned accesses never cross these boundaries, so they never pay this penalty. The overall picture is similar on AMD chips, although on some recent chips the relevant boundary is smaller than 64 bytes, and it differs between loads and stores.
I've included more details in the load-throughput and store-throughput sections of a blog post I wrote.
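To make the boundary condition concrete: an access of size bytes starting at address p straddles a cache line exactly when p % 64 + size > 64 (assuming a 64-byte line). Here is a small helper that checks this; it is only my own illustration, not something from the manual or from your test:

#include <iostream>
#include <stdint.h>
using namespace std;

// Assumes a 64-byte cache line, which is what current Intel and AMD cores use.
constexpr uint64_t CACHE_LINE = 64;

// True if an access of `size` bytes at address `addr` touches two cache lines.
static bool splits_cache_line(uint64_t addr, uint64_t size)
{
    return addr % CACHE_LINE + size > CACHE_LINE;
}

int main()
{
    // An 8-byte access at line offset 56 still fits in one line; at offset 57 it splits.
    cout << splits_cache_line(56, 8) << " " << splits_cache_line(57, 8) << endl; // prints "0 1"
    return 0;
}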
As for your test: volatile auto tmp and ++tmp create a volatile variable on the stack and generate a lot of loads and stores just to preserve volatile semantics. Those extra accesses are all aligned, and they drown out the effect you are trying to measure.
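As a side note, one way to keep the compiler from throwing the loads away without dragging a volatile through the stack is to fold every loaded value into a checksum that is printed at the end. This is only a sketch of that idea (it is not the test below; the buffer size and loop count are chosen arbitrarily so the buffer stays cache-resident):

#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <chrono>
using namespace std;

int main(int argc, char** argv)
{
    uint64_t offset = (argc > 1) ? atoi(argv[1]) : 0; // 0 = aligned, 1-7 = misaligned
    const uint64_t BUFFER_SIZE = 10000;               // small enough to stay in cache
    const uint64_t LOOP_CNT = 1000000;
    alignas(64) static uint8_t data[BUFFER_SIZE];
    memset(data, 1, BUFFER_SIZE);

    uint64_t sum = 0;
    auto start = std::chrono::high_resolution_clock::now();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        for (uint64_t j = offset; j + 8 <= BUFFER_SIZE; j += 8) {
            uint64_t tmp;
            memcpy(&tmp, &data[j], 8); // 8-byte load at an arbitrary alignment, without UB
            sum += tmp;                // data dependency keeps the loads alive
        }
    }
    auto end = std::chrono::high_resolution_clock::now();

    // Printing the checksum prevents the whole loop from being optimized out.
    cout << "sum = " << sum << ", "
         << chrono::duration_cast<chrono::nanoseconds>(end - start).count() / (double)LOOP_CNT
         << " ns per pass over the buffer" << endl;
    return 0;
}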
Here is my test, in which every store will be split (if any splitting happens at all). It does the store with memcpy, so the compiler emits a plain 8-byte store to the unaligned address without any undefined behavior:
#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <string.h>
#include <iomanip>

using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage:./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    const uint64_t BUFFER_SIZE = 10000;
    alignas(64) uint8_t data_ptr[BUFFER_SIZE];
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 1000000;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        uint64_t src = rand();
        for (uint64_t j = offset; j + 64 <= BUFFER_SIZE; j += 64) { // one 8-byte store per 64-byte line, at the given offset within the line
            memcpy(data_ptr + j, &src, 8);
        }
    }
    auto end = get_time_ns();
    cout << "time elapsed " << std::setprecision(2) << (end - start) / ((double)LOOP_CNT * BUFFER_SIZE / 64) <<
        "ns per write (rand:" << (int)data_ptr[rand() % BUFFER_SIZE] << ")" << endl;
    return 0;
}
Running this for every offset from 0 to 64, I get:
$ g++ test.cpp -O2 && for off in {0..64}; do printf "%2d :" $off && ./a.out $off; done
0 :time elapsed 0.56ns per write (rand:0)
1 :time elapsed 0.57ns per write (rand:0)
2 :time elapsed 0.57ns per write (rand:0)
3 :time elapsed 0.56ns per write (rand:0)
4 :time elapsed 0.56ns per write (rand:0)
5 :time elapsed 0.56ns per write (rand:0)
6 :time elapsed 0.57ns per write (rand:0)
7 :time elapsed 0.56ns per write (rand:0)
8 :time elapsed 0.57ns per write (rand:0)
9 :time elapsed 0.57ns per write (rand:0)
10 :time elapsed 0.57ns per write (rand:0)
11 :time elapsed 0.56ns per write (rand:0)
12 :time elapsed 0.56ns per write (rand:0)
13 :time elapsed 0.56ns per write (rand:0)
14 :time elapsed 0.56ns per write (rand:0)
15 :time elapsed 0.57ns per write (rand:0)
16 :time elapsed 0.56ns per write (rand:0)
17 :time elapsed 0.56ns per write (rand:0)
18 :time elapsed 0.56ns per write (rand:0)
19 :time elapsed 0.56ns per write (rand:0)
20 :time elapsed 0.56ns per write (rand:0)
21 :time elapsed 0.56ns per write (rand:0)
22 :time elapsed 0.56ns per write (rand:0)
23 :time elapsed 0.56ns per write (rand:0)
24 :time elapsed 0.56ns per write (rand:0)
25 :time elapsed 0.56ns per write (rand:0)
26 :time elapsed 0.56ns per write (rand:0)
27 :time elapsed 0.56ns per write (rand:0)
28 :time elapsed 0.57ns per write (rand:0)
29 :time elapsed 0.56ns per write (rand:0)
30 :time elapsed 0.57ns per write (rand:25)
31 :time elapsed 0.56ns per write (rand:151)
32 :time elapsed 0.56ns per write (rand:123)
33 :time elapsed 0.56ns per write (rand:29)
34 :time elapsed 0.55ns per write (rand:0)
35 :time elapsed 0.56ns per write (rand:0)
36 :time elapsed 0.57ns per write (rand:0)
37 :time elapsed 0.56ns per write (rand:0)
38 :time elapsed 0.56ns per write (rand:0)
39 :time elapsed 0.56ns per write (rand:0)
40 :time elapsed 0.56ns per write (rand:0)
41 :time elapsed 0.56ns per write (rand:0)
42 :time elapsed 0.57ns per write (rand:0)
43 :time elapsed 0.56ns per write (rand:0)
44 :time elapsed 0.56ns per write (rand:0)
45 :time elapsed 0.56ns per write (rand:0)
46 :time elapsed 0.57ns per write (rand:0)
47 :time elapsed 0.57ns per write (rand:0)
48 :time elapsed 0.56ns per write (rand:0)
49 :time elapsed 0.56ns per write (rand:0)
50 :time elapsed 0.57ns per write (rand:0)
51 :time elapsed 0.56ns per write (rand:0)
52 :time elapsed 0.56ns per write (rand:0)
53 :time elapsed 0.56ns per write (rand:0)
54 :time elapsed 0.55ns per write (rand:0)
55 :time elapsed 0.56ns per write (rand:0)
56 :time elapsed 0.56ns per write (rand:0)
57 :time elapsed 1.1ns per write (rand:0)
58 :time elapsed 1.1ns per write (rand:0)
59 :time elapsed 1.1ns per write (rand:0)
60 :time elapsed 1.1ns per write (rand:0)
61 :time elapsed 1.1ns per write (rand:0)
62 :time elapsed 1.1ns per write (rand:0)
63 :time elapsed 1ns per write (rand:0)
64 :time elapsed 0.56ns per write (rand:0)
Note that offsets 57 through 63 all take roughly 2x longer per write, and those are exactly the offsets where an 8-byte write crosses a 64-byte (cache-line) boundary, i.e. where offset % 64 + 8 > 64.
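If you want to double-check which offsets should be affected, a quick enumeration (my own sketch, assuming 64-byte lines and 8-byte stores) reproduces exactly that set:

#include <iostream>
#include <stdint.h>
using namespace std;

int main()
{
    // An 8-byte store at line offset `off` spills into the next 64-byte line when off + 8 > 64.
    for (uint64_t off = 0; off < 64; ++off)
        if (off + 8 > 64)
            cout << off << " "; // prints: 57 58 59 60 61 62 63
    cout << endl;
    return 0;
}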