如何改进大型 uint64 数组的异或运算？

Question

我想要对大型移位数组进行异或，以下是该函数的可移植版本，以便于解释。我如何改进这个计算？我尝试过使用 AVX2 但没有看到太大的改进。目前，对于下面示例中显示的数据库，处理所有内容需要 50 毫秒，即 12 GB/s，我将不胜感激任何改进计算的提示。

#include <iostream>

uint64_t partition_size = 4096;
uint64_t entry_size = 256; // bits
uint64_t DB_size = 16777216;
uint64_t *DB = new uint64_t[DB_size * entry_size/64];
uint64_t *result = new uint64_t[partition_size];

//partition_index will be a random multiple of partition_size, e.g. 0, 8192, 4096 etc
//random_offset will be a random number in [0, partition_size]
void xor_shifted_arrays(uint32_t partition_index, uint32_t random_offset)
{
    auto uint64_per_entry = entry_size / sizeof(uint64_t);

    int shift_offset;
    uint32_t shift;
    
    for (int i = 0; i < partition_size  ; i = i + 1)
    {
        shift = (i + random_offset) & (partition_size - 1);
        shift_offset = shift * uint64_per_entry;
        
        for (int j = 0; j < uint64_per_entry; j=j+1){
            result[shift_offset + j] = result[shift_offset + j] ^ DB[partition_index + j];  
        }
        partition_index = partition_index + uint64_per_entry;
    }
}

更新：这是上帝螺栓：https://godbolt.org/z/j14av3fGq 还在两个实例上运行了此代码。

Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz，运行 MacOS 13.6，16 GB DDR4 RAM，编译器 Apple clang 版本 15.0.0 (clang-1500.0.40.1)

AWS r7i.2xlarge Intel(R) Xeon(R) Platinum 8488C，带 Ubuntu，64 GB DDR5 RAM，编译器 g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

令人惊讶的是，它在 Xeon 上慢了 2 倍！！

我正在使用编译器标志 O3

Answer 1

这些大多是通用技巧，将帮助您的编译器优化代码（其中许多已经在注释中）：

对于数组索引/大小，始终使用
```
size_t
```
（如果您想允许负索引，则使用
```
ptrdiff_t
```
）。这将使编译器免于处理潜在的环绕或溢出行为。
在初始化变量时/在使用它们的范围内声明变量
将所有初始化后永不改变的变量声明为
```
const
```
。如果编译器知道它们的值，它可以用它们优化乘法或除法，特别是当它们是 2 的幂时。
使用
```
__restrict
```
关键字将您修改的数组的指针作为函数参数传递（仅当您确定该数组不与其他数组别名/重叠时才执行此操作）。这通常有助于编译器展开和 SIMD 优化。
使用
```
-march=native
```
（或
```
-march=target_architecture
```
）编译代码，以针对本地或某些特定的 CPU 架构进行优化（当然，还有
```
-O3
```
）。
只是为了便于阅读：不要写
```
A = A + B
```
，而是写
```
A += B
```
（与
```
^
```
/
```
^=
```
等相同）。这通常对编译器来说并不重要。

假设您的 godbolt-link 中的

j=j+4

是一个拼写错误，并且您的意思是

j+=1

（或

++j

），这会产生非常干净的 AVX2 代码：

// assuming these are known at compile-time:
const size_t partition_size = 4096; 
const size_t entry_size = 256; // bits
const size_t DB_size = 16777216;

uint64_t *DB = new uint64_t[DB_size * entry_size/64];
const size_t uint64_per_entry = entry_size / sizeof(uint64_t);
//uint64_t *result = new uint64_t[partition_size]; // pass this as parameter

//partition_index will be a random multiple of partition_size, e.g. 0, 8192, 4096 etc
//random_offset will be a random number in [0, partition_size]
void xor_shifted_arrays(size_t partition_index, size_t random_offset, uint64_t *__restrict result)
{

    for (size_t i = 0; i < partition_size  ; i++)
    {
        const size_t shift = (i + random_offset) & (partition_size - 1);
        const size_t shift_offset = shift * uint64_per_entry;
        
        for (size_t j = 0; j < uint64_per_entry; j++){
            result[shift_offset + j] ^= DB[partition_index + j];    
        }
        partition_index += uint64_per_entry;
    }
}

Godbolt 链接：https://godbolt.org/z/Khs5aTPaK

我没有看到这段代码有更多可能的改进。您需要将每个

result

条目与相应的

DB

条目一起读写一次。如果您可以控制内存管理，则可以确保指针与页面大小对齐（或至少与 SIMD 宽度对齐）

此外，如果（对于不同的问题）某些条目将被多次读取或修改，您可以尝试重新安排循环，以便减少从内存读取或写入变量的频率。

如何改进大型 uint64 数组的异或运算？

问题描述投票：0回答：1

1个回答

最新问题

如何改进大型 uint64 数组的异或运算？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1