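I am trying to convert the following chi-square function from plain C to SSE2 intrinsics.
Both functions produce the correct output. I measured how long each one takes on some randomly generated 4KB data. On average, the SSE2 version came out about 70-90 milliseconds faster.
I am wondering whether I am missing any further optimizations that could improve performance. Any pointers would be helpful.
Plain C code: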
int observed[256] = {0};
double chiSquare = 0.0;
double expected = (double)size / 256; // Constant expected value
// Calculate frequency of each byte value
for (int i = 0; i < size; i++) {
observed[data[i]]++;
}
// Calculate the chi-square statistic
for (int i = 0; i < 256; i++) {
double diff = observed[i] - expected;
chiSquare += (diff * diff) / expected;
}
return chiSquare;
SSE2 intrinsics:
int observed[256] = {0};
const double expected = (double)size / 256; // Make 'expected' a constant
double chiSquare = 0.0;
// Process data in 16-byte (128-bit) chunks
for (int i = 0; i < size; i += 16) {
__m128i dataChunk = _mm_loadu_si128((__m128i*)(data + i));
// Unpack 8-bit values into 16-bit values for counting
__m128i dataUnpacked = _mm_unpacklo_epi8(dataChunk, _mm_setzero_si128());
// Spill 8 widened values to memory, then count them one at a time (scalar)
for (int j = 0; j <= 1; j++) {
uint16_t values[8];
_mm_storeu_si128((__m128i*)values, dataUnpacked);
for (int k = 0; k < 8; k++) {
observed[values[k]]++;
}
dataUnpacked = _mm_unpackhi_epi8(dataChunk, _mm_setzero_si128());
}
}
// Calculate the chi-square statistic using SSE2 intrinsics
__m128d sum = _mm_setzero_pd();
for (int i = 0; i < 256; i += 2) {
__m128d observedVec = _mm_set_pd(observed[i + 1], observed[i]);
__m128d diff = _mm_sub_pd(observedVec, _mm_set1_pd(expected));
__m128d squaredDiff = _mm_mul_pd(diff, diff);
__m128d result = _mm_div_pd(squaredDiff, _mm_set1_pd(expected));
sum = _mm_add_pd(sum, result);
}
// Sum up the results in the sum
double sumArray[2];
_mm_storeu_pd(sumArray, sum);
for (int i = 0; i < 2; i++) {
chiSquare += sumArray[i];
}
return chiSquare;
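On my Westmere i5 laptop, your SSE2 version benchmarks about 25% slower than your scalar function. I made a slight improvement to the scalar function that gains about 20% on 4KB of data on my machine. Also, the SSE2 function does not work for all values of `size` (it assumes a multiple of 16), as I am sure you already know. Anyway, my function is below.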
double getchisquared(int size, uint8_t *data) {
int observed[256] = {0}; // Per-byte-value counts (declaration was missing)
double diff, chiSquare = 0.0;
double expected = (double)size / 256; // Constant expected value
int i, iterations = (size >> 2) << 2;
// Calculate frequency of each byte value
for (i = 0; i < iterations;) {
observed[data[i++]]++;
observed[data[i++]]++;
observed[data[i++]]++;
observed[data[i++]]++;
}
for (i = iterations; i < size; i++) {
observed[data[i]]++;
}
// Calculate the chi-square statistic
for (i = 0; i < 256; i++) {
diff = observed[i] - expected;
chiSquare += diff * diff; // Division by 'expected' is done once, after the loop
}
return chiSquare / expected;
}
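Moving the division out of the loop is valid because `expected` is the same for every bin. Writing O_i for `observed[i]`, E for `expected`, and N for `size` (symbols introduced here only for illustration), the identity is:

```latex
\chi^2 \;=\; \sum_{i=0}^{255} \frac{(O_i - E)^2}{E}
       \;=\; \frac{1}{E} \sum_{i=0}^{255} (O_i - E)^2,
\qquad E = \frac{N}{256}
```

The two forms can differ in the last few bits, because the floating-point rounding happens in a different order.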
As chtz pointed out, I don't think SSE2 offers much hope for optimizing the histogram stage. You might have better luck with AVX2, but I haven't investigated that.
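Since the histogram stage resists SIMD, one scalar technique that is sometimes suggested for byte histograms is to keep several independent count tables and merge them at the end, so that consecutive bytes do not all increment the same array. The sketch below is an illustration under stated assumptions, not something measured in this thread: the function name `chi_squared_multi_table` and the choice of four tables are hypothetical, and whether it helps at all depends on the CPU and on the data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and table count are assumptions, not from the
   posts above): chi-square of byte frequencies using four count tables. */
double chi_squared_multi_table(const uint8_t *data, size_t size) {
    /* Four independent tables: neighbouring bytes update different arrays. */
    uint32_t counts[4][256] = {{0}};
    size_t i = 0;
    size_t blocks = size & ~(size_t)3; /* largest multiple of 4 <= size */

    if (size == 0) {
        return 0.0; /* avoid 0/0 when there is no data */
    }

    /* Main loop: four bytes per iteration, one table per byte position. */
    for (; i < blocks; i += 4) {
        counts[0][data[i + 0]]++;
        counts[1][data[i + 1]]++;
        counts[2][data[i + 2]]++;
        counts[3][data[i + 3]]++;
    }
    /* Tail: remaining 0-3 bytes go into the first table. */
    for (; i < size; i++) {
        counts[0][data[i]]++;
    }

    /* Merge the tables and accumulate squared differences. */
    double expected = (double)size / 256.0;
    double sum = 0.0;
    for (int v = 0; v < 256; v++) {
        uint32_t observed = counts[0][v] + counts[1][v]
                          + counts[2][v] + counts[3][v];
        double diff = (double)observed - expected;
        sum += diff * diff;
    }
    return sum / expected; /* single division, as in the function above */
}
```

The extra tables cost 4KB of stack in this form, which is the same size as the input being tested here, so it is worth benchmarking against the plain unrolled loop before adopting it.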