如何进一步优化矩阵乘法的性能?

问题描述 投票:1回答:2

我正在尝试优化在单核上运行的矩阵乘法代码。我如何进一步改善循环展开FMA / SSE的性能?我也很好奇,如果在内部循环中使用四个而不是两个总和,为什么性能不会提高。

问题大小是1000x1000的矩阵乘法。 gcc 9和icc 19.0.5均可用。 Intel Xeon @ 3.10GHz,32K L1d缓存,Skylake架构。编译为gcc -O3 -mavx

void mmult(double* A, double* B, double* C)
{
    const int block_size = 64 / sizeof(double);

    __m256d sum[2], broadcast;

    for (int i0 = 0; i0 < SIZE_M; i0 += block_size) {
    for (int k0 = 0; k0 < SIZE_N; k0 += block_size) {
    for (int j0 = 0; j0 < SIZE_K; j0 += block_size) {

        int imax = i0 + block_size > SIZE_M ? SIZE_M : i0 + block_size;
        int kmax = k0 + block_size > SIZE_N ? SIZE_N : k0 + block_size;
        int jmax = j0 + block_size > SIZE_K ? SIZE_K : j0 + block_size;

        for (int i1 = i0; i1 < imax; i1++) {

            for (int k1 = k0; k1 < kmax; k1++) {

                broadcast = _mm256_broadcast_sd(A+i1*SIZE_N+k1);
                for (int j1 = j0; j1 < jmax; j1+=8) {
                    sum[0] = _mm256_load_pd(C+i1*SIZE_K+j1+0);
                    sum[0] = _mm256_add_pd(sum[0], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+0)));
                    _mm256_store_pd(C+i1*SIZE_K+j1+0, sum[0]);

                    sum[1] = _mm256_load_pd(C+i1*SIZE_K+j1+4);
                    sum[1] = _mm256_add_pd(sum[1], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+4)));
                    _mm256_store_pd(C+i1*SIZE_K+j1+4, sum[1]);

                    // doesn't improve performance.. why?
                    // sum[2] = _mm256_load_pd(C+i1*SIZE_K+j1+8);
                    // sum[2] = _mm256_add_pd(sum[2], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+8)));
                    // _mm256_store_pd(C+i1*SIZE_K+j1+8, sum[2]);

                    // sum[3] = _mm256_load_pd(C+i1*SIZE_K+j1+12);
                    // sum[3] = _mm256_add_pd(sum[3], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+12)));
                    // _mm256_store_pd(C+i1*SIZE_K+j1+4, sum[3]);
                }
            }
        }
    }
    }
    }
}
c performance matrix-multiplication simd avx
2个回答
3
投票

此代码每个FMA有2个负载(如果发生FMA收缩),但是理论上Skylake每个FMA仅支持至多


1
投票
© www.soinside.com 2019 - 2024. All rights reserved.