标准矩阵乘法算法的运行时间在放大时比预期慢（2^10+ x 2^10+ 元素）

Question

我有一个标准的矩阵乘法算法。

// matrix.h

typedef struct Matrix {
    int rows;
    int columns;
    double* elements;
} Matrix;
double* addressAt(Matrix* matrix, int row, int column);
double* rowAt(Matrix* matrix, int row);
double dotProductColumnArray(Matrix* matrix, const int column, double* array);
void mmProduct(Matrix* first, Matrix* second, Matrix* result);



// matrix.c

// Helper function to find address of elements at row and column.
inline double* addressAt(Matrix* matrix, int row, int column) {
    return &(matrix->elements[row * matrix->columns + column]);
}

// Helper function to find row array
inline double* rowAt(Matrix* matrix, int row) {
    return &(matrix->elements[row * matrix->columns]);
}

// Finds the dot product of a column of a matrix and an array.
double dotProductColumnArray(Matrix* matrix, const int column, double* array) {
    double sum = 0.0;
    const int rows = matrix->rows, columns = matrix->columns;
    int j = column;
    const double* elements = matrix->elements;
    for(int i = 0; i < rows; i++) {
        sum += array[i] * elements[j];
        j += columns;
    }
    return sum;
}

void mmProduct(Matrix* first, Matrix* second, Matrix* result) {
    const int rows = result->rows;
    const int columns = result->columns;
    for(int i = 0; i < rows; i++) {
        double* row = rowAt(first, i);
        for(int j = 0; j < columns; j++) {
            *addressAt(result, i, j) = dotProductColumnArray(second, j, row);
        }
    }
}



// Snippet of main.c

// Fills the matrix with random numbers between -1 and 1
void randomFill(Matrix* matrix);

int main() {
    struct timeval timestamp;
    long start, now;
    Matrix first, second, result;
    
    // 2^10
    first = createMatrix((int) pow(2, 10), (int) pow(2, 10));
    second = createMatrix((int) pow(2, 10), (int) pow(2, 10));
    randomFill(&first);
    randomFill(&second);
    result = createMatrix((int) pow(2, 10), (int) pow(2, 10));

    gettimeofday(&timestamp, NULL);
    start = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    mmProduct(&first, &second, &result);
    gettimeofday(&timestamp, NULL);
    now = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    printf("Time taken: %ldms\n", now - start);
    
    deleteMatrix(&first);
    deleteMatrix(&second);
    deleteMatrix(&result);
    
    // Same code as above but for 2^11, 2^12, replacing pow() when necessary.
    // ...
    // ...
}

完整代码在这里自己运行： https://gist.github.com/kevinMEH/c37bda728bf5ee53c32a405ef611e5a7

当我用两个 2^10 x 2^10 矩阵运行算法时，相乘大约需要 2 秒。

当我用两个 2^11 x 2^11 矩阵运行算法时，我期望算法在 8 * 2 秒 ~ 16 秒内将它们相乘。然而，乘法需要 36 秒。

当我用两个 2^12 x 2^12 矩阵运行算法时，我期望算法在 64 * 2000 毫秒 ~ 128 秒内将它们相乘。最坏的情况，我预计 8 * 36 = 288 秒会相乘。但是，乘法需要 391 秒。

（使用带有 -O1 的 Apple Clang 14.0.0 编译，但带有 -Ofast 并且没有优化，它的缩放比例类似，仍然不是 8 倍，而是更大。在 Apple M1 芯片上运行。）

我很困惑为什么会这样。执行的加法和乘法运算的次数恰好是 2n^3，当我将矩阵的宽度增加 2 倍时，我预计后续的乘法运算每次只需要 2^3 = 8 倍的时间。

嫌疑人：

addressAt 和 rowAt 辅助函数导致速度变慢：这没有意义；加法和乘法的数量按 n^3 缩放，但对 addressAt 的调用仅按 n^2 缩放，而 rowAt 仅按 n 缩放，因此我们实际上应该看到速度相对增加。 (8 * (n^3 + n^2 + n) > 8n^3 + 4n^2 + 2n) 我也手动内联函数，发现没有明显的速度提升，所以这应该不是因为函数调用开销。

次正规数：我的随机数生成器得到一个随机 u64，并将它除以 2 * 0x8..0u，产生一个介于 0 和 1 之间的数字，因此应该没有次正规数。尽管如此，在 randomFill 函数中，我将所有内容乘以 100000，结果仍然相同。

dotProductColumnArray()：我怀疑这是罪魁祸首，但我找不到原因。我最好的猜测是速度变慢是因为 CPU 预取了错误的列元素。当我们按列数递增 j 时，CPU 对此感到困惑并且不会预取

matrix->elements[j + n * columns]

处的数组元素，因为它不希望 j 按列递增（所以它预取 j，j + 1，或其他，但不是 j + 列），但这对我来说没有意义，因为

columns

是常数，所以 CPU 应该知道得更多。

旁注：

我实现了一个执行 Strassen 矩阵乘法的 mmStrassen 算法，当我测试它时，预计每次长 7 倍。下面一起测试！常规的 mmProduct 函数，所以这不是运行时间问题。

也欢迎任何优化建议，谢谢。

提前对这个长问题表示感谢和歉意。

Answer 1

候选人改进：使用

restrict

，

const

.

restrict

允许编译器假设更改

matrix[]

不会影响

array[]

（即 - 它们不重叠）并相应地进行优化。如果没有

restrict

，编译器必须假设更改

matrix[]

可能会更改

array[]

并且需要一次做所有事情，而不是一些并行性。

const

也可能有一些边际效益。

// double dotProductColumnArray(
//     Matrix* matrix, const int column, double* array) {
double dotProductColumnArray(
    Matrix* restrict matrix, const int column, const double* restrict array) {

同样

// void mmProduct(Matrix* first, 
//     Matrix* second, Matrix* result) {
void mmProduct(const Matrix* restrict first, 
    const Matrix* restrict second, Matrix* restrict result) {

标准矩阵乘法算法的运行时间在放大时比预期慢（2^10+ x 2^10+ 元素）

问题描述投票：0回答：1

嫌疑人：

旁注：

1个回答

最新问题

标准矩阵乘法算法的运行时间在放大时比预期慢（2^10+ x 2^10+ 元素）

问题描述 投票：0回答：1

嫌疑人：

旁注：

1个回答

最新问题

问题描述投票：0回答：1