How to parallelize a set of matrix multiplications

Question

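Consider the following operation, where I take 20 x 20 slices of a larger matrix and multiply each of them by another matrix (10 x 20 in the code below):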

import numpy as np

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)

ans_list = []

for i in range(b.shape[1] - 20 + 1):  # one iteration per 20-column window (981 in total)
    ans_list.append(
        np.dot(a, b[:, i:i+20])
    )

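I know that NumPy parallelizes the actual matrix multiplication, but how do I parallelize the outer for loop so that the individual multiplications run concurrently rather than sequentially?

Also, if I wanted to do the same thing on a GPU, how would I go about it? Obviously I would use CuPy instead of NumPy, but how do I submit multiple matrix multiplications to the GPU simultaneously or asynchronously?

P.S. Note that the sliding window above is only an example that generates multiple matmuls. I know that in this particular case one solution (shown below) is NumPy's built-in sliding window functionality, but I am interested in the best way to run an arbitrary set of matmuls in parallel (optionally on a GPU), not just a faster solution for this specific example.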

windows = np.lib.stride_tricks.sliding_window_view(b, (20, 20)).squeeze()
ans_list = np.dot(a, windows)
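
One detail of the snippet above is easy to miss: np.dot and np.matmul order the output axes differently when the second operand is a stack of matrices. The following is a minimal sketch, assuming the same shapes as in the question (the variable names loop_result, dot_result and matmul_result are illustrative only), that compares both calls against the explicit loop.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
a = rng.random((10, 20))
b = rng.random((20, 1000))

# One 20 x 20 window per starting column: shape (981, 20, 20).
windows = sliding_window_view(b, (20, 20)).squeeze(axis=0)

# Reference: the explicit loop over all windows.
loop_result = np.stack([a @ b[:, i:i + 20] for i in range(windows.shape[0])])

# np.dot puts a's row axis first: shape (10, 981, 20).
dot_result = np.dot(a, windows)

# np.matmul broadcasts a over the window axis: shape (981, 10, 20).
matmul_result = np.matmul(a, windows)

assert np.allclose(matmul_result, loop_result)
assert np.allclose(dot_result.transpose(1, 0, 2), loop_result)
```

If the result should be indexed by window first, as the list built by the loop is, np.matmul gives that layout directly, while the np.dot result needs a transpose.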

Answer 1

CPU:

import numpy as np
from numpy.lib.stride_tricks import as_strided


def timer(func):
    def func_wrapper(*args, **kwargs):
        # perf_counter is a monotonic, high-resolution clock meant for timing.
        from time import perf_counter
        time_start = perf_counter()
        result = func(*args, **kwargs)
        time_end = perf_counter()
        time_spend = time_end - time_start
        print('%s cost time: %.3f s' % (func.__name__, time_spend))
        return result

    return func_wrapper


@timer
def parallel_version(a, b):
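    # View every 20-column window of b without copying. Window i starts at
    # column i, so the window axis advances by one element (col_stride).
    row_stride, col_stride = b.strides
    n_windows = b.shape[-1] - 20 + 1
    b_block = as_strided(b, shape=(n_windows, 20, 20), strides=(col_stride, row_stride, col_stride))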
    return np.matmul(a[np.newaxis, :, :], b_block)


@timer
def sequential_version(a, b):
    ans_list = []
    for i in range(b.shape[-1] - 20 + 1):
        ans_list.append(
            np.dot(a, b[:, i:i + 20])
        )
    return np.array(ans_list)


if __name__ == '__main__':
    a = np.random.rand(10, 20)
    b = np.random.rand(20, 1000)
    x = parallel_version(a, b)
    y = sequential_version(a, b)

Output:

parallel_version cost time: 0.000 s
sequential_version cost time: 0.002 s
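
Timings alone do not show that a strided view addresses the right elements, and as_strided performs no bounds checking. The following is a minimal sketch of an equality check, assuming that window i is meant to start at column i; the helper name strided_windows is hypothetical and not part of the answer's code.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def strided_windows(b, width):
    """Return a read-only (n_windows, rows, width) view of b's column windows."""
    row_stride, col_stride = b.strides
    n_windows = b.shape[1] - width + 1
    return as_strided(
        b,
        shape=(n_windows, b.shape[0], width),
        # Axis 0: next window is one column to the right.
        # Axis 1: next row of the window is one row down in b.
        # Axis 2: next column of the window is one element to the right.
        strides=(col_stride, row_stride, col_stride),
        writeable=False,
    )


rng = np.random.default_rng(0)
a = rng.random((10, 20))
b = rng.random((20, 1000))

batched = np.matmul(a, strided_windows(b, 20))
looped = np.array([np.dot(a, b[:, i:i + 20]) for i in range(981)])

assert batched.shape == (981, 10, 20)
assert np.allclose(batched, looped)
```

Reading the strides from b.strides rather than assuming a contiguous layout keeps the view valid for non-contiguous inputs as well.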

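The code is simple; as_strided is the key. It builds a view of every 20 x 20 window of b without copying any data, so a single batched np.matmul call replaces the Python loop.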
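
For an arbitrary set of matmuls rather than sliding windows, two generic patterns are worth sketching. The example below is an illustration under stated assumptions, not a benchmark: all shapes, the variable names and the worker count are made up for the example. When every pair has the same shape, the operands can be stacked and multiplied in one batched call. When shapes differ, a thread pool is one option, which relies on NumPy releasing the GIL inside the underlying BLAS call.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)

# Case 1: every pair has the same shape, so stack them and multiply once.
lefts = rng.random((50, 10, 20))    # 50 left operands, each 10 x 20
rights = rng.random((50, 20, 30))   # 50 right operands, each 20 x 30
batched = np.matmul(lefts, rights)  # shape (50, 10, 30)

# Case 2: shapes differ, so submit each product to a thread pool.
pairs = [(rng.random((m, 20)), rng.random((20, n)))
         for m, n in [(5, 7), (10, 20), (64, 3)]]
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves the order of the input pairs.
    threaded = list(pool.map(lambda pair: pair[0] @ pair[1], pairs))
```

For matrices as small as those in the question, thread start-up and scheduling overhead may well outweigh any gain, and a multithreaded BLAS can be oversubscribed by an extra thread pool, so measuring both variants is advisable. CuPy offers a matmul with the same batching behaviour, so Case 1 should carry over to the GPU, although this sketch does not exercise that.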
