Performance improvement for column-wise computation (row reduction) on a numpy ndarray

Question · 0 votes · 1 answer

I am performing a row reduction on a 3-dimensional ndarray (KxMxN): I take all the values of a column and produce a scalar with a reduce function, so the KxMxN array ultimately becomes a 2-D ndarray of shape KxN. There are more implementation details, which I will explain along the way.

The 3-D ndarray holds floats.
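For concreteness, here is a minimal, self-contained illustration of the shape transformation I mean, with `np.mean` standing in for my actual reduce function:

import numpy as np

a = np.random.rand(3, 5, 4)   # K x M x N
r = a.mean(axis=1)            # reduce each column across the M axis
print(r.shape)                # (3, 4), i.e. K x N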

In the example below, the `njit` + `numpy` version is the best I have been able to get so far. I would like to know whether there is room for further improvement, from any angle.

`cupy` (GPU parallelization), `dask` (CPU parallelization), and `numba` parallelization all failed to beat the code below (my use case is apparently too small to exploit the GPU's power, and I only have an 8 GB GPU). These tools can most likely be used in more sophisticated ways, but I don't know how.
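For reference, a minimal sketch of the kind of cupy version I mean (illustrative only; it assumes the fully vectorized formulation and the `m3d` array constructed below, and the host/device copies alone may dominate at this size):

from math import sqrt
import cupy as cp

def cp_sr(m3d_host, days=252, rf=0.25):
    m = cp.asarray(m3d_host)           # host -> device copy
    mean = m.mean(axis=1) * days - rf  # reduce the row axis on the GPU
    std = m.std(axis=1) * sqrt(days)
    return cp.asnumpy((mean / std).T)  # device -> host copy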

from numba import njit, guvectorize, float64, int64
from math import sqrt
import numba as nb
import numpy as np
import itertools


# Create a 2D ndarray
m = np.random.rand(800,100)
# Reshape it into a list of sub-matrices
mr = m.reshape(16,50,100)

# Create an indices matrix from combinatorics 
# a typical one for me "select 8 from 16", 12870 combinations
# I do have a custom combination generator, but this is not what I wanted to optimise and itertools really has done a decent job already.
x = np.array( list(itertools.combinations(np.arange(16),8)) )

# Now we are going to select 8 sub-matrices from `mr` and reshape them to become one bigger sub-matrix; we do this in list comprehension.
# This is the matrix we are going to reduce.

# Bottleneck 1: This line takes the longest and I'd hope to improve on this line, but I am not sure there's much we could do here.

m3d = np.array([mr[idx_arr].reshape(400,100) for idx_arr in x])
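One hedged idea for this bottleneck: NumPy's fancy indexing accepts the whole index matrix at once, which removes the Python-level loop (the same large temporary is still materialized, so memory traffic may dominate either way):

# mr[x] has shape (12870, 8, 50, 100); merging axes 1 and 2 reproduces m3d
m3d_alt = mr[x].reshape(x.shape[0], -1, mr.shape[-1])
assert m3d_alt.shape == (12870, 400, 100)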

# We create different versions of the same reduce function. 

# Bottleneck 2: The reduce function is another place I'd want to improve on.

# col - column values
# days - trading days in a year
# rf - risk free rate

# njit version with instance function `mean`, `std`, and python `sqrt`
@njit
def nb_sr(col, days, rf):
    mean = (col.mean() * days) - rf
    std = col.std() * sqrt(days)
    return mean / std

# njit version with numpy
@njit
def nb_sr_np(col, days, rf):
    mean = (np.mean(col) * days) - rf
    std = np.std(col) * np.sqrt(days)
    return mean / std

# guvectorize with numpy
@guvectorize([(float64[:],int64,float64,float64[:])], '(n),(),()->()', nopython=True)
def gu_sr_np(col,days,rf,res):
    mean = (np.mean(col) * days) - rf
    std = np.std(col) * np.sqrt(days)
    res[0] = mean / std

# We wrap them such that they can be applied on 2-D matrix with list comprehension.

# Bottleneck 3: I was thinking of vectorizing this wrapper, but the closest I can get is a list comprehension, which isn't really vectorization (see the sketch after the wrappers below).

def nb_sr_wrapper(m2d):
    return [nb_sr(r, 252, .25) for r in m2d.T]

def nb_sr_np_wrapper(m2d):
    return [nb_sr_np(r, 252, .25) for r in m2d.T]

def gu_sr_np_wrapper(m2d):
    return [gu_sr_np(r, 252, .25) for r in m2d.T]
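For comparison, a fully vectorized sketch of the entire reduction (assuming the Sharpe-style formula above; `axis=1` collapses the 400-row dimension in one shot):

def vec_sr(m3d, days=252, rf=0.25):
    mean = m3d.mean(axis=1) * days - rf   # shape (12870, 100)
    std = m3d.std(axis=1) * np.sqrt(days)
    return (mean / std).T                 # (100, 12870), same layout as the benchmarks below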

# Finally! here's our performance benchmarking step.

%timeit np.array( [nb_sr_wrapper(m) for m in m3d.T] )
# output: 4.26 s ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.array( [nb_sr_np_wrapper(m) for m in m3d.T] )
# output: 4.33 s ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.array( [gu_sr_np_wrapper(m) for m in m3d.T] )
# output: 6.06 s ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tags: python performance vectorization numpy-ndarray numba
1 Answer

0 votes

I created a parallel function that takes the 3-D matrix directly. I also added `fastmath=True`, which improves performance. The GPU functions might benefit from a similar treatment.


import itertools
import time
from math import sqrt

import numba as nb
import numpy as np
from numba import njit

DAYS = 252
RF = 0.25


def decoratortimer(decimal):
    def decoratorfunction(f):
        def wrap(*args, **kwargs):
            time1 = time.monotonic()
            result = f(*args, **kwargs)
            time2 = time.monotonic()
            print(
                "{:s} function took {:.{}f} ms".format(
                    f.__name__, ((time2 - time1) * 1000.0), decimal
                )
            )
            return result

        return wrap

    return decoratorfunction


def get_array():
    # Create a 2D ndarray
    m = np.random.rand(800, 100)
    # Reshape it into a list of sub-matrices
    mr = m.reshape(16, 50, 100)

    # Create an indices matrix from combinatorics
    # a typical one for me "select 8 from 16", 12870 combinations
    # I do have a custom combination generator, but this is not what I wanted to optimise and itertools really has done a decent job already.
    x = np.array(list(itertools.combinations(np.arange(16), 8)))

    # Now we are going to select 8 sub-matrices from `mr` and reshape them to become one bigger sub-matrix; we do this in list comprehension.
    # This is the matrix we are going to reduce.

    # Bottleneck 1: This line takes the longest and I'd hope to improve on this line, but I am not sure there's much we could do here.

    m3d = np.array([mr[idx_arr].reshape(400, 100) for idx_arr in x])
    return m3d


@njit(fastmath=True)
def nb_sr(col, days, rf):
    mean = (col.mean() * days) - rf
    std = col.std() * sqrt(days)
    return mean / std

@njit(parallel=True)
def nb_sr_3d(m3d, days, rf):
    # m3d: (n_comb, n_rows, n_cols); m3d.T: (n_cols, n_rows, n_comb)
    col_width = len(m3d.T)                       # n_cols, 100 here
    out = np.zeros(col_width * len(m3d.T[0].T))  # n_cols * n_comb scalars
    for col_idx in nb.prange(len(m3d.T)):        # parallelize over the n_cols columns
        m2d = m3d.T[col_idx]                     # shape (n_rows, n_comb)
        for row_idx, r in enumerate(m2d.T):      # r: one length-n_rows column
            out[col_width * row_idx + col_idx] = nb_sr(r, days, rf)
    return out

@decoratortimer(2)
def apply(m3d, func):
    ret = []
    for m2d in m3d:
        for r in m2d.T:
            ret.append(func(r, DAYS, RF))

    return ret


if __name__ == "__main__":
    m3d = get_array()
    print("m3d:", m3d.shape)
    # Compile all
    resp0 = apply(m3d, nb_sr)
    resp3d = nb_sr_3d(m3d, DAYS, RF).tolist()
    print("BM no parallel")
    resp0 = apply(m3d, nb_sr)
    print("BM with parallel")
    resp3d = nb_sr_3d(m3d, DAYS, RF).tolist()
    assert np.allclose(resp0, resp3d)


On my machine the 3-D version took 665.63 ms, while plain `nb_sr` with `fastmath` took 1021.20 ms. I don't think the array creation can be improved much. The parallel processing is improved considerably by mapping the first parallel loop onto the smallest dimension, so each thread handles a large chunk of work; thread efficiency should be higher that way.
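As a usage note (my addition, based on the `col_width * row_idx + col_idx` layout above), the flat result can be reshaped back to the 2-D shapes used in the question:

res = nb_sr_3d(m3d, DAYS, RF).reshape(-1, m3d.shape[-1])  # (12870, 100): one row per combination
res_q = res.T                                             # (100, 12870): the question's benchmark layout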
