How can I speed up many pairwise Pearson correlation calculations with SciPy?


I have a DataFrame with 1222 rows and 33,000 columns. I need to compute the pairwise correlation coefficients (and the associated p-values) between the first 16,000 columns and the remaining columns. Currently I am using scipy.stats.pearsonr, for example:

import pandas as pd
from scipy.stats import pearsonr


def correlation_analysis(lncRNA_PC_T):
    """Function for correlation analysis"""
    correlations = pd.DataFrame()
    # Loop over every (PC column, lncRNA column) pair and compute Pearson r and its p-value
    for PC in [column for column in lncRNA_PC_T.columns if "_PC" in column]:
        for lncRNA in [
            column for column in lncRNA_PC_T.columns if "_lncRNAs" in column
        ]:
            correlations = correlations.append(
                pd.Series(
                    pearsonr(lncRNA_PC_T[PC], lncRNA_PC_T[lncRNA]),
                    index=["PCC", "p-value"],
                    name=PC + "_" + lncRNA,
                )
            )

    return correlations

The code above does the job; however, with my large DataFrame it takes more than 30 minutes to finish. How can I speed it up?

python correlation scipy.stats

1 Answer

We can speed this up by more than three orders of magnitude on the CPU, and by a few more orders of magnitude using a GPU.

I generated some random data to work with and structured it like your DataFrame. The size of the DataFrame is reduced here so that your code finishes in a reasonable amount of time.

import numpy as np
import pandas as pd

rng = np.random.default_rng(4582345683546)

m = 120
n = 330
nx = 160
ny = n - nx

data = rng.random((m, n))
column_names = ([f'_PC{i}' for i in range(nx)] 
                + [f'_lncRNAs{i}' for i in range(ny)])
lncRNA_PC_T = pd.DataFrame(data, columns=column_names)

The original correlation_analysis takes about 75 seconds to run on a standard Colab instance. The NumPy code below takes about 22 ms, and the check at the end confirms that the resulting arrays of statistics and p-values match the corresponding columns of the DataFrame produced by your code.

import numpy as xp
from scipy.special import betainc

# import cupy as xp
# from cupyx.scipy.special import betainc

def pearsonr2(x, y):
    # Assumes inputs are DataFrames and computation is to be performed
    # pairwise between columns. We convert to arrays and reshape so calculation
    # is performed according to normal broadcasting rules along the last axis.
    x = xp.asarray(x).T[:, xp.newaxis, :]
    y = xp.asarray(y).T
    n = x.shape[-1]

    # Compute Pearson correlation coefficient. We can't use `cov` or `corrcoef`
    # because they want to compute everything pairwise between rows of a
    # stacked x and y.
    xm = x.mean(axis=-1, keepdims=True)
    ym = y.mean(axis=-1, keepdims=True)
    cov = xp.sum((x - xm) * (y - ym), axis=-1)/(n-1)
    sx = xp.std(x, ddof=1, axis=-1)
    sy = xp.std(y, ddof=1, axis=-1)
    rho = cov/(sx * sy)

    # Compute the two-sided p-values. Under the null hypothesis, the sample
    # correlation follows a symmetric beta(n/2 - 1, n/2 - 1) distribution on
    # [-1, 1] (see the documentation of scipy.stats.pearsonr), so the two-sided
    # p-value can be written with the regularized incomplete beta function
    # evaluated at (|rho| + 1)/2.
    ab = n/2 - 1
    x = (abs(rho) + 1)/2
    p = 2*(1-betainc(ab, ab, x))
    return rho, p

df_corr = correlation_analysis(lncRNA_PC_T)
%timeit correlation_analysis(lncRNA_PC_T)
# 1min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

x = lncRNA_PC_T.iloc[:, :nx]
y = lncRNA_PC_T.iloc[:, nx:]
corr2, p2 = pearsonr2(x, y)
%timeit pearsonr2(x, y)
# 21.9 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Check results
np.testing.assert_allclose(corr2.ravel(), df_corr.iloc[:, 0])
np.testing.assert_allclose(p2.ravel(), df_corr.iloc[:, 1])
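
If you want the results in the same layout as your original correlations DataFrame (one row per PC/lncRNA pair with "PCC" and "p-value" columns), you can rebuild it from the two arrays. A minimal sketch, assuming the same column ordering as above (the pc_cols, lnc_cols, and correlations2 names here are just for illustration):

# Rebuild a DataFrame shaped like the original `correlations` output.
pc_cols = [c for c in lncRNA_PC_T.columns if "_PC" in c]
lnc_cols = [c for c in lncRNA_PC_T.columns if "_lncRNAs" in c]
index = [f"{pc}_{lnc}" for pc in pc_cols for lnc in lnc_cols]
correlations2 = pd.DataFrame(
    {"PCC": corr2.ravel(), "p-value": p2.ravel()},
    index=index,
)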

By commenting out the NumPy/SciPy imports and uncommenting the CuPy imports, you can speed this up even further.
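
A rough sketch of the GPU path, assuming CuPy is installed and the imports at the top of pearsonr2 are switched as described (the x_gpu, y_gpu, corr_gpu, and p_gpu names are just for illustration):

import cupy as cp

# Move the data to the GPU; pearsonr2 then runs entirely on the device
# because xp is cupy.
x_gpu = cp.asarray(lncRNA_PC_T.iloc[:, :nx].to_numpy())
y_gpu = cp.asarray(lncRNA_PC_T.iloc[:, nx:].to_numpy())
corr_gpu, p_gpu = pearsonr2(x_gpu, y_gpu)

# Copy the results back to the host and compare against the NumPy results.
np.testing.assert_allclose(cp.asnumpy(corr_gpu), corr2)
np.testing.assert_allclose(cp.asnumpy(p_gpu), p2)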
