我有一个包含 1222 行和 33000 列的数据框,我需要在数据框内的 16000 列与其余列之间运行关联。目前,我正在使用 python 中的
scipy.stats
库,使用 Pearson 相关方法。
这是我正在尝试的功能:
from scipy.stats import pearsonr
def correlation_analysis(lncRNA_PC_T):
"""Function for correlation analysis"""
correlations = pd.DataFrame()
for PC in [column for column in lncRNA_PC_T.columns if "_PC" in column]:
for lncRNA in [
column for column in lncRNA_PC_T.columns if "_lncRNAs" in column
]:
correlations = correlations.append(
pd.Series(
pearsonr(lncRNA_PC_T[PC], lncRNA_PC_T[lncRNA]),
index=["PCC", "p-value"],
name=PC + "_" + lncRNA,
)
)
return correlations
上面的代码正在完成其工作,但是,对于我的大小为
1222 X 33000
的数据框来说,实际上需要 30 多分钟才能完成工作。如果有人能提出提高大数据框架此功能速度的方法,那就太好了。
谢谢
我们可以在 CPU 上将其速度提高三个数量级以上,并使用 GPU 将速度提高几个数量级。
我生成了一些随机数据来使用,并像 DataFrame 一样对其进行结构化。 DataFrame 的大小被减小,以便您的代码在合理的时间内运行。
import numpy as np
import pandas as pd
rng = np.random.default_rng(4582345683546)
m = 120
n = 330
nx = 160
ny = n - nx
data = rng.random((m, n))
column_names = ([f'_PC{i}' for i in range(nx)]
+ [f'_lncRNAs{i}' for i in range(ny)])
lncRNA_PC_T = pd.DataFrame(data, columns=column_names)
原始
correlation_analysis
在常规 Colab 实例上运行需要大约 75 秒。下面的 NumPy 代码大约需要 22 毫秒,最后的测试确认生成的统计数据和 p 值数组与 DataFrame 的列相同。
import numpy as xp
from scipy.special import betainc
# import cupy as xp
# from cupyx.scipy.special import betainc
def pearsonr2(x, y):
# Assumes inputs are DataFrames and computation is to be performed
# pairwise between columns. We convert to arrays and reshape so calculation
# is performed according to normal broadcasting rules along the last axis.
x = xp.asarray(x).T[:, xp.newaxis, :]
y = xp.asarray(y).T
n = x.shape[-1]
# Compute Pearson correlation coefficient. We can't use `cov` or `corrcoef`
# because they want to compute everything pairwise between rows of a
# stacked x and y.
xm = x.mean(axis=-1, keepdims=True)
ym = y.mean(axis=-1, keepdims=True)
cov = xp.sum((x - xm) * (y - ym), axis=-1)/(n-1)
sx = xp.std(x, ddof=1, axis=-1)
sy = xp.std(y, ddof=1, axis=-1)
rho = cov/(sx * sy)
# Compute the two-sided p-values. See documentation of scipy.stats.pearsonr.
ab = n/2 - 1
x = (abs(rho) + 1)/2
p = 2*(1-betainc(ab, ab, x))
return rho, p
df_corr = correlation_analysis(lncRNA_PC_T)
%timeit correlation_analysis(lncRNA_PC_T)
# 1min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
x = lncRNA_PC_T.iloc[:, :nx]
y = lncRNA_PC_T.iloc[:, nx:]
corr2, p2 = pearsonr2(x, y)
%timeit pearsonr2(x, y)
# 21.9 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Check results
np.testing.assert_allclose(corr2.ravel(), df_corr.iloc[:, 0])
np.testing.assert_allclose(p2.ravel(), df_corr.iloc[:, 1])
通过注释掉 NumPy/SciPy 导入并取消注释 CuPy 导入,您可以进一步加快速度。