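I would like to use a custom function with Polars' df.corr, the same way I can with pandas' df.corr. I realize a custom function in Polars will not perform as well as the built-in Spearman and Pearson methods, but there may be a workaround that is still faster than the pandas implementation, especially for large datasets. For example: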
###Pandas
from sklearn.datasets import load_iris
import pandas as pd
from scipy.stats import pearsonr
def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]
df=pd.DataFrame(load_iris()['data'],columns=load_iris()['feature_names'])
r_val=df.corr()
p_val=df.corr(method=pearsonr_pval) #<-----this line works fine
But I would like to do something like this:
###Polars
from sklearn.datasets import load_iris
import polars as pl
from scipy.stats import pearsonr
def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]
df=pl.DataFrame(load_iris()['data'],schema=load_iris()['feature_names'])
r_val=df.corr()
p_val=df.corr(method=pearsonr_pval) #<-----this line does not work because corr does not support custom functions
Update: when you supply a custom function, this is what pandas is doing: https://github.com/pandas-dev/pandas/blob/v1.5.3/pandas/core/frame.py#L10313-L10336
import numpy as np
import polars as pl
from scipy.stats import pearsonr

def pearson_pval(df):
    corrf = lambda x, y: pearsonr(x, y)[1]
    names = pl.DataFrame(df.columns, schema=["name"])
    matrix = df.to_numpy().T  # one row per column of df
    K = len(df.columns)
    correl = np.empty((K, K), dtype=float)
    mask = np.isfinite(matrix)
    for i, ac in enumerate(matrix):
        for j, bc in enumerate(matrix):
            if i > j:
                continue  # lower triangle is filled by symmetry below
            valid = mask[i] & mask[j]
            if i == j:
                c = 1.0
            elif not valid.all():
                # drop rows where either column is non-finite
                c = corrf(ac[valid], bc[valid])
            else:
                c = corrf(ac, bc)
            correl[i, j] = c
            correl[j, i] = c
    return names.hstack(
        pl.DataFrame(correl, schema=df.columns)
    )
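To make the shape of that loop easier to see, here is a minimal, dependency-free sketch of the same idea: each pair of columns is evaluated once, the result is mirrored across the diagonal, the diagonal is fixed to a constant, and rows where either value is non-finite are dropped. The names pairwise_matrix and pearson_r are hypothetical helpers written for this illustration; they are not part of Polars, pandas, or SciPy.

```python
import math
from typing import Callable, Dict, List, Sequence


def pairwise_matrix(
    columns: Dict[str, Sequence[float]],
    func: Callable[[List[float], List[float]], float],
    diagonal: float = 1.0,
) -> Dict[str, List[float]]:
    """Apply func to every pair of columns, computing each pair only once.

    Hypothetical helper for illustration only.
    """
    names = list(columns)
    k = len(names)
    out = [[math.nan] * k for _ in range(k)]
    for i in range(k):
        out[i][i] = diagonal  # fixed value on the diagonal, func is not called
        for j in range(i + 1, k):
            a, b = columns[names[i]], columns[names[j]]
            # Keep only the rows where both values are finite.
            pairs = [
                (x, y) for x, y in zip(a, b)
                if math.isfinite(x) and math.isfinite(y)
            ]
            xs = [x for x, _ in pairs]
            ys = [y for _, y in pairs]
            value = func(xs, ys)
            out[i][j] = value  # upper triangle
            out[j][i] = value  # mirrored into the lower triangle
    # Return a column-oriented mapping: name -> list of values per row.
    return {name: [out[i][j] for i in range(k)] for j, name in enumerate(names)}


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation, standing in for any custom function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, math.nan, 1.0],
}
result = pairwise_matrix(data, pearson_r)
```

With a Polars frame, the input could presumably come from df.to_dict(as_series=False) and the returned mapping could be passed back to pl.DataFrame. This sketch assumes there are no nulls in the data, since math.isfinite does not accept None.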
Old answer:
It looks like Polars' .corr() is hard-coded to call np.corrcoef:
https://github.com/pola-rs/polars/blob/master/py-polars/polars/dataframe/frame.py#L8114
I think you will just need to call pearsonr directly inside your own function, for example:
def pearsonr_pval(df):
    names = pl.DataFrame(df.columns, schema=["name"])
    return names.hstack(
        pl.DataFrame(
            ([pearsonr(x, y)[1] for x in df] for y in df),
            schema = df.columns
        )
    )
>>> pearsonr_pval(df)
shape: (4, 5)
┌───────────────────┬───────────────────┬──────────────────┬───────────────────┬──────────────────┐
│ name ┆ sepal length (cm) ┆ sepal width (cm) ┆ petal length (cm) ┆ petal width (cm) │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════════╪═══════════════════╪══════════════════╪═══════════════════╪══════════════════╡
│ sepal length (cm) ┆ 0.0 ┆ 0.151898 ┆ 1.0387e-47 ┆ 2.3255e-37 │
│ sepal width (cm) ┆ 0.151898 ┆ 0.0 ┆ 4.5133e-8 ┆ 0.000004 │
│ petal length (cm) ┆ 1.0387e-47 ┆ 4.5133e-8 ┆ 0.0 ┆ 4.6750e-86 │
│ petal width (cm) ┆ 2.3255e-37 ┆ 0.000004 ┆ 4.6750e-86 ┆ 0.0 │
└───────────────────┴───────────────────┴──────────────────┴───────────────────┴──────────────────┘
It does look like the pandas version returns 1.0 where this one returns 0.0, though?