Custom correlation function in Polars


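I want to use a custom function with Polars' df.corr, the same way I can with pandas' df.corr. I realize a custom function will not perform as well in Polars as the built-in Pearson and Spearman methods, but there may be a workaround that is still faster than the pandas implementation, especially for large datasets. For example: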

###Pandas

from sklearn.datasets import load_iris
import pandas as pd
from scipy.stats import pearsonr

def pearsonr_pval(x,y):
    return pearsonr(x,y)[1]

df=pd.DataFrame(load_iris()['data'],columns=load_iris()['feature_names'])
r_val=df.corr()
p_val=df.corr(method=pearsonr_pval)    #<-----this line works fine

But I would like to do something like this:

###Polars

from sklearn.datasets import load_iris
import polars as pl
from scipy.stats import pearsonr

def pearsonr_pval(x,y):
    return pearsonr(x,y)[1]

df=pl.DataFrame(load_iris()['data'],schema=load_iris()['feature_names'])
r_val=df.corr()
p_val=df.corr(method=pearsonr_pval)   #<----This line will not work because corr does not support custom functions

Tags: dataframe, performance, correlation, python-polars

1 answer

Update: this is what pandas does when you pass a custom function: https://github.com/pandas-dev/pandas/blob/v1.5.3/pandas/core/frame.py#L10313-L10336

The same logic, adapted to return a Polars DataFrame:

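import numpy as np
import polars as pl
from scipy.stats import pearsonr

def pearson_pval(df):
    # Custom pairwise function: keep only the p-value from pearsonr
    corrf = lambda x, y: pearsonr(x, y)[1]

    names = pl.DataFrame(df.columns, schema=["name"])
    matrix = df.to_numpy().T  # one row per column of df

    K = len(df.columns)
    correl = np.empty((K, K), dtype=float)
    mask = np.isfinite(matrix)  # False where a value is NaN or infinite

    for i, ac in enumerate(matrix):
        for j, bc in enumerate(matrix):
            if i > j:
                continue  # symmetric matrix: the lower half is filled by mirroring
            valid = mask[i] & mask[j]
            if i == j:
                c = 1.0  # pandas hard-codes the diagonal instead of calling corrf
            elif not valid.all():
                c = corrf(ac[valid], bc[valid])  # drop positions with a non-finite value
            else:
                c = corrf(ac, bc)
            correl[i, j] = c
            correl[j, i] = c

    return names.hstack(
        pl.DataFrame(correl, schema=df.columns)
    )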

Old answer:

It looks like Polars' .corr() is hard-coded to call np.corrcoef:

https://github.com/pola-rs/polars/blob/master/py-polars/polars/dataframe/frame.py#L8114

I think you would just have to call pearsonr directly in your own function, for example:

def pearsonr_pval(df):
   names = pl.DataFrame(df.columns, schema=["name"])
   return names.hstack(
      pl.DataFrame(
         ([pearsonr(x, y)[1] for x in df] for y in df),
         schema = df.columns
      )
   )

>>> pearsonr_pval(df)
shape: (4, 5)
┌───────────────────┬───────────────────┬──────────────────┬───────────────────┬──────────────────┐
│ name              ┆ sepal length (cm) ┆ sepal width (cm) ┆ petal length (cm) ┆ petal width (cm) │
│ ---               ┆ ---               ┆ ---              ┆ ---               ┆ ---              │
│ str               ┆ f64               ┆ f64              ┆ f64               ┆ f64              │
╞═══════════════════╪═══════════════════╪══════════════════╪═══════════════════╪══════════════════╡
│ sepal length (cm) ┆ 0.0               ┆ 0.151898         ┆ 1.0387e-47        ┆ 2.3255e-37       │
│ sepal width (cm)  ┆ 0.151898          ┆ 0.0              ┆ 4.5133e-8         ┆ 0.000004         │
│ petal length (cm) ┆ 1.0387e-47        ┆ 4.5133e-8        ┆ 0.0               ┆ 4.6750e-86       │
│ petal width (cm)  ┆ 2.3255e-37        ┆ 0.000004         ┆ 4.6750e-86        ┆ 0.0              │
└───────────────────┴───────────────────┴──────────────────┴───────────────────┴──────────────────┘

Note, though, that the pandas version seems to return 1.0 on the diagonal, where this version returns 0.0.
