Custom corr function in Polars

Question

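I would like to use a custom function with Polars' df.corr, the same way I do with pandas' df.corr. I realize that a custom function in Polars will not perform as well as the built-in Spearman and Pearson methods, but there may be a workaround that is still faster than the pandas implementation, especially for large datasets. For example: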

###Pandas

from sklearn.datasets import load_iris
import pandas as pd
from scipy.stats import pearsonr

def pearsonr_pval(x,y):
    return pearsonr(x,y)[1]

df=pd.DataFrame(load_iris()['data'],columns=load_iris()['feature_names'])
r_val=df.corr()
p_val=df.corr(method=pearsonr_pval)    #<-----this line works fine

But I would like to do something like this:

###Polars

from sklearn.datasets import load_iris
import polars as pl
from scipy.stats import pearsonr

def pearsonr_pval(x,y):
    return pearsonr(x,y)[1]

df=pl.DataFrame(load_iris()['data'],schema=load_iris()['feature_names'])
r_val=df.corr()
p_val=df.corr(method=pearsonr_pval)   #<----This line will not work because corr does not support custom functions

Answer 1

Update: this is what pandas does when you supply a custom function: https://github.com/pandas-dev/pandas/blob/v1.5.3/pandas/core/frame.py#L10313-L10336

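import numpy as np
import polars as pl
from scipy.stats import pearsonr

def pearson_pval(df):
    # Custom pairwise function: p-value of the Pearson correlation
    corrf = lambda x, y: pearsonr(x, y)[1]

    names = pl.DataFrame(df.columns, schema=["name"])
    matrix = df.to_numpy().T  # one row per column of df

    K = len(df.columns)
    correl = np.empty((K, K), dtype=float)
    mask = np.isfinite(matrix)  # False where a value is NaN or infinite

    for i, ac in enumerate(matrix):
        for j, bc in enumerate(matrix):
            if i > j:
                continue  # lower triangle is filled by symmetry below
            valid = mask[i] & mask[j]
            if i == j:
                c = 1.0  # pandas hard-codes the diagonal
            elif not valid.all():
                c = corrf(ac[valid], bc[valid])  # use pairwise-complete rows only
            else:
                c = corrf(ac, bc)
            correl[i, j] = c
            correl[j, i] = c

    return names.hstack(
        pl.DataFrame(correl, schema=df.columns)
    )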

Old answer:

It looks like Polars' .corr() is hard-coded to call np.corrcoef:

https://github.com/pola-rs/polars/blob/master/py-polars/polars/dataframe/frame.py#L8114

I guess you would just have to call pearsonr directly in your custom function, for example:

def pearsonr_pval(df):
   names = pl.DataFrame(df.columns, schema=["name"])
   return names.hstack(
      pl.DataFrame(
         ([pearsonr(x, y)[1] for x in df] for y in df),
         schema = df.columns
      )
   )

>>> pearsonr_pval(df)
shape: (4, 5)
┌───────────────────┬───────────────────┬──────────────────┬───────────────────┬──────────────────┐
│ name              ┆ sepal length (cm) ┆ sepal width (cm) ┆ petal length (cm) ┆ petal width (cm) │
│ ---               ┆ ---               ┆ ---              ┆ ---               ┆ ---              │
│ str               ┆ f64               ┆ f64              ┆ f64               ┆ f64              │
╞═══════════════════╪═══════════════════╪══════════════════╪═══════════════════╪══════════════════╡
│ sepal length (cm) ┆ 0.0               ┆ 0.151898         ┆ 1.0387e-47        ┆ 2.3255e-37       │
│ sepal width (cm)  ┆ 0.151898          ┆ 0.0              ┆ 4.5133e-8         ┆ 0.000004         │
│ petal length (cm) ┆ 1.0387e-47        ┆ 4.5133e-8        ┆ 0.0               ┆ 4.6750e-86       │
│ petal width (cm)  ┆ 2.3255e-37        ┆ 0.000004         ┆ 4.6750e-86        ┆ 0.0              │
└───────────────────┴───────────────────┴──────────────────┴───────────────────┴──────────────────┘

Note, though, that the pandas version appears to return 1.0 on the diagonal, where this version returns 0.0.
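
The difference on the diagonal comes from the pandas source linked above: for i == j it stores 1.0 without calling the custom function, whereas pearsonr(x, x)[1] is the p-value of a perfect correlation, which is 0.0. A minimal sketch of reproducing the pandas convention, using a hypothetical helper pval_matrix and made-up sample arrays:

```python
import numpy as np
from scipy.stats import pearsonr


def pval_matrix(columns, pandas_diagonal=True):
    """Pairwise Pearson p-values for a list of equal-length 1-D arrays."""
    p = np.array([[pearsonr(a, b)[1] for a in columns] for b in columns])
    if pandas_diagonal:
        # Mimic pandas, which stores 1.0 for i == j instead of calling the function
        np.fill_diagonal(p, 1.0)
    return p


# Illustrative data only
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([2.0, 1.0, 5.0, 3.0, 9.0])

raw = pval_matrix([x, y], pandas_diagonal=False)  # diagonal is the self p-value
fixed = pval_matrix([x, y])                       # diagonal overwritten with 1.0
```

Which convention is preferable depends on how the matrix is consumed; the off-diagonal entries are identical either way.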
