Cramér's V in PySpark

Question (votes: 0, answers: 1)

A vanilla Python implementation is available here: Categorical features correlation

What is the best way to implement the same thing in PySpark?

python apache-spark pyspark correlation
1 Answer
0
votes

I plan to do it the following way:

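import numpy as np
import scipy.stats as ss

def cramers_v(df, feature1, feature2):
    # Cross-tabulate on the cluster; the result has one row per distinct value of feature1
    contingency_matrix = df.crosstab(feature1, feature2)
    # crosstab names its first column "<feature1>_<feature2>"; drop those labels to keep only the counts
    contingency_matrix = contingency_matrix.toPandas().drop(feature1+'_'+feature2, axis=1)
    chi2 = ss.chi2_contingency(contingency_matrix)[0]
    n = contingency_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = contingency_matrix.shape
    # Bias-corrected phi^2 and bias-corrected table dimensions
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))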
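
Once `crosstab` has brought the counts back to the driver, the rest of `cramers_v` is plain arithmetic on a small table. To make that part easy to check by hand, here is a minimal sketch of the same bias-corrected formula in pure Python. The function name `cramers_v_from_counts` and the list-of-rows input are my own assumptions for illustration, not part of PySpark or SciPy. It also computes the plain Pearson chi-squared statistic, whereas `ss.chi2_contingency` applies Yates' continuity correction by default on 2x2 tables, so the two can differ slightly in that case.

```python
import math

def cramers_v_from_counts(table):
    """Bias-corrected Cramér's V for a contingency table given as a list of rows of counts.

    Hypothetical helper for illustration; `table` plays the role of the
    collected crosstab counts (label column already dropped).
    """
    r, k = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-squared statistic (no continuity correction)
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in all-zero rows or columns
                chi2 += (table[i][j] - expected) ** 2 / expected

    # Same bias correction as in cramers_v above
    phi2 = chi2 / n
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    denom = min(kcorr - 1, rcorr - 1)
    # A table with a single row or column has no association to measure
    return math.sqrt(phi2corr / denom) if denom > 0 else float("nan")

# Two features that determine each other exactly
perfect = cramers_v_from_counts([[10, 0, 0], [0, 10, 0], [0, 0, 10]])
# Two features with identical distributions in every row
independent = cramers_v_from_counts([[10, 10], [10, 10]])
print(perfect, independent)
```

The perfectly associated table should give a value of about 1 and the independent one a value of 0, which is a quick sanity check to run before pointing the Spark version at real data.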