A vanilla Python implementation is available here: Categorical features correlation
What is the best way to implement the same thing in PySpark?
I am planning to do it like this:
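import numpy as np
import scipy.stats as ss

def cramers_v(df, feature1, feature2):
    # df is a PySpark DataFrame; crosstab returns the pairwise frequency table,
    # whose first column is named "<feature1>_<feature2>" and holds the row labels.
    contingency_matrix = df.crosstab(feature1, feature2)
    # Collect the (small) table to pandas and drop the label column, keeping only counts.
    contingency_matrix = contingency_matrix.toPandas().drop(feature1 + '_' + feature2, axis=1)
    chi2 = ss.chi2_contingency(contingency_matrix)[0]
    n = contingency_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = contingency_matrix.shape
    # Bias-corrected phi^2 and table dimensions.
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))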
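
To sanity-check the formula without a Spark session, the statistics step could be separated from the crosstab step. The sketch below is only an illustration: `cramers_v_from_counts` is a hypothetical helper name, it takes a plain 2-D array of counts, and it computes the chi-squared statistic by hand with NumPy. One assumption to be aware of: unlike `scipy.stats.chi2_contingency` with its default settings, this version applies no Yates continuity correction, so results for 2x2 tables may differ from the function above.

```python
import numpy as np

def cramers_v_from_counts(counts):
    """Bias-corrected Cramer's V from a 2-D array of contingency counts.

    Hypothetical helper for illustration; assumes every row and column
    has at least one observation (otherwise expected counts are zero).
    """
    observed = np.asarray(counts, dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the margins over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    # Pearson chi-squared statistic, without continuity correction.
    chi2 = ((observed - expected) ** 2 / expected).sum()
    phi2 = chi2 / n
    r, k = observed.shape
    # Same bias correction as in the function above.
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))

# A perfectly associated 3x3 table and a perfectly independent 2x2 table.
perfect = cramers_v_from_counts([[10, 0, 0], [0, 10, 0], [0, 0, 10]])
independent = cramers_v_from_counts([[10, 10], [10, 10]])
```

With this split, only the crosstab itself needs to run in Spark; the collected count table (one row per distinct value of the first feature, one column per distinct value of the second) is usually small enough to handle on the driver.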