When n = r


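I recently came across this answer, which provides code for a bias-corrected version of Cramér's V for measuring the association between two categorical variables: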

import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher, 
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()  # grand total for a NumPy array; a DataFrame needs .sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

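However, if the number of samples n equals the number of categories r of the first feature, then rcorr = n - (n-1) = 1, so rcorr - 1 = 0. As long as kcorr - 1 is non-negative, min((kcorr-1), (rcorr-1)) is then 0, which causes a division by zero in np.sqrt(phi2corr / min((kcorr-1), (rcorr-1))). I confirmed this with a simple example: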

import pandas as pd

data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
    ]

df = pd.DataFrame(data) 

confusion_matrix = pd.crosstab(df['name'], df['occupation']) # n = 4 (number of samples), r = 4 (number of unique names), k = 3 (number of unique occupations)
print(cramers_corrected_stat(confusion_matrix))

Output:

/tmp/ipykernel_227998/749514942.py:45: RuntimeWarning: invalid value encountered in scalar divide
  return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
nan
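
The numbers behind this output can be traced by hand. The sketch below recomputes each intermediate quantity for the table above in plain Python, assuming no continuity correction (the table has more than one degree of freedom) and writing the crosstab out as a list of lists; it is only meant to illustrate the arithmetic, not to replace the function.

```python
# Contingency table from the example: rows are names, columns are
# occupations in crosstab order (fisherman, scientist, therapist).
table = [
    [0, 0, 1],  # Alice
    [1, 0, 0],  # Bob
    [0, 1, 0],  # Carol
    [0, 1, 0],  # Doug
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)
r, k = len(table), len(table[0])

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(r)
    for j in range(k)
)

phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
rcorr = r - (r - 1) ** 2 / (n - 1)
kcorr = k - (k - 1) ** 2 / (n - 1)
denominator = min(kcorr - 1, rcorr - 1)

print(chi2, phi2corr, rcorr, denominator)  # 8.0 0.0 1.0 0.0
```

Both the numerator and the denominator are zero here, so with NumPy scalars the final division is 0/0, which is what produces nan together with the "invalid value" warning.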

Is this the intended behavior?

If so, how should I use the corrected Cramér's V when n = k, for example when every sample has a unique value for some feature?

python pandas data-science correlation categorical-data
1 Answer

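You can handle the division by zero when n = r by checking the denominator before dividing and returning 0 when it is not positive. The modified version also computes n with .sum().sum(), so that a pandas DataFrame yields the grand total rather than per-column sums.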

Your original function:

def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher, 
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

becomes:

def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramers V statistic for categorical-categorical association.
       Uses correction from Bergsma and Wicher,
       Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    
    denominator = min((kcorr-1), (rcorr-1))
    if denominator <= 0:
        return 0
    else:
        return np.sqrt(phi2corr / denominator)

With your sample data (n = 4, r = 4, k = 3):

data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
]

df = pd.DataFrame(data)

confusion_matrix = pd.crosstab(df['name'], df['occupation']) 
result = cramers_corrected_stat(confusion_matrix)
print(f"Cramer's V Result: {result}")

you get:

Cramer's V Result: 0
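
Returning 0 in the degenerate case is a convention chosen in the function above. In the sample data the name fully determines the occupation, so some callers may prefer an explicit "undefined" marker instead. A minimal sketch of that variant follows, assuming plain Python lists rather than a DataFrame and a table with no empty rows or columns. The names chi2_statistic, corrected_cramers_v and the degenerate parameter are hypothetical, and the chi-square helper omits the continuity correction that ss.chi2_contingency applies by default to 2x2 tables.

```python
import math


def chi2_statistic(table):
    """Pearson chi-square (no continuity correction) and sample size n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def corrected_cramers_v(table, degenerate=math.nan):
    """Bias-corrected Cramer's V; returns `degenerate` when the
    corrected denominator is not positive (for example when n = r)."""
    chi2, n = chi2_statistic(table)
    r, k = len(table), len(table[0])
    phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    denominator = min(kcorr - 1, rcorr - 1)
    if denominator <= 0:
        return degenerate  # caller decides: NaN, 0.0, ...
    return math.sqrt(phi2corr / denominator)


# The 4x3 name/occupation table from the question (n = r = 4).
degenerate_table = [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 0]]
v_undefined = corrected_cramers_v(degenerate_table)           # NaN marker
v_zero = corrected_cramers_v(degenerate_table, degenerate=0.0)  # same as above answer

# A non-degenerate, perfectly associated 2x2 table for comparison.
v_regular = corrected_cramers_v([[10, 0], [0, 10]])
print(v_undefined, v_zero, v_regular)
```

Passing degenerate=0.0 reproduces the behavior of the modified function, while the NaN default makes the undefined case visible to downstream code.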