Why do stats.chisquare and stats.chi2_contingency give different results even for the same data?


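I have an array called "data_count" that stores the counts for 9 digit groups. Using Benford's law, I generated an array of expected counts called "expected_counts". I want to test whether the two arrays follow the same distribution. I used the stats.chisquare and stats.chi2_contingency functions, but the results are very different. The [scipy guide](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) says they should give the same result. Why doesn't that hold in my case? Please help me, thanks a million.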

res = chi2_contingency(obs, correction=False)
(res.statistic, res.pvalue) == stats.chisquare(obs.ravel(),
                                               f_exp=ex.ravel(),
                                               ddof=obs.size - 1 - dof)

Here is my code:

import numpy as np
from scipy import stats

data_count = [34, 10, 8, 16, 14, 5, 4, 7, 4]
expected_counts = [31, 18, 13, 10, 8, 7, 6, 5, 5]

expected_percentage=[(i/sum(expected_counts))*100 for i in expected_counts]
data_percentage=[(i/sum(data_count))*100 for i in data_count]

# method 1
res1 = stats.chisquare(f_obs=data_percentage, f_exp=expected_percentage)
print(res1.pvalue)


# method 2
combined = np.array([data_count, expected_counts])

res2 = stats.chi2_contingency(combined, correction=False)

print(res2.pvalue)

The output is: 0.04329908403353834 0.45237501133745583

python scipy
1 Answer

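The documentation of chi2_contingency does not say that your code will produce the same statistic and p-value for the two tests. It shows how the tests are related when you start from a contingency table, for example: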

import numpy as np
from scipy import stats
# contingency table
observed = np.array([[10, 10, 20],
                     [20, 20, 20]])
# expected under the null hypothesis of independence
expected = stats.contingency.expected_freq(observed)

# according to the documentation of `chi2_contingency`
dof = observed.size - sum(observed.shape) + observed.ndim - 1
res1 = stats.contingency.chi2_contingency(observed, correction=False)
res2 = stats.chisquare(observed.ravel(), f_exp=expected.ravel(),
                       ddof=observed.size - 1 - dof)

np.testing.assert_allclose(res1.statistic, res2.statistic)
np.testing.assert_allclose(res1.pvalue, res2.pvalue)
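
To see why the question's method 2 behaves differently, it may help to inspect the expected frequencies that chi2_contingency derives on its own. The following is only a sketch: it stacks the two arrays from the question into a 2x9 table, as the question does, and the names table and implied_expected are illustrative. Both rows are then treated as observed samples and the test asks whether they share a common distribution, so the Benford-based counts are never used as the expectation.

```python
import numpy as np
from scipy import stats

data_count = np.array([34, 10, 8, 16, 14, 5, 4, 7, 4])
expected_counts = np.array([31, 18, 13, 10, 8, 7, 6, 5, 5])

# Stacking the arrays, as in the question's method 2, makes both rows "observed".
table = np.array([data_count, expected_counts])

# Under independence, expected frequencies come from the row and column totals.
implied_expected = stats.contingency.expected_freq(table)
print(implied_expected[0])  # pooled estimate, not the Benford-based counts

res = stats.chi2_contingency(table, correction=False)
print(res.dof)     # (2 - 1) * (9 - 1) degrees of freedom
print(res.pvalue)
```

The first row of implied_expected differs from expected_counts, which is one way to see that this call answers a different question from a goodness-of-fit test against fixed expected counts.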

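Your data is not in the form of a contingency table, so you can simply use chisquare on the raw counts. More precisely, you could if the observed and expected counts had the same total; here they sum to 102 and 103, so the expected counts need rescaling first: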


data_count = np.asarray([34, 10, 8, 16, 14, 5, 4, 7, 4])
expected_counts = np.asarray([31, 18, 13, 10, 8, 7, 6, 5, 5])

# the sums of the observed and expected counts must be equal
# assuming that the relative frequencies of your expected counts
# are correct and that it is just not normalized properly:
expected_counts = expected_counts * np.sum(data_count) / np.sum(expected_counts)
# assuming no `ddof` adjustment is needed:
res = stats.chisquare(data_count, expected_counts)
# Power_divergenceResult(statistic=16.255158633621637, pvalue=0.03887025202788101)
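
The question's method 1 passed percentages rather than counts to chisquare. As a rough sketch of why that shifts the p-value: the statistic is proportional to the total of the values passed in, so percentages, which sum to 100, behave like a sample of size 100 instead of the actual 102 observations. The names expected_scaled, res_counts and res_pct are illustrative.

```python
import numpy as np
from scipy import stats

data_count = np.asarray([34, 10, 8, 16, 14, 5, 4, 7, 4])
expected_counts = np.asarray([31, 18, 13, 10, 8, 7, 6, 5, 5])
n = data_count.sum()  # total number of observations

# Counts, with the expected counts rescaled to the observed total.
expected_scaled = expected_counts * n / expected_counts.sum()
res_counts = stats.chisquare(data_count, expected_scaled)

# Percentages, as in the question's method 1: both arrays sum to 100.
data_pct = 100 * data_count / n
expected_pct = 100 * expected_counts / expected_counts.sum()
res_pct = stats.chisquare(data_pct, expected_pct)

# The percentage-based statistic is the count-based one scaled by 100 / n.
print(res_pct.statistic, res_counts.statistic * 100 / n)
print(res_counts.pvalue, res_pct.pvalue)
```

With a sample close to 100 the difference is small, but for much larger or smaller samples the percentage-based p-value can be far from the count-based one, which is why raw counts are the safer input.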
