I've been learning weighted linear regression and comparing the results from R and Python. I noticed that the two differ quite a bit, especially in the F-statistics. There are two test cases; I'll state my questions after presenting both.
Case 1: weighted linear regression with a full-rank matrix
This is a very simple dataset, so the coefficients come out essentially the same no matter which linear-regression algorithm is used.
Python:
import numpy as np
import statsmodels.api as sm

x = np.asarray([18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], dtype=float)
A = sm.add_constant(x)
y = np.asarray([76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5], dtype=float)
wt = np.asarray([0.21, 0.15, 0.11, 0.0, -0.0, 0.11, 0.12, 0.15, 0.1, 0.2, 0.15, 0.4], dtype=float)
model = sm.WLS(y, A, weights=wt).fit()
print(model.summary())
Output:
WLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.992
Model: WLS Adj. R-squared: 0.991
Method: Least Squares F-statistic: 1241.
Date: Fri, 19 May 2023 Prob (F-statistic): 8.07e-12
Time: 11:55:00 Log-Likelihood: -inf
No. Observations: 12 AIC: inf
Df Residuals: 10 BIC: inf
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 64.6927 0.456 141.786 0.000 63.676 65.709
x1 0.6452 0.018 35.223 0.000 0.604 0.686
==============================================================================
Omnibus: 0.164 Durbin-Watson: 2.140
Prob(Omnibus): 0.921 Jarque-Bera (JB): 0.363
Skew: 0.120 Prob(JB): 0.834
Kurtosis: 2.183 Cond. No. 155.
==============================================================================
R:
A <- matrix(c( 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 ), nrow = 12, ncol = 1)
y <- c(76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
wt <- c(0.21, 0.15, 0.11, 0.0, -0.0, 0.11, 0.12, 0.15, 0.1, 0.2, 0.15, 0.4 )
model <- lm(y ~ A, weights = wt)
summary(model)
Output:
Call:
lm(formula = y ~ A, weights = wt)
Weighted Residuals:
Min 1Q Median 3Q Max
-0.140456 -0.087469 0.007881 0.056603 0.166686
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.69272 0.51013 126.8 1.67e-14 ***
A 0.64524 0.02048 31.5 1.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1071 on 8 degrees of freedom
Multiple R-squared: 0.992, Adjusted R-squared: 0.991
F-statistic: 992.5 on 1 and 8 DF, p-value: 1.121e-09
Several things differ: the standard errors and t-values, the residual degrees of freedom (10 in Python vs. 8 in R), and the F-statistic (1241 vs. 992.5). One experiment I tried is shown in the sketch below.
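Since two of the weights are zero, my guess is that R excludes those observations from the degree-of-freedom count while statsmodels does not. A minimal sketch testing that assumption by dropping the zero-weight rows before calling WLS:

import numpy as np
import statsmodels.api as sm

x = np.asarray([18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], dtype=float)
A = sm.add_constant(x)
y = np.asarray([76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5], dtype=float)
wt = np.asarray([0.21, 0.15, 0.11, 0.0, -0.0, 0.11, 0.12, 0.15, 0.1, 0.2, 0.15, 0.4], dtype=float)

keep = wt > 0  # drop the two zero-weight observations
model = sm.WLS(y[keep], A[keep], weights=wt[keep]).fit()
print(model.fvalue)  # ~992.5, matching R's F-statistic if my assumption holds
print(model.bse)     # ~[0.510, 0.0205], matching R's standard errors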
Case 2: weighted linear regression with a rank-deficient matrix
Python:
import numpy as np
import statsmodels.api as sm

x1 = np.asarray([-2, -1, 1, 2, 3, 4, 5], dtype=float)
x2 = np.asarray([4, 1, 1, 4, 9, 16, 25], dtype=float)
x3 = np.asarray([2, 4, 5, 7, 9, 12, 12], dtype=float)
y = np.asarray([2, 3, 4, 5, 6, 7, 8], dtype=float)
A = np.asarray([np.ones(7, dtype=float), x1, x2, x2, x3]).T  # x2 included twice, so A is rank-deficient
wt = np.asarray([0.11, 0.12, 0.0, 0.14, 0.15, 0.16, 0.15], dtype=float)
model = sm.WLS(y, A, weights=wt).fit()
print(model.summary())
Output:
WLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.998
Model: WLS Adj. R-squared: 0.996
Method: Least Squares F-statistic: 532.9
Date: Fri, 19 May 2023 Prob (F-statistic): 0.000138
Time: 11:53:14 Log-Likelihood: -inf
No. Observations: 7 AIC: inf
Df Residuals: 3 BIC: inf
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.0725 0.378 8.125 0.004 1.869 4.276
x1 0.6204 0.111 5.597 0.011 0.268 0.973
x2 0.0141 0.006 2.397 0.096 -0.005 0.033
x3 0.0141 0.006 2.397 0.096 -0.005 0.033
x4 0.0885 0.077 1.152 0.333 -0.156 0.333
==============================================================================
Omnibus: nan Durbin-Watson: 2.656
Prob(Omnibus): nan Jarque-Bera (JB): 0.452
Skew: 0.542 Prob(JB): 0.798
Kurtosis: 2.389 Cond. No. 1.46e+17
==============================================================================
R:
A <- matrix(c( -2, -1, 1, 2, 3, 4, 5,
4, 1, 1, 4, 9, 16, 25,
4, 1, 1, 4, 9, 16, 25,
2, 4, 5, 7, 9, 12, 12), nrow = 7, ncol = 4)
y <- c(2, 3, 4, 5, 6, 7, 8)
wt <- c(0.11, 0.12, 0.0, 0.14, 0.15, 0.16, 0.15)
model <- lm(y ~ A, weights = wt)
summary(model)
Output:
Call:
lm(formula = y ~ A, weights = wt)
Weighted Residuals:
1 2 3 4 5 6 7
-0.040387 0.057390 0.000000 -0.017136 0.005993 -0.027315 0.022026
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.07250 0.46316 6.634 0.0220 *
A1 0.62040 0.13577 4.570 0.0447 *
A2 0.02827 0.01445 1.957 0.1895
A3 NA NA NA NA
A4 0.08849 0.09404 0.941 0.4461
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05695 on 2 degrees of freedom
Multiple R-squared: 0.9981, Adjusted R-squared: 0.9953
F-statistic: 355.2 on 3 and 2 DF, p-value: 0.002808
R appears to fit the linear model with a Householder QR decomposition, while Python uses the pseudoinverse. That is why the coefficients for the A2 and A3 columns differ.
Note again that the F-statistics differ. The pseudoinverse behaviour can be reproduced directly, as in the sketch below.
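A minimal sketch of the pseudoinverse route (my assumption about what statsmodels does internally, after whitening by the square roots of the weights): the minimum-norm solution splits the coefficient of the duplicated column evenly between its two copies, while R's pivoted QR drops one copy outright (A3 = NA) and gives the other the full coefficient.

import numpy as np

x1 = np.asarray([-2, -1, 1, 2, 3, 4, 5], dtype=float)
x2 = np.asarray([4, 1, 1, 4, 9, 16, 25], dtype=float)
x3 = np.asarray([2, 4, 5, 7, 9, 12, 12], dtype=float)
y = np.asarray([2, 3, 4, 5, 6, 7, 8], dtype=float)
A = np.asarray([np.ones(7, dtype=float), x1, x2, x2, x3]).T
wt = np.asarray([0.11, 0.12, 0.0, 0.14, 0.15, 0.16, 0.15], dtype=float)

# whiten: scale each row by sqrt(weight), then take the minimum-norm solution
Aw = np.sqrt(wt)[:, None] * A
yw = np.sqrt(wt) * y
beta = np.linalg.pinv(Aw) @ yw
print(beta)  # the two x2 coefficients come out equal (~0.0141 each),
             # summing to R's single A2 estimate (~0.02827)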
Questions
Question 1. Why do the standard errors and t-values differ in Case 1?
The weighted design matrix A2 (each row of A scaled by the square root of its weight, with the two zero-weight rows dropped) is
array([[ 0.45825757, 8.24863625],
[ 0.38729833, 7.35866836],
[ 0.33166248, 6.63324958],
[ 0.33166248, 7.62823702],
[ 0.34641016, 8.31384388],
[ 0.38729833, 9.68245837],
[ 0.31622777, 8.22192192],
[ 0.4472136 , 12.07476708],
[ 0.38729833, 10.84435337],
[ 0.63245553, 18.34121043]])
np.linalg.inv(A2.T @ A2) is
array([[22.68020619, -0.89869228],
[-0.89869228, 0.03655843]])
and np.diag(np.linalg.inv(A2.T @ A2)) is
array([22.68020619, 0.03655843])
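For reproducibility, this is how I built A2 (a short sketch reusing A and wt from the Case 1 code above):

keep = wt > 0                              # drop the two zero-weight rows
A2 = np.sqrt(wt[keep])[:, None] * A[keep]  # scale each remaining row by sqrt(weight)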
Also, the RSS is 0.091790911146517828.
R's standard errors can be reproduced as follows.
rse = np.sqrt(rss / residualDF)  # residualDF is 8 here
standardError = rse * np.sqrt(np.diag(np.linalg.inv(A2.T @ A2)))
which gives
array([0.51012704, 0.02048088])
However, I don't understand how Python arrives at its numbers. Is it just a different formula, or do the two "standard errors" mean different things? The sketch below shows what I get if I only change the residual degrees of freedom.
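A sketch under the assumption that statsmodels uses the same formula but with a residual DF of 10 (i.e. the zero-weight observations still count toward n):

rse10 = np.sqrt(rss / 10)  # assume df_resid = nobs - rank = 12 - 2 = 10
standardError10 = rse10 * np.sqrt(np.diag(np.linalg.inv(A2.T @ A2)))
print(standardError10)     # ~[0.456, 0.0183], matching the WLS summary above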
Question 2. Why do the F-statistics differ?
As far as I can tell, the degrees of freedom used in the calculation are simply different. Which one is right?
Both R and Python compute RSS = 0.091790911146517828 and ESS = 11.38803026532697.
R appears to use the formula
(ess / DF_ess) / (rss / DF_rss)
with DF_ess = 1 and DF_rss = 8, so the result is 992.5192046214006.
Python, however, seems to use a different formula:
return self.mse_model/self.mse_resid
Here mse_model has the same value as ess (df_model is 1), and mse_resid is computed as self.ssr / self.df_resid.
Here is the catch: df_resid is 10 here, so the computed fvalue is 1240.649. A quick numeric check of both versions is shown below.
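Plain arithmetic with the shared RSS/ESS values from above:

rss, ess = 0.091790911146517828, 11.38803026532697
print((ess / 1) / (rss / 8))   # 992.519...,  R's value (df_rss = 8)
print((ess / 1) / (rss / 10))  # 1240.649..., statsmodels' value (df_resid = 10)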
The Case 2 F-statistics differ for a similar reason.
Which one is right? Or do they refer to the same thing?