Different weighted linear regression results from R and Python — an error, or just different interpretations?


I am learning weighted linear regression and comparing the results from R and Python. I noticed that the two give quite different results, especially for the F-statistics. There are two test cases; I will state my questions after presenting both.

Case 1: weighted linear regression with a full-rank matrix

This is a very simple dataset, so the coefficients come out essentially the same regardless of which linear regression algorithm is used.

Python:

import numpy as np
import statsmodels.api as sm

x = np.asarray((18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29), dtype=float)
A = sm.add_constant(x)
y = np.asarray((76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5), dtype=float)
# note: the 4th and 5th observations carry zero weight
wt = np.asarray((0.21, 0.15, 0.11, 0.0, -0.0, 0.11, 0.12, 0.15, 0.1, 0.2, 0.15, 0.4), dtype=float)
model = sm.WLS(y, A, weights=wt).fit()
print(model.summary())

Output:

                            WLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.992
Model:                            WLS   Adj. R-squared:                  0.991
Method:                 Least Squares   F-statistic:                     1241.
Date:                Fri, 19 May 2023   Prob (F-statistic):           8.07e-12
Time:                        11:55:00   Log-Likelihood:                   -inf
No. Observations:                  12   AIC:                               inf
Df Residuals:                      10   BIC:                               inf
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         64.6927      0.456    141.786      0.000      63.676      65.709
x1             0.6452      0.018     35.223      0.000       0.604       0.686
==============================================================================
Omnibus:                        0.164   Durbin-Watson:                   2.140
Prob(Omnibus):                  0.921   Jarque-Bera (JB):                0.363
Skew:                           0.120   Prob(JB):                        0.834
Kurtosis:                       2.183   Cond. No.                         155.
==============================================================================

R:

A <- matrix(c( 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 ), nrow = 12, ncol = 1)
y <- c(76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
wt <- c(0.21, 0.15, 0.11, 0.0, -0.0, 0.11, 0.12, 0.15, 0.1, 0.2, 0.15, 0.4 )
model <- lm (y ~ A, weights=wt)
summary (model)

Output:

Call:
lm(formula = y ~ A, weights = wt)

Weighted Residuals:
      Min        1Q    Median        3Q       Max 
-0.140456 -0.087469  0.007881  0.056603  0.166686 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 64.69272    0.51013   126.8 1.67e-14 ***
A            0.64524    0.02048    31.5 1.12e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1071 on 8 degrees of freedom
Multiple R-squared:  0.992,     Adjusted R-squared:  0.991 
F-statistic: 992.5 on 1 and 8 DF,  p-value: 1.121e-09

There are a couple of differences:

  1. The standard errors and t-statistics of the individual coefficients are different.
  2. The F-statistics are different.

Case 2: weighted linear regression with a rank-deficient matrix

Python:

import numpy as np
import statsmodels.api as sm

x1 = np.asarray([-2, -1, 1, 2, 3, 4, 5], dtype=float)
x2 = np.asarray([4, 1, 1, 4, 9, 16, 25], dtype=float)
x3 = np.asarray([2, 4, 5, 7, 9, 12, 12], dtype=float)
y = np.asarray([2, 3, 4, 5, 6, 7, 8], dtype=float)
# x2 appears twice on purpose, so the design matrix is rank deficient
A = np.asarray([np.ones(7, dtype=float), x1, x2, x2, x3]).T
# note: the 3rd observation carries zero weight
wt = np.asarray((0.11, 0.12, 0.0, 0.14, 0.15, 0.16, 0.15), dtype=float)
model = sm.WLS(y, A, weights=wt).fit()
print(model.summary())

Output:

                            WLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.998
Model:                            WLS   Adj. R-squared:                  0.996
Method:                 Least Squares   F-statistic:                     532.9
Date:                Fri, 19 May 2023   Prob (F-statistic):           0.000138
Time:                        11:53:14   Log-Likelihood:                   -inf
No. Observations:                   7   AIC:                               inf
Df Residuals:                       3   BIC:                               inf
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0725      0.378      8.125      0.004       1.869       4.276
x1             0.6204      0.111      5.597      0.011       0.268       0.973
x2             0.0141      0.006      2.397      0.096      -0.005       0.033
x3             0.0141      0.006      2.397      0.096      -0.005       0.033
x4             0.0885      0.077      1.152      0.333      -0.156       0.333
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.656
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.452
Skew:                           0.542   Prob(JB):                        0.798
Kurtosis:                       2.389   Cond. No.                     1.46e+17
==============================================================================

R:

# the 2nd and 3rd columns are identical, so the design matrix is rank deficient
A <- matrix(c(-2, -1, 1, 2, 3, 4, 5,
              4, 1, 1, 4, 9, 16, 25,
              4, 1, 1, 4, 9, 16, 25,
              2, 4, 5, 7, 9, 12, 12), nrow = 7, ncol = 4)
y <- c(2, 3, 4, 5, 6, 7, 8)
wt <- c(0.11, 0.12, 0.0, 0.14, 0.15, 0.16, 0.15)
model <- lm(y ~ A, weights = wt)
summary(model)

Output:

Call:
lm(formula = y ~ A, weights = wt)

Weighted Residuals:
        1         2         3         4         5         6         7 
-0.040387  0.057390  0.000000 -0.017136  0.005993 -0.027315  0.022026 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  3.07250    0.46316   6.634   0.0220 *
A1           0.62040    0.13577   4.570   0.0447 *
A2           0.02827    0.01445   1.957   0.1895  
A3                NA         NA      NA       NA  
A4           0.08849    0.09404   0.941   0.4461  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.05695 on 2 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9953 
F-statistic: 355.2 on 3 and 2 DF,  p-value: 0.002808

R appears to use a Householder QR decomposition (with pivoting) to fit the linear model, while Python uses the pseudoinverse. That, I think, is why the coefficients for A2 and A3 differ: R drops the redundant column and reports NA, whereas the pseudoinverse returns the minimum-norm solution and splits the coefficient evenly across the two identical columns.
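A minimal check of that reading with plain NumPy, under my assumption that WLS whitening amounts to scaling the rows by sqrt(wt) and that statsmodels solves the whitened system with the pseudoinverse:

import numpy as np

# Case 2 data from above
x1 = np.array([-2, -1, 1, 2, 3, 4, 5], dtype=float)
x2 = np.array([4, 1, 1, 4, 9, 16, 25], dtype=float)
x3 = np.array([2, 4, 5, 7, 9, 12, 12], dtype=float)
y = np.array([2, 3, 4, 5, 6, 7, 8], dtype=float)
wt = np.array([0.11, 0.12, 0.0, 0.14, 0.15, 0.16, 0.15])

A = np.column_stack([np.ones(7), x1, x2, x2, x3])  # x2 duplicated -> rank deficient
sw = np.sqrt(wt)
Aw, yw = A * sw[:, None], y * sw                   # row-scaled ("whitened") problem

# minimum-norm least-squares solution via the pseudoinverse:
# the x2 coefficient is split evenly over the two identical columns
beta_pinv = np.linalg.pinv(Aw) @ yw
print(beta_pinv)   # ~ [3.0725, 0.6204, 0.0141, 0.0141, 0.0885], matching the WLS summary

# dropping the redundant column first, roughly what R's pivoted QR does by reporting NA
beta_drop = np.linalg.lstsq(np.delete(Aw, 3, axis=1), yw, rcond=None)[0]
print(beta_drop)   # ~ [3.0725, 0.6204, 0.0283, 0.0885], matching R's estimates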

Note again that the F-statistics are different.

Questions

Question 1: Why are the standard errors and t-values different in Case 1?

The weighted design matrix A2 (the rows of A scaled by sqrt(wt), with the two zero-weight rows dropped) is

array([[ 0.45825757,  8.24863625],
       [ 0.38729833,  7.35866836],
       [ 0.33166248,  6.63324958],
       [ 0.33166248,  7.62823702],
       [ 0.34641016,  8.31384388],
       [ 0.38729833,  9.68245837],
       [ 0.31622777,  8.22192192],
       [ 0.4472136 , 12.07476708],
       [ 0.38729833, 10.84435337],
       [ 0.63245553, 18.34121043]])

np.linalg.inv (A2.T @ A2) is

array([[22.68020619, -0.89869228],
       [-0.89869228,  0.03655843]])

and np.diag (np.linalg.inv (A2.T @ A2)) is

array([22.68020619,  0.03655843])

In addition, the RSS is 0.091790911146517828.

R's standard error computation appears to be the following:

rse = np.sqrt(rss / residualDF)        # residualDF is 8 here
standardError = rse * np.sqrt(np.diag(np.linalg.inv(A2.T @ A2)))

So the result is

array([0.51012704, 0.02048088])

However, I do not understand how Python arrives at its values. Is it simply a different formula, or do the two standard errors refer to the same thing?
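The only way I can reproduce Python's numbers is to keep the same formula but plug in the residual DF that statsmodels reports (10, i.e. assuming the zero-weight observations are still counted). This is only my guess, not something I found documented:

import numpy as np

# quantities from Case 1 above
rss = 0.091790911146517828
diag_inv = np.array([22.68020619, 0.03655843])   # np.diag(np.linalg.inv(A2.T @ A2))

se_r  = np.sqrt(rss / 8)  * np.sqrt(diag_inv)    # residual DF = 8: zero-weight rows excluded
se_py = np.sqrt(rss / 10) * np.sqrt(diag_inv)    # residual DF = 10: all 12 observations counted

print(se_r)    # ~ [0.5101, 0.0205] -> R's "Std. Error" column
print(se_py)   # ~ [0.4563, 0.0183] -> statsmodels' "std err" column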

Question 2: Why are the F-statistics different?

It looks to me as if the degrees of freedom used in the calculation are simply different. Which one is right?

Both R and Python compute the RSS as 0.091790911146517828 and the ESS as 11.38803026532697.

R appears to use the following formula:

    (ess / DF_ess) / (rss / DF_rss)

Here DF_ess = 1 and DF_rss = 8, so the result is 992.5192046214006.

Python, however, appears to use a different formula:

    return self.mse_model/self.mse_resid

Here mse_model has the same value as the ESS (since df_model is 1), and mse_resid is computed as self.ssr / self.df_resid.

Here is the issue: df_resid is 10 in Python, so the computed fvalue is 1240.649.
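A quick numeric check of both readings (the only moving part being the residual DF):

# quantities from Case 1 above
ess = 11.38803026532697
rss = 0.091790911146517828
df_model = 1

f_r  = (ess / df_model) / (rss / 8)    # residual DF = 8  -> ~992.52  (R)
f_py = (ess / df_model) / (rss / 10)   # residual DF = 10 -> ~1240.65 (statsmodels)
print(f_r, f_py)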

The Case 2 F-statistic differs for a similar reason.

Which one is right? Or do they actually refer to the same thing?

Tags: python, r, statsmodels