I want to predict one person's exam score from the exam scores of N other people. For some reason, the OLSResults.mse_model call does not work as I expect.
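I know the coefficients, the intercept, and the predictions are all correct, but this one call returns a very strange number, and I am not sure where it comes from.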
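Here is an MVE of my code. I hardcoded the data and extended the 4 rows I wrote so that there are at least 8 rows (otherwise statsmodels complains when the sample has fewer than 8 observations).
It uses 5 people as predictors, with "PersonX" as the dependent variable.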
import numpy as np
import pandas as pd
import statsmodels.api as sm
rows = [
{"Person1":79, "Person2":95, "Person3":34,"Person4":46,"Person5":10,"PersonX":50},
{"Person1":65, "Person2":88, "Person3":45,"Person4":24,"Person5":32,"PersonX":51},
{"Person1":87, "Person2":91, "Person3":23,"Person4":35,"Person5":10,"PersonX":78},
{"Person1":67, "Person2":101,"Person3":34,"Person4":55,"Person5":15,"PersonX":88},
]
# Too lazy to type out four more rows, so append a copy of each row with every value doubled
rows += [{k:v*2 for k,v in r.items()} for r in rows]
exams = pd.DataFrame.from_records(rows)
Y = np.array(exams['PersonX'])
X = exams[[c for c in exams.columns if c != "PersonX"]]
X = sm.add_constant(X)
model = sm.OLS(Y,X)
results = model.fit()
y_pred = np.array(results.predict(X).round())
print(f"Y-Pred: {y_pred}")
print(f"Y-True: {Y}")
print(f"Mean squared error: {results.mse_model}")
This prints:
Y-Pred: [ 50. 51. 78. 88. 100. 102. 156. 176.]
Y-True: [ 50 51 78 88 100 102 156 176]
Mean squared error: 3611.21875
Why is the mean squared error so high? It should be essentially zero (up to some rounding error)!
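As a hand check on the printed numbers, here is a minimal sketch in plain Python. It computes the quantity I expected (the mean of the squared residuals) and, for comparison, the explained sum of squares divided by the model degrees of freedom, which is how the statsmodels documentation describes mse_model. The value df_model = 4 is an assumption: the last four rows are doubles of the first four, so the five predictors plus the constant should have rank 5, leaving 4 model degrees of freedom. The variable names are only illustrative.

```python
# Values copied from the printed output above.
y_true = [50, 51, 78, 88, 100, 102, 156, 176]
y_pred = [50.0, 51.0, 78.0, 88.0, 100.0, 102.0, 156.0, 176.0]
n = len(y_true)

# What I expected: the mean of the squared residuals.
mse_from_residuals = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Explained sum of squares: squared distance of each prediction from the mean of y.
y_mean = sum(y_true) / n
explained_ss = sum((p - y_mean) ** 2 for p in y_pred)

# ASSUMPTION: 4 model degrees of freedom (design matrix rank 5, minus the constant).
df_model = 4
ess_over_df = explained_ss / df_model

print(f"Mean of squared residuals: {mse_from_residuals}")
print(f"Explained SS / df_model:   {ess_over_df}")
```

Under that assumption the second number comes out to 3611.21875, the same value that was printed, while the first is 0.0. If that reading is right, the residual-based figure would be results.mse_resid (sum of squared residuals divided by the residual degrees of freedom) rather than results.mse_model.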
If you run the same code but delete everything below the line that defines exams and switch to the sklearn equivalent, you get:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
X = exams[[c for c in exams.columns if c != "PersonX"]]
Y = np.array(exams['PersonX'])
reg = linear_model.LinearRegression(fit_intercept=True)
reg.fit(X, Y)
y_pred = reg.predict(X)
This produces the same "perfect" y_pred as the statsmodels package, with predictions that are exactly right. So what gives?
For reference, here are the coefficients from each package (roughly the same):
sklearn:
Coefficients: [-0.87064829 2.43878556 -4.15147725 0.08599284 2.42911432]
Intercept: -8.526512829121202e-14
sm:
Coefficients: [-0.870648294 2.43878556 -4.15147725 0.0859928361 2.42911432]
Intercept: -1.14463994e-13
The intercepts differ slightly, although both are effectively zero.
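To put "roughly the same" on firmer ground, here is a small sketch that compares the two sets of printed coefficients with a tolerance. The tolerances (rel_tol=1e-6 and 1e-12) are arbitrary choices for illustration, not anything either library prescribes.

```python
import math

# Coefficients and intercepts copied from the output listed above.
sklearn_coefs = [-0.87064829, 2.43878556, -4.15147725, 0.08599284, 2.42911432]
sm_coefs = [-0.870648294, 2.43878556, -4.15147725, 0.0859928361, 2.42911432]
sklearn_intercept = -8.526512829121202e-14
sm_intercept = -1.14463994e-13

# The coefficients agree to within a small relative tolerance (arbitrary choice).
coefs_match = all(
    math.isclose(a, b, rel_tol=1e-6) for a, b in zip(sklearn_coefs, sm_coefs)
)

# Both intercepts sit at floating-point noise level around zero.
intercepts_near_zero = abs(sklearn_intercept) < 1e-12 and abs(sm_intercept) < 1e-12

print(f"Coefficients match: {coefs_match}")
print(f"Intercepts effectively zero: {intercepts_near_zero}")
```

Both checks pass on the values above, so the difference in intercepts looks like floating-point noise rather than a real disagreement between the two fits.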