Why does quartic linear regression in statsmodels OLS not match LibreOffice Calc?

Question

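I am using statsmodels' OLS linear regression with the Patsy quartic formula y ~ x + I(x**2) + I(x**3) + I(x**4), but the resulting regression fits the data poorly compared to LibreOffice Calc. Why does it not match what LibreOffice Calc produces?

statsmodels code: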

import io
import numpy
import pandas
import matplotlib
import matplotlib.pyplot  # used below; importing matplotlib alone does not load it
import matplotlib.offsetbox
import statsmodels.tools
import statsmodels.formula.api

csv_data = """Year,CrudeRate
1999,197.0
2000,196.5
2001,194.3
2002,193.7
2003,192.0
2004,189.2
2005,189.3
2006,187.6
2007,186.9
2008,186.0
2009,185.0
2010,186.2
2011,185.1
2012,185.6
2013,185.0
2014,185.6
2015,185.4
2016,185.1
2017,183.9
"""

df = pandas.read_csv(io.StringIO(csv_data))

cause = "Malignant neoplasms"
x = df["Year"].values
y = df["CrudeRate"].values

olsdata = {"x": x, "y": y}
formula = "y ~ x + I(x**2) + I(x**3) + I(x**4)"
model = statsmodels.formula.api.ols(formula, olsdata).fit()

print(model.params)

df.plot("Year", "CrudeRate", kind="scatter", grid=True, title="Deaths from {}".format(cause))

func = numpy.poly1d(model.params.values[::-1])
matplotlib.pyplot.plot(df["Year"], func(df["Year"]))

matplotlib.pyplot.show()

This produces the following coefficients:

Intercept    9.091650e-08
x            9.127904e-05
I(x ** 2)    6.109623e-02
I(x ** 3)   -6.059164e-05
I(x ** 4)    1.503399e-08

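and the following plot:

However, if I bring the data into LibreOffice Calc, click on the chart and select "Insert Trend Line...", select "Polynomial", enter "Degree" = 4, and select "Show Equation", the resulting trend line differs from the statsmodels one and appears to fit the data more closely:

The coefficients are: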

Intercept = 1.35e10
x =          2.69e7
x^2 =       -2.01e4
x^3 =          6.69
x^4 =      -0.83e-3

statsmodels version:

$ pip3 list | grep statsmodels
statsmodels                  0.9.0

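Edit: The cubic fit does not match either, but the quadratic does.

Edit: Scaling down Year by subtracting 1998 (and doing the same in LibreOffice) makes the results match: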

df = pandas.read_csv(io.StringIO(csv_data))
df["Year"] = df["Year"] - 1998

Coefficients and plot after scaling down:

Intercept    197.762384
x             -0.311548
I(x ** 2)     -0.315944
I(x ** 3)      0.031304
I(x ** 4)     -0.000833
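
One way to see why subtracting 1998 helps is to compare the condition number of the quartic design matrix before and after the shift. The sketch below uses only NumPy and is not taken from the original post; `quartic_design` is a hypothetical helper name. A very large condition number means the columns 1, x, ..., x**4 are nearly linearly dependent in floating point. As far as I know, statsmodels' OLS solves via a pseudoinverse by default, which would drop near-zero singular values in that situation and so return a different solution, but treat that as an assumption to verify against the statsmodels documentation.

```python
import numpy

# Years from the question's data set.
years = numpy.arange(1999, 2018, dtype=float)


def quartic_design(x):
    """Hypothetical helper: design matrix with columns 1, x, x**2, x**3, x**4."""
    return numpy.vander(x, 5, increasing=True)


# Condition number of the design matrix with raw calendar years.
cond_raw = numpy.linalg.cond(quartic_design(years))

# Condition number after shifting so that Year runs from 1 to 19.
cond_shifted = numpy.linalg.cond(quartic_design(years - 1998))

print("raw years:    {:.3g}".format(cond_raw))
print("years - 1998: {:.3g}".format(cond_shifted))
```

The raw-year matrix should come out many orders of magnitude worse conditioned than the shifted one, which is consistent with the quadratic fit still matching while the cubic and quartic fits do not.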

Answer 1

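Based on @Josef's comment, the problem is that large numbers do not work well with high-order polynomials, and statsmodels does not automatically rescale the domain. Also, I did not mention this in the original question because I did not expect the domain to need transforming, but I also need to predict out-of-sample values based on the year, so I made that the end of the range: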

predict_x = +5
min_scaled_domain = -1
max_scaled_domain = +1
df["Year"] = df["Year"].transform(lambda x: numpy.interp(x, (x.min(), x.max() + predict_x), (min_scaled_domain, max_scaled_domain)))

This transformation produces a well-fitting regression:

If the same domain transformation is applied in LibreOffice Calc, the coefficients match.
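
The domain transformation above can be looked at in isolation. The following is a minimal sketch, assuming the same 1999 to 2017 range and the same `predict_x`, `min_scaled_domain`, and `max_scaled_domain` values as in the answer; `scale_year` is a hypothetical helper name introduced only for illustration. It applies `numpy.interp` directly to an array rather than going through `Series.transform`.

```python
import numpy

years = numpy.arange(1999, 2018)
predict_x = 5
min_scaled_domain, max_scaled_domain = -1, 1


def scale_year(year):
    """Hypothetical helper: map a calendar year into the scaled domain.

    The first year maps to -1 and (last year + predict_x) maps to +1,
    so in-sample years occupy only part of [-1, 1].
    """
    return numpy.interp(
        year,
        (years.min(), years.max() + predict_x),
        (min_scaled_domain, max_scaled_domain),
    )


scaled = scale_year(years)
print(scaled[0], scaled[-1], scale_year(2022))
```

Because the upper end of the source range is 2017 + 5 = 2022, evaluating the fitted polynomial at `max_scaled_domain` corresponds to a prediction for 2022, and any other year can be converted with the same mapping before evaluation.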

Finally, print the predicted value:

func = numpy.polynomial.Polynomial(model.params)
print(func(max_scaled_domain))
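
As an alternative to rescaling by hand, and not what the answer above used, NumPy's `Polynomial.fit` maps the x range onto [-1, 1] internally before solving the least-squares problem. The sketch below assumes the same data as the question and uses plain NumPy instead of statsmodels, so it gives coefficients and predictions only, without the statsmodels regression summary.

```python
import numpy
from numpy.polynomial import Polynomial

years = numpy.arange(1999, 2018, dtype=float)
rates = numpy.array([
    197.0, 196.5, 194.3, 193.7, 192.0, 189.2, 189.3, 187.6, 186.9, 186.0,
    185.0, 186.2, 185.1, 185.6, 185.0, 185.6, 185.4, 185.1, 183.9,
])

# Polynomial.fit maps [years.min(), years.max()] onto [-1, 1] before fitting,
# so the raw calendar years can be passed in directly.
fit = Polynomial.fit(years, rates, deg=4)

# The fitted object is evaluated in calendar years, including out of sample.
print(fit(2018.0))

# Coefficients expressed in terms of (Year - 1998), lowest degree first,
# for comparison with the scaled-down coefficients in the question.
shifted_fit = Polynomial.fit(years - 1998, rates, deg=4)
print(shifted_fit.convert().coef)
```

Note that `fit.coef` itself refers to the internal [-1, 1] variable; `convert()` is needed to express the coefficients in the units of the x values that were passed in.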
