I am doing some linear regression with the Boston housing data.
How can I remove the high-residual and high-leverage points so that I can re-run the linear regression model and re-draw the influence plot and the Q-Q plot?
Input:
m = ols('PRICE ~ CRIM + RM + PTRATIO', bos).fit()
print(m.summary())
Truncated output:
              coef    std err          t      P>|t|      [0.025      0.975]
Intercept  -3.3066      4.038     -0.819      0.413     -11.240       4.627
CRIM       -0.2021      0.032     -6.301      0.000      -0.265      -0.139
RM          7.3816      0.402     18.360      0.000       6.592       8.171
PTRATIO    -1.0742      0.133     -8.081      0.000      -1.335      -0.813
Influence plot: (figure)
Q-Q plot: (figure)
For studentized residuals, a common rule of thumb flags an observation as a possible outlier when its absolute studentized residual exceeds 3:

Studentized residual for any observation > |3|

You can find these observations easily with the statsmodels library. The code below finds every observation whose |studentized residual| > 3:
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.outliers_influence import OLSInfluence as olsi
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

lrmodel = OLS(y_train, x_train)
results = lrmodel.fit()

influence = olsi(results)
studentized_residuals = influence.resid_studentized
# indices (not values) of observations breaking the |studentized residual| > 3 rule
outlier_indices = [i for i, r in enumerate(studentized_residuals) if abs(r) > 3]
leverage_pts = influence.hat_matrix_diag  # array of leverage (hat) values

# scatterplot (not residplot, which would fit another regression) of leverage vs. residuals
sb.scatterplot(x=studentized_residuals, y=leverage_pts, color='brown')
plt.show()
a.) Now we finally have the indices of the observations with |studentized residual| > 3; use those indices to fetch the observations themselves.
b.) From what I have read online, I believe Cook's distance will help us remove the high-leverage points, but I am not sure how large counts as "too large", so I cannot say much about that. Here is how to compute Cook's distance:
cook_dist = dict(enumerate(olsi(results).cooks_distance[0]))
# {key (index): value (Cook's distance)}
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

model = smf.ols('y ~ x', data=df)  # df is the data with columns x, y
model = model.fit()

influence = OLSInfluence(model).summary_frame()
studentized_residuals = influence.student_resid
leverage = influence.hat_diag

studentized_residual_threshold = 3
p = 2               # p is the number of model parameters, including the intercept
n = df.shape[0]     # n is the number of observations
leverage_threshold = 3 * (p / n)

outlier_index = list(
    set(studentized_residuals[abs(studentized_residuals) > studentized_residual_threshold].index)
    | set(leverage[leverage > leverage_threshold].index)
)

# remove the outliers
df.drop(index=outlier_index, inplace=True)
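Putting the pieces together, the flag-drop-refit cycle can be sketched end to end on synthetic data. The column names `x` and `y` follow the snippet above; the two planted bad points (one high-residual, one high-leverage) are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

# Synthetic frame standing in for df
rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 1.5 * df["x"] + rng.normal(scale=0.3, size=n)
df.loc[0, "y"] += 8.0   # high-residual point
df.loc[1, "x"] += 6.0   # high-leverage point

model = smf.ols("y ~ x", data=df).fit()
infl = OLSInfluence(model).summary_frame()

p = 2                        # parameters, including the intercept
lev_thresh = 3 * p / len(df)
bad = df.index[(infl["student_resid"].abs() > 3) | (infl["hat_diag"] > lev_thresh)]

clean = df.drop(index=bad)           # drop flagged rows
refit = smf.ols("y ~ x", data=clean).fit()  # re-run the regression
print(len(df) - len(clean), "points removed")
```

After the refit, the influence plot and Q-Q plot can simply be re-drawn from `refit` the same way they were drawn from the original model.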