I implemented a regression model using smf.ols:

import statsmodels.formula.api as smf

# Parentheses let the formula string span several lines
formula = (
    "cost ~ C(state) + group_size + C(homeowner) + car_age + C(car_value) + "
    "risk_factor + age_oldest + age_youngest + C(married_couple) + c_previous + "
    "duration_previous + C(a) + C(b) + C(c) + C(d) + C(e) + C(f) + C(g)"
)
model_a = smf.ols(formula=formula, data=train).fit()
model_a.summary()
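After fitting the regression model, I applied a Bonferroni correction using smt.multipletests:

import statsmodels.stats.multitest as smt

smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni',
                  is_sorted=False, returnsorted=False)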
I got the following result:
(array([ True, False, True, True, True, True, True, False, True,
True, True, False, True, True, True, True, False, False,
False, False, True, False, True, True, True, True, True,
True, True, False, True, True, False, True, True, False,
True, True, True, True, True, True, True, True, False,
True, True, True, False, False, False, False, False, False,
True, True, True, True, True, False, True, False, True,
False, True, True, True, True]),
array([0.00000000e+00, 1.00000000e+00, 1.45352365e-03, 2.14422252e-21,
5.68726115e-13, 4.81466313e-12, 1.22517937e-05, 3.36565323e-01,
4.81396354e-45, 1.51138583e-05, 4.27572151e-04, 1.00000000e+00,
5.91690245e-10, 2.62041907e-16, 3.12129589e-18, 9.88879325e-13,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 6.85853188e-01,
8.94886169e-07, 1.00000000e+00, 3.55801455e-12, 5.35987286e-54,
7.77655333e-03, 5.45090922e-04, 5.15690091e-03, 7.40791788e-04,
1.24797586e-07, 1.00000000e+00, 2.91991310e-04, 1.75502703e-07,
1.00000000e+00, 2.57023089e-26, 2.34824045e-10, 1.00000000e+00,
2.79360586e-87, 5.26115182e-09, 4.94812967e-08, 3.36073545e-07,
5.06333547e-07, 4.44900552e-07, 1.06078148e-05, 1.42866234e-03,
1.00000000e+00, 3.72074539e-10, 1.38294896e-74, 1.39540646e-69,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00, 1.00000000e+00, 2.78538149e-18, 3.74576314e-22,
1.12111501e-19, 1.14698339e-04, 9.34411232e-18, 1.00000000e+00,
4.10430857e-02, 1.00000000e+00, 5.35030644e-23, 1.00000000e+00,
7.61651080e-20, 9.49735915e-56, 7.90523832e-66, 8.15390766e-94]),
0.0007540287301109894,
0.0007352941176470588)
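
For reference, the tuple returned by multipletests is (reject, pvals_corrected, alphacSidak, alphacBonf). With 68 tests and alpha=0.05, the Bonferroni threshold is 0.05 / 68 ≈ 0.000735, which matches the last value above. The following is a minimal NumPy sketch of what the Bonferroni method computes, not the statsmodels implementation itself; the p-values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical raw p-values, for illustration only (not taken from model_a).
pvals = np.array([1e-6, 0.0004, 0.03, 0.5])
alpha = 0.05
m = len(pvals)  # number of tests being corrected for

# Per-test thresholds (the last two entries of the returned tuple).
alphac_sidak = 1 - (1 - alpha) ** (1 / m)
alphac_bonf = alpha / m

# Bonferroni: reject where the raw p-value is at or below alpha / m.
reject = pvals <= alphac_bonf
# Corrected p-values are the raw values times m, capped at 1.
pvals_corrected = np.minimum(pvals * m, 1.0)

print(reject)
print(pvals_corrected)
print(alphac_sidak, alphac_bonf)
```

This also explains why so many corrected p-values in the output are exactly 1.0: any raw p-value above 1/68 is capped at 1 after multiplication.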
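I want to use these arrays to drop the features of model_a that are False and create a new model, "train_simplified".
I am currently using the manual approach below, but I would like to know whether there is a more efficient way.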
train_simplified = train.drop(train.columns[[0, 1, 2, 4, 10, 16, 25, 27, 28, 30, 36, 38,
41, 44, 47, 55, 61, 62, 63, 64, 65, 66, 67, 68, 69, 75, 78]], axis=1)
You can use loc to select only the True features of model_a.
.loc[] is primarily label based, but it can also be used with a boolean array.
# Dummy data standing in for train (assumes numpy as np and pandas as pd are imported)
train = pd.DataFrame(np.random.rand(5, 68))
0 1 2 3 ... 63 64 65 66 67
0 0.637557 0.887213 0.472215 0.119594 ... 0.908266 0.239562 0.144895 0.489453 0.985650
1 0.242055 0.672136 0.761620 0.237638 ... 0.649633 0.849223 0.657613 0.568309 0.093675
2 0.367716 0.265202 0.243990 0.973011 ... 0.465598 0.542645 0.286541 0.590833 0.030500
3 0.037348 0.822601 0.360191 0.127061 ... 0.070569 0.642419 0.026511 0.585776 0.940230
4 0.575474 0.388170 0.643288 0.458253 ... 0.091206 0.494420 0.057559 0.549529 0.441531
[5 rows x 68 columns]
keep_columns = np.array([ # array from smt.multipletests
True, False, True, True, True, True, True, False, True,
True, True, False, True, True, True, True, False, False,
False, False, True, False, True, True, True, True, True,
True, True, False, True, True, False, True, True, False,
True, True, True, True, True, True, True, True, False,
True, True, True, False, False, False, False, False, False,
True, True, True, True, True, False, True, False, True,
False, True, True, True, True])
np.sum(keep_columns) # 47 (keep 47 columns)
train_simplified = train.loc[:,keep_columns]
Output of train_simplified:
0 2 3 4 ... 62 64 65 66 67
0 0.637557 0.472215 0.119594 0.713245 ... 0.278646 0.239562 0.144895 0.489453 0.985650
1 0.242055 0.761620 0.237638 0.728216 ... 0.746491 0.849223 0.657613 0.568309 0.093675
2 0.367716 0.243990 0.973011 0.393098 ... 0.035942 0.542645 0.286541 0.590833 0.030500
3 0.037348 0.360191 0.127061 0.522243 ... 0.162934 0.642419 0.026511 0.585776 0.940230
4 0.575474 0.643288 0.458253 0.545617 ... 0.789618 0.494420 0.057559 0.549529 0.441531
[5 rows x 47 columns]
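
One caveat: the example above assumes train has exactly 68 columns. The mask has one entry per model term (the intercept plus every dummy level created by C(...)), so it may not line up with the columns of the real train; the manual drop in the question references column index 78, which suggests it does not. A possibly safer first step is to map the mask back to term names through the index of the p-value Series. The sketch below uses a hypothetical Series and mask as stand-ins for model_a.pvalues and the first array returned by multipletests; the term names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model_a.pvalues: a Series indexed by term name.
pvalues = pd.Series(
    [0.0, 0.9, 1e-5, 0.2, 1e-8],
    index=["Intercept", "C(homeowner)[T.1]", "group_size", "car_age", "risk_factor"],
)
# Hypothetical stand-in for the first array returned by multipletests.
reject = np.array([True, False, True, False, True])

# Boolean-index the term names to see which coefficients survived the correction.
kept_terms = pvalues.index[reject].tolist()
dropped_terms = pvalues.index[~reject].tolist()

print(kept_terms)
print(dropped_terms)
```

With the real model, the same indexing on model_a.pvalues.index would give the names of the terms to keep or drop, which can then be used to decide which columns of train to remove or which terms to leave out of the formula.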