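I have some questions about how sklearn imputes values within an established CV and pipeline framework. The whole point is to avoid global imputation, which would distort model performance through data leakage. After looking through several links and guides, I mixed and matched and settled on what is in the code snippet below. I am trying to use it with several linear models, but this example sticks to the Lasso.
My dataset consists of 95 numerical and 5 categorical parameters. With 100 observations in total, NaNs occur throughout (8-27% per column). There are no NaNs in my response, y.
Here I try to impute with KNN and scale the data accordingly; for the categorical variables I use most-frequent imputation and one-hot encoding, respectively.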
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LassoLars
# Data prep
"""
We have 100 parameters; 95 are numerical and 5 are categorical.
Each row contains numerous missing values, which we need to impute *inside* the CV-loop
"""
df = pd.read_csv(dataPath, delimiter=',', skipinitialspace = True)
y = df.loc[:, 'y']
df = df.drop(['y'], axis=1)
df_num = df[df.columns[:95]]
df_cat = df[df.columns[95:]]
num_na = df_num.columns.to_list()
cat_na = df_cat.columns.to_list()
# Apply KNNImputing, scale afterwards
numeric_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=2, weights="uniform")),
    ('scaler', StandardScaler())])
# Most common occurrence imputing and dummy encoding
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OrdinalEncoder())])
# Do it for the established columns in df
preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('numeric', numeric_transformer, num_na),
        ('categorical', categorical_transformer, cat_na)
    ])
# Nested CV: cross_val_score is run on the configured GridSearchCV, which refits the
# best-performing model on each outer training fold and scores it on the outer test fold
cv_inner = KFold(n_splits=3, shuffle=True, random_state=42)  # Parameter validation
cv_outer = KFold(n_splits=5, shuffle=True, random_state=42)  # Model validation
# Define the model
pipe = Pipeline(steps=[('preprocessing', preprocessor),
                       ('clf', Lasso())])
# Grid search over the parameter grid
alpha_grid = np.logspace(-4, 3, 100)
param_grid = [{'clf__alpha': alpha_grid}]
grid = GridSearchCV(pipe, cv=cv_inner, param_grid=param_grid, verbose=1,
                    return_train_score=True, scoring='neg_root_mean_squared_error',
                    refit=True, n_jobs=-1)
scores = cross_val_score(grid, df, y, scoring='neg_root_mean_squared_error',
                         cv=cv_outer, n_jobs=-1)
print('Avg. RMSE across outer CV: %.3f ' % (np.mean(-scores)))
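The text describes one-hot encoding for the categorical columns, while the step named 'onehot' uses OrdinalEncoder. As a point of comparison, here is a minimal sketch of the same nested-CV layout with OneHotEncoder swapped in. Everything about the data is an assumption made for illustration: the synthetic frame, the column names (`num_*`, `cat_*`), the 15% missing rate, and the much smaller alpha grid are hypothetical and not taken from the real dataset. This is a variation on the code above, not the code that produced the warning quoted below.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100 rows, 6 numeric and 2 categorical columns
n_rows = 100
num_cols = [f"num_{i}" for i in range(6)]
cat_cols = ["cat_0", "cat_1"]
X_num = pd.DataFrame(rng.normal(size=(n_rows, len(num_cols))), columns=num_cols)
X_cat = pd.DataFrame({
    "cat_0": rng.choice(["a", "b", "c"], size=n_rows),
    "cat_1": rng.choice(["x", "y"], size=n_rows),
}).astype(object)

# Response built before any values are removed, so it contains no NaN
y_demo = 2.0 * X_num["num_0"] + rng.normal(scale=0.1, size=n_rows)

# Remove roughly 15% of the feature entries at random
X_num = X_num.mask(rng.random(X_num.shape) < 0.15)
X_cat = X_cat.mask(rng.random(X_cat.shape) < 0.15)
X_demo = pd.concat([X_num, X_cat], axis=1)

numeric_transformer = Pipeline(steps=[
    ("imputer", KNNImputer(n_neighbors=2, weights="uniform")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # One-hot instead of ordinal codes; categories unseen in a fold are ignored
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("numeric", numeric_transformer, num_cols),
    ("categorical", categorical_transformer, cat_cols),
])

pipe = Pipeline(steps=[("preprocessing", preprocessor), ("clf", Lasso())])

cv_inner = KFold(n_splits=3, shuffle=True, random_state=42)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=42)

# Deliberately small grid so the sketch runs quickly
grid = GridSearchCV(pipe, param_grid={"clf__alpha": np.logspace(-3, 1, 5)},
                    cv=cv_inner, scoring="neg_root_mean_squared_error", refit=True)

# All imputers, scalers and encoders are refit inside every training fold
scores = cross_val_score(grid, X_demo, y_demo, cv=cv_outer,
                         scoring="neg_root_mean_squared_error")
print("Avg. RMSE across outer CV: %.3f" % np.mean(-scores))
```

Because the preprocessing lives inside the pipeline that GridSearchCV clones, each inner and outer training fold fits its own imputation and scaling statistics, which is the leakage-avoiding behaviour the setup is after.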
When I run this framework, I get
RuntimeWarning: invalid value encountered in divide
* (last_sum / last_over_new_count - new_sum) ** 2
which seems to allude to a division involving the NaNs that are present. Could any clever scikit-learn minds sanity-check me on this?
Thanks in advance.
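One way to narrow this down could be to fit the numeric preprocessing on each outer training fold separately and count any NaNs that survive it. The sketch below does that on synthetic data. The helper `nan_report`, the column names, and the 20% missing rate are hypothetical and only serve the illustration; it is a diagnostic idea under those assumptions, not a confirmed explanation of the warning.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.impute import KNNImputer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def nan_report(transformer, X, cv):
    """Fit a clone of `transformer` on each training fold and count leftover NaNs."""
    rows = []
    for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        fitted = clone(transformer).fit(X_train)
        rows.append({
            "fold": fold,
            # Columns with no observed value at all in this training fold
            "all_nan_train_cols": int(X_train.isna().all().sum()),
            # NaNs remaining after imputation and scaling
            "nan_after_train": int(np.isnan(fitted.transform(X_train)).sum()),
            "nan_after_test": int(np.isnan(fitted.transform(X_test)).sum()),
        })
    return pd.DataFrame(rows)


# Hypothetical numeric data with roughly 20% of the entries missing
rng = np.random.default_rng(1)
X_check = pd.DataFrame(rng.normal(size=(100, 5)),
                       columns=[f"num_{i}" for i in range(5)])
X_check = X_check.mask(rng.random(X_check.shape) < 0.20)

numeric_check = Pipeline(steps=[
    ("imputer", KNNImputer(n_neighbors=2, weights="uniform")),
    ("scaler", StandardScaler()),
])

report = nan_report(numeric_check, X_check,
                    KFold(n_splits=5, shuffle=True, random_state=42))
print(report)
```

On the real data, the same helper could be called with the numeric transformer and the numeric columns of df. Non-zero counts in any fold would point at the preprocessing, whereas all-zero counts would suggest the warning is raised inside the scaler's internal arithmetic rather than by NaNs reaching the model.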