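I want to tune the parameters of a multiclass LGBM classifier with RandomizedSearchCV using a custom scoring function. This scoring function needs additional data that must not be used for training but is required to compute the score.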
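I have my features_train DataFrame, which contains all the features that must be used for training as well as the additional data needed to compute the score, and my target_train Series.
I define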
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    lgb.LGBMClassifier(),
    param_distributions=param_dist,
    cv=5,
    scoring=get_scoring_function(),
    n_iter=100,
    random_state=41,
    n_jobs=30,
    verbose=0
)
where
from sklearn.metrics import make_scorer

def get_scoring_function():
    def lgbm_scorer(clf, X, y):
        scoring_info: List[List[float]] = X[self.scoring_info_cols].values.tolist()
        X = X.drop(columns=self.scoring_info_cols)
        custom_metric = get_custom_metric(scoring_info=scoring_info)
        dataset = lgb.Dataset(X, label=y)
        preds = clf.predict(X)
        return custom_metric(preds, dataset)[1]
    return make_scorer(lgbm_scorer, greater_is_better=True)[1]
and get_custom_metric is defined as:
def get_custom_metric(scoring_info: List[List[float]]) -> Callable:
    def my_metric(y_pred: np.ndarray, y_true: lgb.Dataset) -> Tuple[str, float, bool]:
        y_labels: np.ndarray = y_true.get_label()
        y_pred_classes = np.argmax(y_pred, axis=1)
        fold_indices = y_true.get_data().index
        these_scoring: List[List[float]] = [scoring_info[i] for i in fold_indices]
        all_scores: List[float] = [these_scoring[i][y_pred_classes[i]] - these_scoring[i][int(y_labels[i])] for i in range(len(y_labels))]
        return "MyMetric", sum(all_scores), True
    return my_metric
When I run
random_search.fit(features_train, target_train)
I get the error:
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/queues.py", line 159, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 215, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 208, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
ValueError: ctypes objects containing pointers cannot be pickled
"""
This error occurs because lgbm_scorer is not picklable, possibly because lgbm_scorer is a complex nested function.
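For reference, the message at the bottom of the traceback can be reproduced in isolation with the standard library alone. The sketch below is only an illustration: the `state` dictionary and its `handle` entry are hypothetical stand-ins for any object that wraps a C-level pointer, and the snippet does not establish which object in the search actually triggered the error.

```python
import ctypes
import pickle


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the error as a string."""
    try:
        pickle.dumps(obj)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


# Hypothetical state that a closure or an instance might carry along:
# plain Python data plus a ctypes pointer to some native resource.
state = {
    "scoring_info_cols": ["score_0", "score_1"],
    "handle": ctypes.pointer(ctypes.c_int(0)),
}

message = try_pickle(state)
print(message)

# The same state without the pointer pickles without complaint.
plain_message = try_pickle({"scoring_info_cols": ["score_0", "score_1"]})
print(plain_message)
```

If this is what is happening here, anything the scorer references (for example through `self`) would need to be free of such handles before it is sent to worker processes.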
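Any idea how to solve this? I could simplify the function by passing the additional scoring_info to my_metric directly, without defining the outer function get_custom_metric. Any idea how to do that without the additional scoring_info being used as features by the model?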
I'm not sure about the pickling error, or whether it is actually related to the custom metric function.
But I think passing the scoring_info columns to the scorer and not to the model itself is straightforward:
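from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Drop the scoring_info columns inside the pipeline so the model never sees them
dropper = ColumnTransformer(
    [('drop', "drop", scoring_info_cols)],
    remainder="passthrough",
)
model = Pipeline([
    ('drop_scoring_info', dropper),
    ('lgbm', LGBMClassifier()),
])
random_search = RandomizedSearchCV(
    model,
    ...,
)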
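To see what the dropping step does on its own, here is a minimal sketch on a toy DataFrame. The column names `f1`, `f2`, `score_0` and `score_1` are made up for illustration; only the `ColumnTransformer` configuration matches the snippet above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Hypothetical columns: two real features and two scoring-only columns.
scoring_info_cols = ["score_0", "score_1"]
X = pd.DataFrame(
    {
        "f1": [1.0, 2.0, 3.0],
        "f2": [0.5, 0.1, 0.9],
        "score_0": [10.0, 20.0, 30.0],
        "score_1": [11.0, 21.0, 31.0],
    }
)

dropper = ColumnTransformer(
    [("drop", "drop", scoring_info_cols)],
    remainder="passthrough",
)

# Only the remaining feature columns reach whatever comes next in a pipeline.
X_model = dropper.fit_transform(X)
print(X_model.shape)
```

Because the drop happens inside the pipeline, the search still hands the full DataFrame to the scorer, while the final estimator is fitted on the feature columns only.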
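You will probably want to avoid the convenience function make_scorer, because it turns a metric with signature (y_test, y_pred) into a scorer with signature (estimator, X_test, y_test). Since you want access to the whole X_test, including the scoring_info columns, define a scorer with that signature directly and pass it as scoring.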
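A minimal sketch of such a scorer, combined with the dropping pipeline, might look like the following. Several things here are assumptions and not part of the question: the data is random, the column names are invented, class labels are assumed to be the integers 0, 1, 2 that index the scoring columns in order, and `LogisticRegression` stands in for `LGBMClassifier` so the example runs without LightGBM installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical names: one scoring column per class, in class order.
SCORING_INFO_COLS = ["score_0", "score_1", "score_2"]


def custom_scorer(estimator, X, y):
    """Scorer with the (estimator, X, y) signature, defined at module level."""
    scoring_info = X[SCORING_INFO_COLS].to_numpy()
    # The pipeline drops the scoring columns itself, so X is passed unchanged.
    pred_idx = np.asarray(estimator.predict(X)).astype(int)
    true_idx = np.asarray(y).astype(int)
    rows = np.arange(len(true_idx))
    # Score of the predicted class minus score of the true class, summed.
    return float(np.sum(scoring_info[rows, pred_idx] - scoring_info[rows, true_idx]))


# Random toy data, for illustration only.
rng = np.random.default_rng(0)
n = 120
features_train = pd.DataFrame(
    {
        "f1": rng.normal(size=n),
        "f2": rng.normal(size=n),
        "score_0": rng.uniform(size=n),
        "score_1": rng.uniform(size=n),
        "score_2": rng.uniform(size=n),
    }
)
target_train = pd.Series(rng.integers(0, 3, size=n))

dropper = ColumnTransformer(
    [("drop", "drop", SCORING_INFO_COLS)],
    remainder="passthrough",
)
model = Pipeline(
    [
        ("drop_scoring_info", dropper),
        ("clf", LogisticRegression(max_iter=1000)),  # stand-in for LGBMClassifier
    ]
)

search = RandomizedSearchCV(
    model,
    param_distributions={"clf__C": [0.1, 1.0, 10.0]},
    n_iter=3,
    cv=3,
    scoring=custom_scorer,  # passed directly, without make_scorer
    random_state=41,
)
search.fit(features_train, target_train)
print(search.best_params_, search.best_score_)
```

Keeping the scorer as a plain module-level function that only touches its arguments also means it carries no extra state with it when the search is run with several jobs.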