Custom scoring function for LGBM with RandomizedSearchCV


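I want to tune the parameters of a multiclass classification LGBM model with RandomizedSearchCV, using a custom scoring function. This custom scoring function needs additional data that must not be used for training but is required to compute the score.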

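I have my features_train dataframe, which contains all the features that must be used for training as well as the additional data needed to compute the score, and my target_train series.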

I define

import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    lgb.LGBMClassifier(),
    param_distributions=param_dist,
    cv=5,
    scoring=get_scoring_function(),
    n_iter=100,
    random_state=41,
    n_jobs=30,
    verbose=0
)

where

from sklearn.metrics import make_scorer

def get_scoring_function():

    def lgbm_scorer(clf, X, y):
        scoring_info: List[List[float]] = X[self.scoring_info_cols].values.tolist()
        X = X.drop(columns=self.scoring_info_cols)
        custom_metric = get_custom_metric(scoring_info=scoring_info)
        dataset = lgb.Dataset(X, label=y)
        preds = clf.predict(X)
        return custom_metric(preds, dataset)[1]

    return make_scorer(lgbm_scorer, greater_is_better=True)[1]

where get_custom_metric is defined as:

from typing import Callable, List, Tuple

import numpy as np

def get_custom_metric(scoring_info: List[List[float]]) -> Callable:
    def my_metric(y_pred: np.ndarray, y_true: lgb.Dataset) -> Tuple[str, float, bool]:
        y_labels: np.ndarray = y_true.get_label()
        y_pred_classes = np.argmax(y_pred, axis=1)
        fold_indices = y_true.get_data().index
        these_scoring: List[List[float]] = [scoring_info[i] for i in fold_indices]
        all_scores: List[float] = [these_scoring[i][y_pred_classes[i]] - these_scoring[i][int(y_labels[i])] for i in range(len(y_labels))]
        return "MyMetric", sum(all_scores), True
    return my_metric

When I run random_search.fit(features_train, target_train), I get the error:

joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/queues.py", line 159, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 215, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 208, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
ValueError: ctypes objects containing pointers cannot be pickled
"""

This error arises because lgbm_scorer is not "pickleable", which is probably because lgbm_scorer is a complex nested function.

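Any idea how to fix this? I could simplify the functions by passing the additional scoring_info to my_metric directly, without having to define the outer function get_custom_metric. Any idea how to do that without using the additional scoring_info as a feature of the model?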

python optimization lightgbm scoring
1 Answer

I'm not sure about the pickling error, or whether it is actually related to the custom metric function.
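The ValueError in the traceback is the message that ctypes raises when an object holding a C pointer is pickled, so one way to narrow it down is to try pickling each thing the scorer references on its own. The following is a minimal sketch, assuming the standard-library pickle as a stand-in for the cloudpickle-based serializer that joblib's loky backend uses; pickling_error and the entries in candidates are hypothetical names chosen for illustration.

```python
import ctypes
import pickle


def pickling_error(obj):
    """Return None if obj pickles cleanly, otherwise a short error description."""
    try:
        pickle.dumps(obj)
    except Exception as exc:  # pickle can raise several different exception types
        return f"{type(exc).__name__}: {exc}"
    return None


# Hypothetical candidates: plain data versus something wrapping a native handle.
candidates = {
    "column names": ["info_0", "info_1"],
    "object with ctypes handle": {"handle": ctypes.pointer(ctypes.c_int(0))},
}

# Map each candidate name to None (picklable) or to the error it triggers.
report = {name: pickling_error(obj) for name, obj in candidates.items()}
for name, error in report.items():
    print(f"{name}: {error or 'ok'}")
```

Running the same check on the objects the real scorer closes over (for example whatever self refers to in lgbm_scorer) may show which of them carries the native handle.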

But I think passing the scoring_info columns to the scorer, and not to the model itself, is straightforward:

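from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Drop the scoring-info columns before the data reaches the model.
dropper = ColumnTransformer(
    [('drop', "drop", scoring_info_cols)],
    remainder="passthrough",
)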
model = Pipeline([
    ('drop_scoring_info', dropper),
    ('lgbm', LGBMClassifier()),
])
random_search = RandomizedSearchCV(
    model,
    ...,
)

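You probably don't want to use the convenience function make_scorer, since it turns a metric with signature (y_test, y_pred) into a scorer with signature (estimator, X_test, y_test). Since you want access to the entire X_test, just define such a scorer directly.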
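A directly defined scorer could look like the sketch below. It is not the asker's exact metric setup: it assumes the scoring-info columns hold one value per class, ordered by class label, that the labels are the integers 0..K-1, and that every class occurs in each training fold so that column k of predict_proba corresponds to class k. The names SCORING_INFO_COLS, payoff_scorer and FixedProba are hypothetical, and FixedProba is only a stand-in estimator so the example runs without training a model.

```python
import numpy as np
import pandas as pd

# Hypothetical column names: one scoring value per class, ordered by class label.
SCORING_INFO_COLS = ["info_0", "info_1", "info_2"]


def payoff_scorer(estimator, X, y):
    """Scorer with the (estimator, X, y) signature accepted by `scoring=`.

    X is the full validation frame, scoring-info columns included; the
    estimator is expected to ignore those columns itself (e.g. a pipeline
    whose first step drops them).
    """
    info = X[SCORING_INFO_COLS].to_numpy()
    pred = np.argmax(estimator.predict_proba(X), axis=1)
    true = np.asarray(y).astype(int)
    rows = np.arange(len(true))
    # Same quantity as my_metric: value at predicted class minus value at true class.
    return float(np.sum(info[rows, pred] - info[rows, true]))


class FixedProba:
    """Demo-only stand-in estimator that always returns the same probabilities."""

    def __init__(self, proba):
        self.proba = np.asarray(proba)

    def predict_proba(self, X):
        return self.proba


X_val = pd.DataFrame({
    "feature": [0.1, 0.2],
    "info_0": [1.0, 4.0],
    "info_1": [2.0, 5.0],
    "info_2": [3.0, 6.0],
})
y_val = pd.Series([0, 2])
demo = FixedProba([[0.1, 0.7, 0.2], [0.2, 0.2, 0.6]])

# Row 0 predicts class 1 instead of 0 (2.0 - 1.0); row 1 is correct (0.0).
score = payoff_scorer(demo, X_val, y_val)
print(score)
```

With the pipeline above, passing scoring=payoff_scorer to RandomizedSearchCV hands each validation fold to the scorer with the scoring columns still attached, while the drop_scoring_info step keeps them away from LightGBM. Because the scoring values travel with the rows of X, no index lookup into a separate scoring_info list is needed.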
