How to get XGBoost to work in a hierarchical classifier


I am trying to get XGBoost to work with the hierarchical-classification package available here (the repository is archived).

I can confirm that the module works fine with sklearn's RandomForestClassifier (and the other sklearn estimators I checked), but I cannot get it to work with XGBoost. I understand that the hierarchical classifier needs some modification for this to work, but I cannot figure out what that modification is.

Below is an MWE to reproduce the problem (assuming the library was installed with `pip install sklearn-hierarchical-classification`):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn_hierarchical_classification.classifier import HierarchicalClassifier
from sklearn_hierarchical_classification.constants import ROOT
from sklearn_hierarchical_classification.metrics import h_fbeta_score, multi_labeled
#from sklearn_hierarchical_classification.tests.fixtures import make_digits_dataset

We want to build the following class hierarchy over the data in the handwritten digits dataset:

         <ROOT>
          /   \
         A     B
       /  \   /  \
      1   7  C    9
            / \
           3   8

The data is loaded like this:

def make_digits_dataset(targets=None, as_str=True):
    """Helper function: from sklearn_hierarchical_classification.tests.fixtures module """
    X, y = load_digits(return_X_y=True)
    if targets:
        ix = np.isin(y, targets)
        X, y = X[np.where(ix)], y[np.where(ix)]

    if as_str:
        # Convert targets (classes) to strings
        y = y.astype(str)

    return X, y

class_hierarchy = {
    ROOT: ["A", "B"],
    "A": ["1", "7"],
    "B": ["C", "9"],
    "C": ["3", "8"],
    }
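As a quick sanity check (a plain dict traversal, not a library API), the leaves of this hierarchy should be exactly the digit labels we train on:

```python
# Hypothetical sanity check: leaf nodes of the hierarchy = actual digit classes.
class_hierarchy = {
    "<ROOT>": ["A", "B"],   # "<ROOT>" stands in for the library's ROOT constant
    "A": ["1", "7"],
    "B": ["C", "9"],
    "C": ["3", "8"],
}

# A leaf is any child that never appears as a parent (dict key)
children = {c for cs in class_hierarchy.values() for c in cs}
leaves = sorted(c for c in children if c not in class_hierarchy)
print(leaves)  # ['1', '3', '7', '8', '9']
```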

So:

base1 = RandomForestClassifier()
base2 = XGBClassifier()

clf = HierarchicalClassifier(
    base_estimator=base1,
    class_hierarchy=class_hierarchy,
    )

X, y = make_digits_dataset(targets=[1, 7, 3, 8, 9],
                            as_str=False, )
y = y.astype(str)

RANDOM_STATE = 42  # any fixed seed

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=RANDOM_STATE, )

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

with multi_labeled(y_test, y_pred, clf.graph_) as (y_test_, y_pred_, graph_):
  h_fbeta = h_fbeta_score(
      y_test_, y_pred_, graph_, )

print("h_fbeta_score: ", h_fbeta)
# h_fbeta_score:  0.9690011481056257

This works fine. But using XGBClassifier (base2) raises the following error:

Traceback (most recent call last):
  File "~/hierarchical-classification.py", line 62, in <module>
    clf.fit(X_train, y_train)
  File "~/venv/lib/python3.10/site-packages/sklearn_hierarchical_classification/classifier.py", line 206, in fit
    self._recursive_train_local_classifiers(X, y, node_id=self.root, progress=progress)
  File "~/venv/lib/python3.10/site-packages/sklearn_hierarchical_classification/classifier.py", line 384, in _recursive_train_local_classifiers
    self._train_local_classifier(X, y, node_id)
  File "~/venv/lib/python3.10/site-packages/sklearn_hierarchical_classification/classifier.py", line 453, in _train_local_classifier
    clf.fit(X=X_, y=y_)
  File "~/venv/lib/python3.10/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
  File "~/venv/lib/python3.10/site-packages/xgboost/sklearn.py", line 1438, in fit
    or not (self.classes_ == expected_classes).all()
AttributeError: 'bool' object has no attribute 'all'

I understand that this error comes from the following check in the fit() method of xgboost/sklearn.py:

1436            if (
1437                self.classes_.shape != expected_classes.shape
1438                or not (self.classes_ == expected_classes).all()
1439            ):
1440                raise ValueError(
1441                    f"Invalid classes inferred from unique values of `y`.  "
1442                    f"Expected: {expected_classes}, got {self.classes_}"
1443                )

The expected values of y are [0 1], but ['A' 'B'] (the internal nodes) is received instead. There must be a way to modify the class sklearn_hierarchical_classification.classifier.HierarchicalClassifier so that it works correctly with xgboost.
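For reference, the underlying requirement in recent XGBoost releases is that y be encoded as consecutive integers 0..n_classes-1; the string labels passed down by the hierarchical classifier violate it. sklearn's LabelEncoder produces exactly that encoding and can invert it afterwards (a minimal sketch of the principle, not of the library's internals):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# 'A'/'B' stand in for the internal-node labels the hierarchical
# classifier hands to the base estimator at each node.
y_str = np.array(["A", "B", "A", "B"])

le = LabelEncoder()
y_num = le.fit_transform(y_str)              # encode to 0..n-1
print(y_num.tolist())                        # [0, 1, 0, 1]
print(le.inverse_transform(y_num).tolist())  # ['A', 'B', 'A', 'B']
```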

Is there any solution to this problem?

python machine-learning scikit-learn xgboost hierarchical
1 Answer

Found a way around this issue. One has to create a wrapper around XGBClassifier, as follows:

from sklearn.base import BaseEstimator

class XGBHierarchicalClassifier(BaseEstimator):
    """Wraps XGBClassifier so it accepts arbitrary (string) class labels."""

    def __init__(self, **kwargs):
        self.clf = XGBClassifier(**kwargs)
        self.label_map = {}          # label -> integer code
        self.inverse_label_map = {}  # integer code -> label
        self.label_counter = 0

    def fit(self, X, y):
        # Map string labels to the consecutive integers 0..n-1 that XGBoost expects
        unique_labels = np.unique(y)
        for label in unique_labels:
            if label not in self.label_map:
                self.label_map[label] = self.label_counter
                self.inverse_label_map[self.label_counter] = label
                self.label_counter += 1

        y_numeric = np.array([self.label_map[label] for label in y])

        self.clf.fit(X, y_numeric)
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # Pick the most probable class and map its integer code back to the label
        return np.array([self.inverse_label_map[label]
                         for label in np.argmax(self.predict_proba(X), axis=1)])

    def predict_proba(self, X):
        return self.clf.predict_proba(X)
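An equivalent wrapper (a sketch, not part of the original answer or the library) can delegate the label bookkeeping to sklearn's LabelEncoder. Here `base_estimator` is a hypothetical parameter, so the same wrapper works for any estimator that insists on integer labels:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.preprocessing import LabelEncoder

class LabelEncodedClassifier(BaseEstimator, ClassifierMixin):
    """Sketch: wrap an estimator that requires integer labels 0..n-1
    (e.g. XGBClassifier), encoding/decoding string labels around it."""

    def __init__(self, base_estimator=None):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        self.le_ = LabelEncoder()
        y_num = self.le_.fit_transform(y)      # strings -> 0..n-1
        self.clf_ = clone(self.base_estimator)
        self.clf_.fit(X, y_num)
        self.classes_ = self.le_.classes_      # original labels, sorted
        return self

    def predict(self, X):
        # Decode the integer predictions back to the original labels
        return self.le_.inverse_transform(self.clf_.predict(X))

    def predict_proba(self, X):
        return self.clf_.predict_proba(X)
```

Usage would be e.g. `base3 = LabelEncodedClassifier(XGBClassifier())`. Keeping the constructor argument untouched in `__init__` also keeps the wrapper compatible with sklearn's clone/get_params machinery, which the `**kwargs` constructor above is not.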

Usage:

base3 = XGBHierarchicalClassifier()

clf = HierarchicalClassifier(
    base_estimator=base3,
    class_hierarchy=class_hierarchy,
    )

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

with multi_labeled(y_test, y_pred, clf.graph_) as (y_test_, y_pred_, graph_):
  h_fbeta = h_fbeta_score(
      y_test_, y_pred_, graph_, )

print("h_fbeta_score: ", h_fbeta)
# h_fbeta_score:  0.8501118568232662