How to get SHAP values for each class on a multiclass classification problem in Python

Question · Votes: 0 · Answers: 6

I have the following dataframe:

import pandas as pd
import random

import xgboost
import shap

foo = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
                   'var1':random.sample(range(1, 100), 10),
                   'var2':random.sample(range(1, 100), 10),
                   'var3':random.sample(range(1, 100), 10),
                   'class': ['a','a','a','a','a','b','b','c','c','c']})

I would like to run a classification algorithm to predict the 3 classes.

So I split my dataset into train and test and ran an xgboost classifier:

from sklearn.model_selection import train_test_split

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)


model = xgboost.XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)

Now I would like to get the mean SHAP values for each class, instead of the mean of the absolute SHAP values generated by this code:

shap_values = shap.TreeExplainer(model).shap_values(X_test)
shap.summary_plot(shap_values, X_test)

Also, the plot labels the classes as 0, 1, 2. How can I tell which original class each of 0, 1 and 2 corresponds to?

Because this code:

shap.summary_plot(shap_values, X_test, class_names= ['a', 'b', 'c'])
gives one plot,

and this code:

shap.summary_plot(shap_values, X_test, class_names= ['b', 'c', 'a'])
gives a different one,

so I am no longer sure about the legend. Any ideas?

python python-3.x machine-learning xgboost shap
6 Answers
2 votes
I had the same question, and maybe this issue can help:

https://github.com/slundberg/shap/issues/764

I haven't tested it yet, but it looks like the order should be the same as the one you get when calling model.predict_proba(). In the link above, it is suggested to use the class_names=model.classes_ option of the summary plot.
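As a quick sanity check of that claim, here is a small sketch (using sklearn's RandomForestClassifier for illustration, since classes_ and predict_proba follow the same convention there; the toy data is made up) showing that classes_ fixes the column order of the probabilities, which is also the order of the shap_values list:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny toy problem with string labels, just to inspect the ordering
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [20.0], [21.0]])
y = np.array(['a', 'a', 'a', 'b', 'b', 'c', 'c'])

clf = RandomForestClassifier(random_state=0).fit(X, y)

# classes_ holds the labels in the order used by predict_proba's columns
print(list(clf.classes_))  # ['a', 'b', 'c'] (labels are sorted at fit time)
assert clf.predict_proba(X).shape[1] == len(clf.classes_)
```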


2 votes
The SHAP values are returned as a list. You can access the SHAP values for a given class via its index.

For the summary plot of class 0, the code is:

shap.summary_plot(shap_values[0], X_test)
    
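To keep track of which index belongs to which class, you can pair the list up with model.classes_. A minimal sketch, using stand-in numpy arrays in place of the real TreeExplainer output:

```python
import numpy as np

# Stand-ins for the list TreeExplainer(model).shap_values(X_test) returns
# for a 3-class model: one (n_samples, n_features) array per class
shap_values = [np.zeros((4, 3)), np.ones((4, 3)), np.full((4, 3), 2.0)]
classes = np.array(['a', 'b', 'c'])  # what model.classes_ would hold

# Map each class label to its per-class SHAP matrix by position
per_class = dict(zip(classes, shap_values))
assert per_class['b'].mean() == 1.0  # index 1 belongs to class 'b'
```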

2 votes
After doing some research, and with the help of this post and @Alessandro Nesti's answer, here is my solution:

import random

import numpy as np
import pandas as pd
import xgboost
import shap
import plotly_express as px
from sklearn.model_selection import train_test_split

foo = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a','a','a','a','a','b','b','c','c','c']})

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)

model = xgboost.XGBClassifier(objective="multi:softmax")
model.fit(X_train, y_train)

shap_values = shap.TreeExplainer(model).shap_values(X_test)

def get_ABS_SHAP(df_shap, df):
    # Make a copy of the input data
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.copy().reset_index().drop('index', axis=1)

    # Determine the correlation in order to plot with different colors
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i], df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list), pd.Series(corr_list)], axis=1).fillna(0)

    # Make a data frame. Column 1 is the feature, and Column 2 is the correlation coefficient
    corr_df.columns = ['Variable', 'Corr']
    corr_df['Sign'] = np.where(corr_df['Corr'] > 0, 'red', 'blue')

    # Mean absolute SHAP value per feature, signed by the correlation
    shap_abs = np.abs(shap_v)
    k = pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable', 'SHAP_abs']
    k2 = k.merge(corr_df, left_on='Variable', right_on='Variable', how='inner')
    k2 = k2.sort_values(by='SHAP_abs', ascending=True)
    k2_f = k2[['Variable', 'SHAP_abs', 'Corr']]
    k2_f['SHAP_abs'] = k2_f['SHAP_abs'] * np.sign(k2_f['Corr'])
    k2_f.drop(columns='Corr', inplace=True)
    k2_f.rename(columns={'SHAP_abs': 'SHAP'}, inplace=True)
    return k2_f

foo_all = pd.DataFrame()
for k, v in enumerate(model.classes_):
    foo = get_ABS_SHAP(shap_values[k], X_test)
    foo['class'] = v
    foo_all = pd.concat([foo_all, foo])

px.bar(foo_all, x='SHAP', y='Variable', color='class')
This results in the following plot:
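The core of get_ABS_SHAP is: take the mean absolute SHAP value per feature, then give it the sign of the correlation between the feature value and its SHAP value. A compact sketch of that idea on toy data (the arrays below are made up, not real SHAP output):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({'var1': rng.normal(size=50), 'var2': rng.normal(size=50)})
# Fake SHAP matrix: var1 pushes predictions up, var2 pushes them down
shap_vals = pd.DataFrame({'var1': 2.0 * X['var1'], 'var2': -0.5 * X['var2']})

# Sign of the feature/SHAP correlation per column
signs = np.sign([np.corrcoef(shap_vals[c], X[c])[0, 1] for c in X.columns])
# Mean |SHAP| per feature, with the correlation's sign attached
signed_mean = shap_vals.abs().mean().values * signs
print(dict(zip(X.columns, signed_mean)))  # var1 positive, var2 negative
```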


1 vote
Here is an updated version of @quant's code:

import pandas as pd
import random
import numpy as np
import xgboost
import shap
from sklearn.model_selection import train_test_split
import plotly_express as px

foo = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a','a','a','a','a','b','b','c','c','c']})
foo['class'], _ = pd.factorize(foo['class'], sort=True)

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)

model = xgboost.XGBClassifier(objective="multi:softmax")
model.fit(X_train, y_train)

shap_values = shap.TreeExplainer(model).shap_values(X_test)

def get_ABS_SHAP(df_shap, df):
    # Make a copy of the input data
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.copy().reset_index().drop('index', axis=1)

    # Determine the correlation in order to plot with different colors
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i], df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list), pd.Series(corr_list)], axis=1).fillna(0)

    # Make a data frame. Column 1 is the feature, and Column 2 is the correlation coefficient
    corr_df.columns = ['Variable', 'Corr']
    corr_df['Sign'] = np.where(corr_df['Corr'] > 0, 'red', 'blue')

    shap_abs = np.abs(shap_v)
    k = pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable', 'SHAP_abs']
    k2 = k.merge(corr_df, left_on='Variable', right_on='Variable', how='inner')
    k2 = k2.sort_values(by='SHAP_abs', ascending=True)
    k2_f = k2[['Variable', 'SHAP_abs', 'Corr']]
    k2_f['SHAP_abs'] = k2_f['SHAP_abs'] * np.sign(k2_f['Corr'])
    k2_f.drop(columns='Corr', inplace=True)
    k2_f.rename(columns={'SHAP_abs': 'SHAP'}, inplace=True)
    return k2_f

foo_all = pd.DataFrame()
for k, v in enumerate(model.classes_):
    foo = get_ABS_SHAP(shap_values[k], X_test)
    foo['class'] = v
    foo_all = pd.concat([foo_all, foo])

px.bar(foo_all, x='SHAP', y='Variable', color='class')
    

0 votes
IMHO, the custom solutions are overly complicated.

Solution

shap.summary_plot(shap_values, X_test, class_inds="original", class_names=model.classes_)

Explanation

  • Pass the class names to summary_plot via class_names. These must reflect the order of the predictions. Since that order is not known a priori, model.classes_ can usually be used for this purpose;
  • Instruct shap to stick to the original order of the predictions instead of sorting them: class_inds="original" (see the relevant code here).
P.S. I used shap 0.40.0.

P.P.S. I couldn't run your example, because my version of XGBoost does not allow strings as the target class. But it works with a label-encoded target, as well as with other model types (sklearn RandomForestClassifier, lgb.LGBMClassifier).

0 votes
You first need to use a LabelEncoder, then classes_:

import pandas as pd
import random
import xgboost
import shap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

foo = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a','a','a','a','a','b','b','c','c','c']})

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train.values.ravel())
y_test_encoded = label_encoder.transform(y_test.values.ravel())

model = xgboost.XGBClassifier(objective="multi:softprob",
                              num_class=len(label_encoder.classes_))
model.fit(X_train, y_train_encoded)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

classes = label_encoder.inverse_transform(range(len(label_encoder.classes_)))

shap.summary_plot(shap_values, X_test, class_names=classes)
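The LabelEncoder round trip above is what links the 0, 1, 2 labels in the plot back to the original class names. In isolation:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform sorts the unique labels and assigns 0, 1, 2, ...
codes = le.fit_transform(['a', 'a', 'b', 'c', 'c'])
print(list(codes))        # [0, 0, 1, 2, 2]
print(list(le.classes_))  # ['a', 'b', 'c']
# inverse_transform maps the integer codes back to the original names
print(list(le.inverse_transform([0, 1, 2])))  # ['a', 'b', 'c']
```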
