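I am using a pretrained BERT model for binary classification. After training the model on a small dataset, I want to produce a summary plot like this one. However, I want the important features to be shown as words.
I am not sure everything is right, though, because shap_values has only two dimensions. That actually makes sense. Even so, I cannot get the plot, because I run into two problems when I use this code: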
shap.summary_plot(shap_values[:, :10], feature_names=feature_importance['features'].tolist(), features=comments_text)
The behavior makes no sense to me: whether I pass shap_values[:,:10], shap_values, shap_values[0], shap_values.values, and so on, I always get:
516: assert len(shap_values.shape) != 1, "Summary plots need a matrix of shap_values, not a vector."
==> AssertionError: Summary plots need a matrix of shap_values, not a vector.
(first problem)
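One plausible cause of this assertion, sketched with plain NumPy and no SHAP involved: texts tokenize to different lengths, so the per-sample value arrays are ragged, and a ragged collection can only be stored as a 1-D array of objects. The arrays below are made-up stand-ins, not real SHAP output, and whether this is exactly what happens inside summary_plot for a given SHAP version is an assumption.

```python
import numpy as np

# Hypothetical stand-in for the per-sample SHAP values of three texts
# with 4, 6 and 5 tokens (the numbers are made up for illustration).
ragged_rows = [np.zeros(4), np.ones(6), np.full(5, 0.5)]

# Rows of unequal length can only be stored as a 1-D array of objects,
# which is the "vector" shape the assertion complains about.
ragged = np.array(ragged_rows, dtype=object)
print(ragged.shape)

# Cutting every row to a common length gives a true 2-D matrix.
matrix = np.array([row[:4] for row in ragged_rows])
print(matrix.shape)
```

If this is the cause, any slice that still leaves rows of unequal length would trigger the same error.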
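By the way, my shap_values covers 10 inputs (see shap_values.shape). If I pick the maximum value from the range 1 to 147, the plot is drawn without errors. However, the plot is then not what I want: it consists of blue dots only (second problem). Like this.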
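Note: if the number (10) in shap_values[:,:10] is changed to a different number, the plot shows different words, but the number of entries in the plot stays the same (at most 20). Only the order of some words changes.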
Minimal reproducible example:
import nlp
import numpy as np
import pandas as pd
import scipy as sp
import scipy.special  # make sure sp.special.logit is available
import torch
import transformers
import shap
# choose the device first, so the model is only moved to a GPU that exists
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
# load a BERT sentiment analysis model
tokenizer = transformers.DistilBertTokenizerFast.from_pretrained(
    "distilbert-base-uncased"
)
model = transformers.DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).to(device)
def f(x):
    # Encode the batch of sentences
    inputs = tokenizer.batch_encode_plus(
        x.tolist(), max_length=450, add_special_tokens=True,
        return_attention_mask=True, padding='max_length',
        truncation=True, return_tensors='pt',
    )
    # Send the tensors to the same device as the model
    input_ids = inputs['input_ids'].to(device)
    attention_masks = inputs['attention_mask'].to(device)
    # Predict
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_masks)[0].detach().cpu().numpy()
    # Softmax over the class logits
    scores = (np.exp(outputs).T / np.exp(outputs).sum(-1)).T
    val = sp.special.logit(scores[:, 1])  # use one vs rest logit units
    return val
# Build an explainer using a token masker
explainer = shap.Explainer(f, tokenizer)
imdb_train = nlp.load_dataset("imdb")["train"]
shap_values = explainer(imdb_train[:10], fixed_context=1, batch_size=16)
cohorts = {"": shap_values}
cohort_labels = list(cohorts.keys())
cohort_exps = list(cohorts.values())
for i in range(len(cohort_exps)):
    if len(cohort_exps[i].shape) == 2:
        cohort_exps[i] = cohort_exps[i].abs.mean(0)
features = cohort_exps[0].data
feature_names = cohort_exps[0].feature_names
#values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))], dtype=object)
values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
feature_importance = pd.DataFrame(list(zip(feature_names, sum(values))), columns=['features', 'importance'])
feature_importance.sort_values(by=['importance'], ascending=False, inplace=True)
shap.summary_plot(shap_values[:, :10], feature_names=feature_importance['features'].tolist(), features=imdb_train['text'][10:20], show=False)
The code above produces the same result. I have spent about 200 compute units on this without success :(. What should I do?
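A side note on the wrapper f in the example: it applies a softmax to the two class logits and then takes the logit of the class-1 probability. For a two-class head this should equal the difference between the two raw logits, so the explained quantity is in log-odds units. A minimal check with made-up numbers (the helper name one_vs_rest_logit is hypothetical and only mirrors the arithmetic in f):

```python
import math

def one_vs_rest_logit(logits):
    """Softmax over two raw logits, then the logit of the class-1 probability."""
    exps = [math.exp(z) for z in logits]
    p1 = exps[1] / sum(exps)
    return math.log(p1 / (1.0 - p1))

raw = [0.3, 2.1]  # made-up raw logits for (negative, positive)
value = one_vs_rest_logit(raw)

# For two classes the result matches the logit difference up to rounding.
print(value, raw[1] - raw[0])
```

The practical consequence is that the SHAP values for one text sum, together with the base value, to a log-odds score and not to a probability.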
Could you try this:
sv = np.array([arr[:100] for arr in shap_values.values])
data = np.array([arr[:100] for arr in shap_values.data])
shap.summary_plot(sv, data, feature_names=feature_importance['features'].tolist())
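The slicing above only gives a rectangular matrix if every text has at least 100 tokens. A minimal sketch of a variant that also pads shorter rows, using made-up arrays in place of shap_values.values and shap_values.data; the helper to_matrix, the width of 3, and the fill values are assumptions for illustration, not part of SHAP:

```python
import numpy as np

def to_matrix(rows, width, fill, dtype):
    """Truncate or right-pad each 1-D row to exactly `width` entries."""
    out = np.full((len(rows), width), fill, dtype=dtype)
    for i, row in enumerate(rows):
        n = min(len(row), width)
        out[i, :n] = row[:n]
    return out

# Made-up stand-ins for per-text SHAP values and their tokens.
values_rows = [np.array([0.2, -0.1, 0.4]),
               np.array([0.5]),
               np.array([0.1, 0.3, -0.2, 0.6])]
token_rows = [np.array(["a", "good", "film"]),
              np.array(["bad"]),
              np.array(["not", "so", "bad", "really"])]

# Padded positions get a SHAP value of 0.0 and an empty token.
sv = to_matrix(values_rows, width=3, fill=0.0, dtype=float)
data = to_matrix(token_rows, width=3, fill="", dtype=object)
print(sv.shape, data.shape)
```

Note that in a matrix built this way, column j stands for token position j and not for one particular word, so the list passed as feature_names needs one entry per column and the resulting plot ranks positions.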