SHAP values with a pretrained BERT for binary classification: how to extract a summary plot?

Question

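I am using a pretrained BERT model for binary classification. After training the model on a small dataset, I want to produce a summary plot like this one. However, I want the feature labels to be the important words.

I am not sure everything is working correctly, though, because shap_values has only two dimensions. That actually makes sense. Still, I cannot get the plot, because when I run this code I hit two problems: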

shap.summary_plot(shap_values[:,:10],feature_names=feature_importance['features'].tolist(),features=comments_text)

The first problem is baffling: if I change

shap_values[:,:10]

to

shap_values

or

shap_values[0]

or

shap_values.values

etc., I always get

516: assert len(shap_values.shape) != 1, "Summary plots need a matrix of shap_values, not a vector."
==> AssertionError: Summary plots need a matrix of shap_values, not a vector.

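(first problem)

By the way, my shap_values holds 10 inputs (see shap_values.shape). If I set the upper bound of the slice to a value between 1 and 147, the plot is drawn without an error. However, the plot is then not right: it consists only of blue dots (second problem). Like this.

Note:

shap_values[:,:10]

If the number (10) is changed to a different number, the plot shows different words, but the number of rows in the plot stays the same (at most 20). Only the order of some words changes.

Minimal reproducible example: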

import nlp
import numpy as np
import pandas as pd
import scipy as sp
import torch
import transformers
import shap

# pick the device first so the model and the inputs end up in the same place
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

# load a BERT sentiment analysis model
tokenizer = transformers.DistilBertTokenizerFast.from_pretrained(
    "distilbert-base-uncased"
)
model = transformers.DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).to(device)

def f(x):
    # Encode the batch of sentences
    inputs = tokenizer.batch_encode_plus(
        x.tolist(),
        max_length=450,
        add_special_tokens=True,
        return_attention_mask=True,
        padding='max_length',
        truncation=True,
        return_tensors='pt',
    )

    # Send the tensors to the same device as the model
    input_ids = inputs['input_ids'].to(device)
    attention_masks = inputs['attention_mask'].to(device)
    # Predict
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_masks)[0].detach().cpu().numpy()
    scores = (np.exp(outputs).T / np.exp(outputs).sum(-1)).T
    val = sp.special.logit(scores[:, 1])  # use one vs rest logit units
    return val

# Build an explainer using a token masker
explainer = shap.Explainer(f, tokenizer)

imdb_train = nlp.load_dataset("imdb")["train"]
shap_values = explainer(imdb_train[:10], fixed_context=1, batch_size=16)
cohorts = {"": shap_values}
cohort_labels = list(cohorts.keys())
cohort_exps = list(cohorts.values())
for i in range(len(cohort_exps)):
    if len(cohort_exps[i].shape) == 2:
        cohort_exps[i] = cohort_exps[i].abs.mean(0)
features = cohort_exps[0].data
feature_names = cohort_exps[0].feature_names
#values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))], dtype=object)
values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
feature_importance = pd.DataFrame(list(zip(feature_names, sum(values))), columns=['features', 'importance'])
feature_importance.sort_values(by=['importance'], ascending=False, inplace=True)
shap.summary_plot(shap_values[:,:10],feature_names=feature_importance['features'].tolist(),features=imdb_train['text'][10:20],show=False)

The code above produces the same result. I have spent about 200 compute units on this, but without success :(. What should I do?

Answer 1

Could you try this:

sv = np.array([arr[:100] for arr in shap_values.values])
data = np.array([arr[:100] for arr in shap_values.data])
shap.summary_plot(sv, data, feature_names=feature_importance['features'].tolist())
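
Some background on why truncating to a fixed length helps, as a minimal sketch that uses only NumPy. It assumes that, for text inputs, shap_values.values and shap_values.data each hold one 1-D array per sample, with lengths that differ because each text has a different number of tokens. The arrays below are made-up stand-ins for those attributes, and the helpers to_matrix and mean_abs_by_token are hypothetical names, not part of shap. The first helper builds the rectangular matrix that the assertion asks for; the second aggregates by token string instead of by token position, which may be closer to "words as features".

```python
from collections import defaultdict

import numpy as np

# Made-up stand-ins for shap_values.values and shap_values.data (assumption:
# one 1-D array per text, each with a different number of tokens).
values = np.empty(3, dtype=object)
values[0] = np.array([0.5, -0.2, 0.1])
values[1] = np.array([0.3, 0.4])
values[2] = np.array([-0.7, 0.2, 0.6, -0.3])

tokens = np.empty(3, dtype=object)
tokens[0] = np.array(["good", "film", "."])
tokens[1] = np.array(["good", "."])
tokens[2] = np.array(["bad", "film", "good", "."])

# A ragged collection is a 1-D object array, i.e. a "vector" to summary_plot.
print(values.shape)  # (3,)


def to_matrix(rows, width, fill=0.0):
    """Truncate or pad every row to `width` so the result is a 2-D float matrix."""
    out = np.full((len(rows), width), fill, dtype=float)
    for i, row in enumerate(rows):
        n = min(len(row), width)
        out[i, :n] = row[:n]
    return out


def mean_abs_by_token(value_rows, token_rows):
    """Average |SHAP value| per token string across all samples."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for vals, toks in zip(value_rows, token_rows):
        for v, t in zip(vals, toks):
            key = str(t).strip()
            totals[key] += abs(float(v))
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


matrix = to_matrix(values, width=3)
importance = mean_abs_by_token(values, tokens)
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
```

Note that in a matrix built this way, column j means "the j-th token of each text", not a particular word, so a single list of word labels cannot describe a column correctly for every sample. The cap of 20 rows mentioned in the question matches the default max_display=20 of shap.summary_plot.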
