SHAP values for binary classification with a pretrained BERT: how do I extract a summary plot?

Question · 0 votes · 1 answer

I am using a pretrained BERT model for binary classification. After training the model on a small dataset, I want to extract a summary plot like this one: the graph I want. However, I want the important features to be shown as words.

However, I am not sure everything is right, because the shape of shap_values is only two-dimensional. Actually, that makes sense. Still, I cannot get the plot, because when I use this code I run into two problems:

shap.summary_plot(shap_values[:,:10], feature_names=feature_importance['features'].tolist(), features=comments_text)

The errors do not make sense to me: if I change

shap_values[:,:10]
to
shap_values
shap_values[0]
shap_values.values
etc., I always get

516: assert len(shap_values.shape) != 1, "Summary plots need a matrix of 
shap_values, not a vector." ==> AssertionError: Summary plots need a matrix of 
shap_values, not a vector.

(first problem)
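For reference, a quick check on the raw values (a sketch, using the shap_values from the minimal example below):

import numpy as np

# If the explained reviews tokenize to different lengths (which they usually do),
# the per-sample SHAP arrays cannot be stacked into a rectangular matrix.
print([len(v) for v in shap_values.values])
# Stacking ragged rows gives a 1-D object array, which is exactly what the
# "needs a matrix ... not a vector" assertion rejects.
print(np.array(list(shap_values.values), dtype=object).ndim)  # 1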

By the way, my shap_values consist of 10 inputs (shap_values.shape). If I choose any upper bound between 1 and 147 for the slice, the plot is drawn without errors. At that point, however, the plot is not right: it consists only of blue dots (this is the second problem), like this: only blue dots
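A quick shape check for that call (a sketch; imdb_train['text'][10:20] from the example below plays the role of comments_text, and numpy is imported as np as above):

print(np.array([v for v in shap_values[:, :10].values]).shape)  # (10, 10): 10 samples x first 10 tokens
print(len(imdb_train['text'][10:20]))                           # 10 whole review strings passed as `features`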

Note:

shap_values[:,:10]
If the number (10) is changed to a different number, the plot shows different words, but the total number of words in the plot stays the same (at most 20); only the order of some of the words changes.
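I suspect the cap of 20 words comes from summary_plot's max_display argument, which defaults to 20; passing a larger value should display more rows (sketch, same shap_values as above):

shap.summary_plot(shap_values[:, :50], max_display=50)  # show up to 50 tokens instead of the default 20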

Minimal reproducible example:

import nlp
import numpy as np
import pandas as pd
import scipy as sp
import torch
import transformers
import shap

# load a BERT sentiment analysis model
tokenizer = transformers.DistilBertTokenizerFast.from_pretrained(
    "distilbert-base-uncased"
)
model = transformers.DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).cuda()


if torch.cuda.is_available():
    device = torch.device("cuda")
    print('We will use the GPU:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

def f(x):
    # Encode the batch of sentences
    inputs = tokenizer.batch_encode_plus(
        x.tolist(),
        max_length=450,
        add_special_tokens=True,
        return_attention_mask=True,
        padding='max_length',
        truncation=True,
        return_tensors='pt',
    )

    # Send the tensors to the same device as the model
    input_ids = inputs['input_ids'].to(device)
    attention_masks = inputs['attention_mask'].to(device)
    # Predict
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_masks)[0].detach().cpu().numpy()
    scores = (np.exp(outputs).T / np.exp(outputs).sum(-1)).T
    val = sp.special.logit(scores[:, 1])  # use one vs rest logit units
    return val
# Build an explainer using a token masker
explainer = shap.Explainer(f, tokenizer)

imdb_train = nlp.load_dataset("imdb")["train"]
shap_values = explainer(imdb_train[:10], fixed_context=1, batch_size=16)
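# Rank tokens by mean |SHAP| value across the explained samples (word-importance table)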
cohorts = {"": shap_values}
cohort_labels = list(cohorts.keys())
cohort_exps = list(cohorts.values())
for i in range(len(cohort_exps)):
    if len(cohort_exps[i].shape) == 2:
        cohort_exps[i] = cohort_exps[i].abs.mean(0)
features = cohort_exps[0].data
feature_names = cohort_exps[0].feature_names
#values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))], dtype=object)
values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
feature_importance = pd.DataFrame(list(zip(feature_names, sum(values))), columns=['features', 'importance'])
feature_importance.sort_values(by=['importance'], ascending=False, inplace=True)
shap.summary_plot(shap_values[:,:10],feature_names=feature_importance['features'].tolist(),features=imdb_train['text'][10:20],show=False)

The code above produces the same results. I have tried a great many variations, but without success :( What should I do?

python machine-learning bert-language-model text-classification shap
1 Answer
0 votes

Would you try:

sv = np.array([arr[:100] for arr in shap_values.values])
data = np.array([arr[:100] for arr in shap_values.data])
shap.summary_plot(sv, data, feature_names=feature_importance['features'].tolist())
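Slicing both .values and .data to the first 100 tokens of every sample makes all rows the same length, so they stack into a proper 2-D matrix and the "Summary plots need a matrix" assertion no longer fires. Note that this assumes every explained sample has at least 100 tokens; shorter samples would leave the arrays ragged again, and you would need to pad them or pick a smaller cutoff.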