_batch_encode_plus() got an unexpected keyword argument 'return_attention_masks'

Question — votes: 0 · answers: 2

I am working on a RoBERTa model to detect emotion in tweets, on Google Colaboratory, following this Kaggle notebook: https://www.kaggle.com/ishivinal/tweet-emotions-analysis-using-lstm-glove-roberta?scriptVersionId=38608295

Code snippet:

def regular_encode(texts, tokenizer, maxlen=512):
    enc_di = tokenizer.batch_encode_plus(
        texts,
        return_attention_masks=True,
        return_token_type_ids=False,
        pad_to_max_length=True,
        #padding=True,
        max_length=maxlen
    )

    return np.array(enc_di['input_ids'])

def build_model(transformer, max_len=160):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(13, activation='softmax')(cls_token)

    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])

    return model


AUTO = tf.data.experimental.AUTOTUNE
MODEL = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(MODEL)


X_train_t = regular_encode(X_train, tokenizer, maxlen= max_len)
X_test_t = regular_encode(X_test, tokenizer, maxlen=max_len)

At the regular_encode step I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-101-4e1e74c2ea8f> in <module>()
----> 1 X_train_t = regular_encode(X_train, tokenizer, maxlen= max_len)
      2 X_test_t = regular_encode(X_test, tokenizer, maxlen=max_len)

2 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py in _batch_encode_plus(self, *args, **kwargs)
    161         )
    162 
--> 163         return super()._batch_encode_plus(*args, **kwargs)
    164 
    165     def _encode_plus(self, *args, **kwargs) -> BatchEncoding:

TypeError: _batch_encode_plus() got an unexpected keyword argument 'return_attention_masks'
Tags: python · nlp · google-colaboratory · bert-language-model · roberta-language-model
2 Answers

0 votes

Try changing the first function as follows:

def regular_encode(texts, tokenizer, maxlen=512):
    enc_di = tokenizer.batch_encode_plus(
        texts,
        return_attention_mask=False,  # note the singular spelling
        return_token_type_ids=False,
        pad_to_max_length=True,
        #padding=True,
        max_length=maxlen
    )

    return np.array(enc_di['input_ids'])

0 votes

You need to remove the `return_attention_masks` argument altogether:

def regular_encode(texts, tokenizer, maxlen=512):
    enc_di = tokenizer.batch_encode_plus(
        texts,
        return_token_type_ids=False,
        pad_to_max_length=True,
        #padding=True,
        max_length=maxlen
    )

    return np.array(enc_di['input_ids'])

Additionally, make sure the texts are passed as a list, like this:

X_train = regular_encode(X_train.text.values.tolist(), tokenizer, maxlen=max_len)
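For completeness: current transformers releases accept only the singular `return_attention_mask`, and `pad_to_max_length=True` is deprecated in favour of `padding='max_length'` together with `truncation=True`. A sketch of the helper updated for the newer API (the modern keyword names are the only changes from the question's version):

```python
import numpy as np

def regular_encode(texts, tokenizer, maxlen=512):
    """Encode a list of strings into a fixed-length array of input ids."""
    enc_di = tokenizer.batch_encode_plus(
        texts,
        return_attention_mask=True,   # singular: the plural form raises TypeError
        return_token_type_ids=False,
        padding="max_length",         # replaces the deprecated pad_to_max_length=True
        truncation=True,
        max_length=maxlen,
    )
    return np.array(enc_di["input_ids"])
```

With a real tokenizer this is called exactly as in the question, e.g. `regular_encode(X_train.text.values.tolist(), tokenizer, maxlen=max_len)` after `tokenizer = AutoTokenizer.from_pretrained('roberta-base')`.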