Given a sentence like "Roberta is a heavily optimized version of BERT", I need to get the embedding of each word in that sentence using RoBERTa. I tried to find example code online, but couldn't find a clear answer.
My take is the following:
import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')  # pretrained model from the fairseq hub
headline = "Roberta is a heavily optimized version of BERT"
tokens = roberta.encode(headline)  # token ids, including <s> and </s>
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
embedding = all_layers[0]
n = embedding.size()[1] - 1
embedding = embedding[:, 1:n, :]
where
embedding[:, 1:n, :]
is meant to extract only the embeddings of the words in the sentence, without the start and end tokens.
Is this correct?
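To sanity-check the slicing, I also tried decoding the tokens and inspecting the first and last ids (a rough sketch, assuming the fairseq hub model loaded above; 0 and 2 are the ids fairseq uses for <s> and </s>):

print(roberta.decode(tokens))               # recovers the original headline
print(tokens[0].item(), tokens[-1].item())  # 0 2, i.e. <s> and </s>
print(embedding.shape)                      # (1, len(tokens) - 2, 768)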
from transformers import AutoTokenizer

TOKENIZER_PATH = "../input/roberta-transformers-pytorch/roberta-base"
ROBERTA_PATH = "../input/roberta-transformers-pytorch/roberta-base"
text = "How are you? I am good."
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)
## how the words are broken into tokens
print(tokenizer.tokenize(text))
## the format of an encoding
print(tokenizer.batch_encode_plus([text]))
## OP wants the input ids
print(tokenizer.batch_encode_plus([text])['input_ids'])
## OP wants the input ids without the first and last tokens
print(tokenizer.batch_encode_plus([text])['input_ids'][0][1:-1])
Output:
['How', 'Ġare', 'Ġyou', '?', 'ĠI', 'Ġam', 'Ġgood', '.']
{'input_ids': [[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
[[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]]
[6179, 32, 47, 116, 38, 524, 205, 4]
And don't worry about the "Ġ" character; it just indicates that the token is preceded by a space.
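As a quick check, decoding those ids (with the special tokens stripped) recovers the original text; a small sketch using the tokenizer loaded above:

print(tokenizer.decode([6179, 32, 47, 116, 38, 524, 205, 4]))
## How are you? I am good.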
To get word embeddings from RoBERTa, you can average the embeddings of the subwords that (according to the tokenizer) make up the word of interest. Other approaches exist as well.
Keep in mind that RoBERTa (pretrained with MLM) produces contextual embeddings, and using RoBERTa without fine-tuning to produce embeddings for a single word (without any context) may not perform well on downstream tasks.
Assuming you are not interested in embedding predicted
<mask>
tokens, something like this should work:
import numpy as np
import torch

def get_hidden_states(encoded, token_ids_word, model, layers):
    # inference without gradient tracking
    with torch.no_grad():
        output = model(**encoded)
    # get all hidden states (the model must be loaded with output_hidden_states=True)
    states = output.hidden_states
    # stack and sum the chosen layers
    output = torch.stack([states[i] for i in layers]).sum(0).squeeze()
    # subset for the tokens that make up the word
    word_tokens_output = output[token_ids_word]
    # average the subword vectors into a single word vector
    return word_tokens_output.mean(dim=0)

def get_word_vector(sent, idx, tokenizer, model, layers, device):
    # tokenize the input sentence and send the tensors to the device
    encoded = tokenizer.encode_plus(sent, return_tensors="pt").to(device)
    # get all token idxs that make up the word
    token_ids_word = np.where(np.array(encoded.word_ids()) == idx)
    # get all hidden states
    return get_hidden_states(encoded, token_ids_word, model, layers)

def get_embedding(model, tokenizer, sent, word, device, layers=None):
    # use the last four layers by default
    layers = [-4, -3, -2, -1] if layers is None else layers
    # get the idx of the word in the whitespace-split sentence
    idx = sent.split(" ").index(word)
    # get the word embedding
    word_embedding = get_word_vector(sent, idx, tokenizer, model, layers, device)
    return word_embedding
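For completeness, a minimal usage sketch; the model name and example sentence here are placeholders, and note that the model has to be loaded with output_hidden_states=True so that output.hidden_states above is populated:

import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True).to(device)
model.eval()

# note the space before the final period, so sent.split(" ") lines up with word_ids()
sent = "Roberta is a heavily optimized version of BERT ."
embedding = get_embedding(model, tokenizer, sent, "optimized", device)
print(embedding.shape)  ## torch.Size([768]) for roberta-base

Since the embeddings are contextual, calling get_embedding with the same word in two different sentences will generally return two different vectors.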