How do I get RoBERTa word embeddings?

Problem description · Votes: 0 · Answers: 2

Given a sentence like "Roberta is a heavily optimized version of BERT", I need to get the embedding for each word in that sentence using RoBERTa. I tried looking at sample code online, but couldn't find a clear answer.

My attempt is as follows:

tokens = roberta.encode(headline)
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
embedding = all_layers[0]
n = embedding.size()[1] - 1
embedding = embedding[:,1:n,:]

where

embedding[:,1:n,:]

is intended to extract only the embeddings of the words in the sentence, excluding the start and end tokens.

Is this correct?

encoding nlp word-embedding
2 Answers

0 votes
from transformers import AutoTokenizer

TOKENIZER_PATH = "../input/roberta-transformers-pytorch/roberta-base"
ROBERTA_PATH = "../input/roberta-transformers-pytorch/roberta-base"

text = "How are you? I am good."
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

## how the words are broken into tokens
print(tokenizer.tokenize(text))

## the format of an encoding
print(tokenizer.batch_encode_plus([text]))

## OP wants the input ids
print(tokenizer.batch_encode_plus([text])['input_ids'])

## OP wants the input ids without the first and last (special) tokens
print(tokenizer.batch_encode_plus([text])['input_ids'][0][1:-1])

Output:

['How', 'Ġare', 'Ġyou', '?', 'ĠI', 'Ġam', 'Ġgood', '.']

{'input_ids': [[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

[[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]]

[6179, 32, 47, 116, 38, 524, 205, 4]

And don't worry about the 'Ġ' character. It just indicates that the token is preceded by a space.
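
If you want to see how those subword tokens map back to the original text, a small sketch reusing the tokenizer and text from above is to join them back into a string; the 'Ġ' markers simply turn back into spaces:

## the 'Ġ' prefix becomes a leading space when the tokens are joined back into text
tokens = tokenizer.tokenize(text)
print(tokenizer.convert_tokens_to_string(tokens))  # "How are you? I am good."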


0 votes

To get word embeddings from RoBERTa, you can average the embeddings of the subwords that (according to the tokenizer) make up the word of interest. There are other approaches as well.

Keep in mind that RoBERTa (pre-trained with MLM) produces contextual embeddings, and that using RoBERTa without fine-tuning to produce embeddings for isolated words (without any context) may not perform well on downstream tasks.

Assuming you are not interested in the embedding predicted for the <mask> token, something like this should work:

import numpy as np
import torch

def get_hidden_states(encoded, token_ids_word, model, layers):
    # run inference without tracking gradients
    with torch.no_grad():
        output = model(**encoded)
    # get all hidden states (the model must be set up to return them)
    states = output.hidden_states
    # stack and sum the selected layers
    output = torch.stack([states[i] for i in layers]).sum(0).squeeze()
    # keep only the tokens that make up the word of interest
    word_tokens_output = output[token_ids_word]
    # average the subword embeddings into a single word embedding
    return word_tokens_output.mean(dim=0)

def get_word_vector(sent, idx, tokenizer, model, layers, device):
    # tokenize the input sentence and send it to the device
    encoded = tokenizer.encode_plus(sent, return_tensors="pt").to(device)
    # get all token indices that make up the word
    token_ids_word = np.where(np.array(encoded.word_ids()) == idx)
    # get the summed hidden states for those tokens
    return get_hidden_states(encoded, token_ids_word, model, layers)

def get_embedding(model, tokenizer, sent, word, device, layers=None):
    # use the last four layers by default
    layers = [-4, -3, -2, -1] if layers is None else layers
    # get the index of the word in the sentence
    idx = sent.split(" ").index(word)
    # get the word embedding
    word_embedding = get_word_vector(sent, idx, tokenizer, model, layers, device)
    return word_embedding
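
A minimal usage sketch (the model name "roberta-base", the example sentence, and the chosen word are just illustrative assumptions; note that the model has to be loaded with output_hidden_states=True so that output.hidden_states above is actually populated):

from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# output_hidden_states=True makes the forward pass return all hidden states
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True).to(device)
model.eval()

sent = "Roberta is a heavily optimized version of BERT"
# example: contextual embedding of the word "optimized"
emb = get_embedding(model, tokenizer, sent, "optimized", device)
print(emb.shape)  # torch.Size([768]) for roberta-base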