I'm developing a program that computes word and sentence embeddings with GPT-2, specifically the GPT2Model class. For word embeddings, I extract the last hidden state outputs[0] after forwarding input_ids (shape batch size x seq len) through GPT2Model. For sentence embeddings, I extract the hidden state of the word at the end of the sequence. Here is the code I tried:
from transformers import GPT2Tokenizer, GPT2Model
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
captions = ["example caption", "example bird", "the bird is yellow has red wings", "hi", "very good"]
encoded_captions = [tokenizer.encode(caption) for caption in captions]
# Pad sequences to the same length with 0s
max_len = max(len(seq) for seq in encoded_captions)
padded_captions = [seq + [0] * (max_len - len(seq)) for seq in encoded_captions]
# Convert to a PyTorch tensor with batch size 5
input_ids = torch.tensor(padded_captions)
outputs = model(input_ids)
word_embedding = outputs[0].contiguous()
sentence_embedding = word_embedding[ :, -1, : ].contiguous()
I'm not sure whether my word and sentence embeddings are computed correctly. Can someone help me confirm?
Here is the modified code for computing sentence and word embeddings:
from transformers import GPT2Tokenizer, GPT2Model
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2Model.from_pretrained('gpt2')
captions = [
"example caption",
"example bird",
"the bird is yellow has red wings",
"hi",
"very good"
]
# Tokenize and pad sequences
encoded_captions = tokenizer(
captions,
return_tensors='pt',
padding=True,
truncation=True
)
input_ids = encoded_captions['input_ids']
# Forward pass to get embeddings
with torch.no_grad():
outputs = model(input_ids)
# Extract embeddings
word_embeddings = outputs.last_hidden_state
# Mask to ignore padding tokens
masked_word_embeddings = word_embeddings * encoded_captions.attention_mask.unsqueeze(-1).float()
# Sum pooling considering only non-padding tokens
sentence_embeddings = masked_word_embeddings.sum(dim=1)
# Normalize by the count of non-padding tokens
sentence_embeddings /= encoded_captions.attention_mask.sum(dim=1, keepdim=True).float()
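The masked-mean pooling above can be sanity-checked on a toy tensor without loading the model (the values below are hypothetical, not real GPT-2 outputs): positions zeroed by the attention mask must not shift the average.

```python
import torch

# Toy batch: 2 sequences, max length 3, hidden size 2 (hypothetical values)
word_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # last position is padding
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],  # no padding
])
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

# Zero out padded positions, then divide the sum by the real token count
masked = word_embeddings * attention_mask.unsqueeze(-1).float()
sentence_embeddings = masked.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True).float()

print(sentence_embeddings)
# First row averages only the two real tokens: [2.0, 3.0]
```

The padded `[9.0, 9.0]` vector is zeroed by the mask and the divisor counts only real tokens, so the first sentence embedding is the mean of its two genuine positions.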
Some relevant facts:
word_embeddings.shape
>> torch.Size([5, 7, 768])
This means some sentences have embeddings at positions where no token actually exists, so we need to mask the output so that only real tokens are taken into account.
print(masked_word_embeddings)
>> tensor([[[-0.2835, -0.0469, -0.5029, ..., -0.0525, -0.0089, -0.1395],
[-0.2636, -0.1355, -0.4277, ..., -0.3552, 0.0437, -0.2479],
[ 0.0000, -0.0000, 0.0000, ..., 0.0000, -0.0000, -0.0000],
...,
[ 0.0000, -0.0000, 0.0000, ..., 0.0000, -0.0000, -0.0000],
[ 0.0000, -0.0000, 0.0000, ..., 0.0000, -0.0000, -0.0000],
[ 0.0000, -0.0000, 0.0000, ..., 0.0000, -0.0000, -0.0000]],
...
# Alternative pooling strategies:
sentence_embeddings = masked_word_embeddings.mean(dim=1)        # mean over all positions (padded zeros included in the divisor)
sentence_embeddings = masked_word_embeddings.max(dim=1).values  # max pooling; .max(dim=1) returns (values, indices)
There are many techniques, and which one is best depends on how the embeddings are used in your task. I would pick the method that maximizes the cosine similarity between vectors I expect to be similar for my task. For example, if sum pooling makes similar captions score more alike than mean pooling does, sum is probably the better fit.
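A minimal way to compare pooling choices is to score pairs of embeddings with cosine similarity and see whether pairs you expect to be similar come out higher. The snippet below uses toy vectors (not real GPT-2 outputs) just to show the torch call:

```python
import torch
import torch.nn.functional as F

# Hypothetical sentence embeddings; emb_a and emb_b stand for captions
# we expect to be similar, emb_c for an unrelated one
emb_a = torch.tensor([1.0, 0.0, 1.0])
emb_b = torch.tensor([1.0, 0.1, 0.9])
emb_c = torch.tensor([-1.0, 0.5, 0.0])

# Cosine similarity in [-1, 1]; higher means the vectors point the same way
sim_related = F.cosine_similarity(emb_a, emb_b, dim=0)
sim_unrelated = F.cosine_similarity(emb_a, emb_c, dim=0)
print(sim_related.item(), sim_unrelated.item())
```

In practice you would replace the toy vectors with `sentence_embeddings` rows produced by each pooling strategy and keep the strategy whose scores best match your intuition about which captions belong together.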