I am loading a language model from the Torch Hub (CamemBERT, a RoBERTa-based French language model) and using it to embed some sentences:
import torch

camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
camembert.eval()  # disable dropout (or leave in train mode to finetune)

def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract all layers' features (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    embeddings = all_layers[0]
    return embeddings
# Here we see that the shape of the embedding depends on the number of tokens in the sentence
u = embed("Bonjour, ça va ?")
u.shape # torch.Size([1, 7, 768])
v = embed("Salut, comment vas-tu ?")
v.shape # torch.Size([1, 9, 768])
Now imagine that I want to compute the cosine similarity between the vectors (tensors, in our case) `u` and `v`:
cos = torch.nn.CosineSimilarity(dim=0)
cos(u, v)  # will throw an error since the shape of `u` differs from the shape of `v`
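The mismatch can be reproduced without downloading the model, using random tensors as hypothetical stand-ins for `u` and `v` with the shapes reported above:

```python
import torch

# Hypothetical stand-ins for the CamemBERT outputs, with the shapes above.
u = torch.randn(1, 7, 768)
v = torch.randn(1, 9, 768)

cos = torch.nn.CosineSimilarity(dim=0)
try:
    cos(u, v)
except RuntimeError as err:
    # PyTorch cannot broadcast size 7 against size 9 on the token axis.
    print("shape mismatch:", err)
```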
My question is: what is the best way to always get an embedding of the same shape for a sentence, regardless of its number of tokens?
I am thinking of computing the mean on axis=1, since axis 0 and axis 2 always have the same size:
cos = torch.nn.CosineSimilarity(dim=1)  # dim is now 1
u = u.mean(dim=1)
v = v.mean(dim=1)
cos(u, v).detach().numpy().item()  # works now and gives 0.7269
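The mean-pooling step above can be sketched self-containedly with random tensors in place of the real CamemBERT features (so nothing needs to be downloaded): after averaging over the token axis, both sentences have the same fixed shape and the similarity can be computed.

```python
import torch

# Hypothetical stand-ins for the embeddings of two sentences with
# different token counts (batch, tokens, hidden_size).
u = torch.randn(1, 7, 768)
v = torch.randn(1, 9, 768)

# Mean-pooling over dim=1 removes the token axis: both become [1, 768].
u_pooled = u.mean(dim=1)
v_pooled = v.mean(dim=1)

cos = torch.nn.CosineSimilarity(dim=1)
score = cos(u_pooled, v_pooled)  # tensor of shape [1], value in [-1, 1]
print(u_pooled.shape, score.item())
```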
However, I am afraid that taking the mean harms the embeddings!
I am not an expert, but why not use only the last layer? What is your purpose in keeping all the layers?
With the last layer, the size is a constant [1, 10, 768], which should let you do some computations. I have not yet tried using it to cluster sentences.
Let me know if this helped!
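Another common fixed-size representation (a sketch, not something from the answer above) is to keep only the last layer and take the embedding of the first token, the `<s>` start token that RoBERTa-style models prepend to every sentence; its shape does not depend on the token count. The list of random tensors below is a hypothetical stand-in for the output of `extract_features(..., return_all_hiddens=True)`:

```python
import torch

# Hypothetical stand-in for `camembert.extract_features(tokens,
# return_all_hiddens=True)`: one tensor per layer, each [1, n_tokens, 768].
all_layers = [torch.randn(1, 7, 768) for _ in range(13)]

last_layer = all_layers[-1]         # [1, n_tokens, 768]
sentence_vec = last_layer[:, 0, :]  # [1, 768], independent of n_tokens
print(sentence_vec.shape)
```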