NLP Transformers: best way to get a fixed sentence embedding-vector shape?

Question

I'm loading a language model from torch hub (CamemBERT, a French RoBERTa-based model) and using it to embed some sentences:

import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
camembert.eval()  # disable dropout (or leave in train mode to finetune)


def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract the features of all layers (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    embeddings = all_layers[0]
    return embeddings

# Here we see that the shape of the embedding vector depends on the number of tokens in the sentence

u = embed("Bonjour, ça va ?")
u.shape # torch.Size([1, 7, 768])
v = embed("Salut, comment vas-tu ?")
v.shape # torch.Size([1, 9, 768])

Now imagine that, in order to do some semantic search, I want to compute the cosine distance between the vectors (tensors in our case) u and v:

cos = torch.nn.CosineSimilarity(dim=1)
cos(u, v)  # raises an error because the shape of `u` differs from the shape of `v`

My question is: what is the best way to always get the same embedding shape for a sentence, regardless of its number of tokens?

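=> The first solution I thought of is to compute the mean on axis=1 (the embedding of a sentence is the average of the embeddings of its tokens), since axis=0 and axis=2 always have the same size: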

cos = torch.nn.CosineSimilarity(dim=1)

cos(u.mean(axis=1), v.mean(axis=1))  # works now and gives 0.7269

However, I'm afraid that computing the mean hurts the sentence embedding, since it gives the same weight to every token (maybe multiply by TF-IDF?).
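
The weighting idea raised above can be sketched as a weighted mean over the token axis. This is only an illustration under stated assumptions: `weighted_mean_pool` is a hypothetical helper, the tensors are random stand-ins with the same shapes as the outputs of `embed`, and the weight values are made up (in practice they might come from IDF scores looked up per token).

```python
import torch


def weighted_mean_pool(token_embeddings, weights):
    """Collapse a (1, n_tokens, dim) tensor to (1, dim) using per-token weights."""
    w = torch.as_tensor(weights, dtype=token_embeddings.dtype)
    w = w / w.sum()  # normalise so the weights sum to 1
    # Broadcast the weights over the feature axis, then sum over the token axis
    return (token_embeddings * w.view(1, -1, 1)).sum(dim=1)


# Random stand-ins for embed(...) outputs, same shapes as in the question
torch.manual_seed(0)
u = torch.randn(1, 7, 768)
v = torch.randn(1, 9, 768)

# Uniform weights reduce to the plain mean; the second list is made up
u_vec = weighted_mean_pool(u, [1.0] * 7)
v_vec = weighted_mean_pool(v, [0.5, 2.0, 1.0, 1.0, 3.0, 1.0, 0.5, 1.0, 1.0])

cos = torch.nn.CosineSimilarity(dim=1)
similarity = cos(u_vec, v_vec)  # one value per sentence pair
```

Both pooled vectors have shape (1, 768) whatever the token count, so the cosine similarity is always defined. Whether non-uniform weights actually improve search quality would have to be checked on real data.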

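=> The second solution is to pad the shorter sentences, which means:

- Give a list of sentences to embed at once (instead of embedding sentence by sentence)
- Find the sentence with the most tokens and embed it, obtaining its shape S
- For the remaining sentences, pad them with zeros to get the same shape S (the sentence has 0 in the remaining dimensions)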
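
The padding steps above could look roughly like the sketch below. It assumes each sentence has already been embedded and squeezed to a (n_tokens, dim) tensor; `pad_and_pool` is a hypothetical helper and the inputs are random stand-ins. It also returns a mask, because a plain mean over a zero-padded tensor would average in the padding positions and shrink the vectors of shorter sentences.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_and_pool(embeddings):
    """embeddings: list of (n_tokens_i, dim) tensors.

    Returns the zero-padded batch, a boolean mask of real tokens,
    and the masked mean of each sentence.
    """
    lengths = torch.tensor([e.shape[0] for e in embeddings])
    # (batch, max_len, dim); shorter sentences are filled with zeros
    padded = pad_sequence(embeddings, batch_first=True)
    # mask[i, j] is True where position j holds a real token of sentence i
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]
    summed = (padded * mask.unsqueeze(-1)).sum(dim=1)
    pooled = summed / lengths.unsqueeze(-1)  # divide by the true token count
    return padded, mask, pooled


# Random stand-ins for embed(...).squeeze(0) on the two example sentences
torch.manual_seed(0)
u = torch.randn(7, 768)
v = torch.randn(9, 768)

padded, mask, pooled = pad_and_pool([u, v])
```

Here `padded` has the common shape (2, 9, 768) asked for in the list, while `pooled` has shape (2, 768). Comparing the padded tensors directly would make the similarity depend on token positions, which is why the sketch pools with the mask as well.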

What do you think? What other techniques would you use, and why?


Answer 1

This is a very general question, because there is no single correct answer.

Answer 2

Bert-as-service is a good example of something that does exactly what you are asking for.
