PyTorch: Loading word vectors into the Field vocabulary vs. the Embedding layer

Question

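I'm coming from Keras to PyTorch. I want to create a PyTorch Embedding layer (a matrix of size V x D, where V ranges over the vocabulary word indices and D is the embedding vector dimension) initialized with GloVe vectors, but I am confused about the steps needed.

In Keras, you can load the GloVe vectors by passing them to the Embedding layer constructor through the weights argument: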

# Keras code.
embedding_layer = Embedding(..., weights=[embedding_matrix])

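When looking at PyTorch and the TorchText library, I see that the embeddings seem to be loaded twice: once in a Field and then again in an Embedding layer. Here is sample code that I found: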

# PyTorch code.
# Create a field for text and build a vocabulary with 'glove.6B.100d'
# pretrained embeddings.
TEXT = data.Field(tokenize='spacy', include_lengths=True)

TEXT.build_vocab(train_data, vectors='glove.6B.100d')

# Build an RNN model with an Embedding layer.
class RNN(nn.Module):
    def __init__(self, ...):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        ...

# Initialize the embedding layer with the Glove embeddings from the
# vocabulary. Why are two steps needed???
model = RNN(...)
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

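Specifically:
Why are the GloVe embeddings loaded in a Field in addition to the Embedding layer?
I thought the Field function build_vocab() just builds its vocabulary from the training data. How are the GloVe embeddings involved in this step?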
Here are other StackOverflow questions that did not answer my question:
PyTorch / Gensim - How to load pre-trained word embeddings
Embedding in pytorch
PyTorch LSTM - using word embeddings instead of nn.Embedding()
Thanks for any help.

Answer 1

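When torchtext builds the vocabulary, it aligns the token indices with the embeddings. If your vocabulary does not have the same size and ordering as the pre-trained embeddings, there is no guarantee that the indices match, so you might look up the wrong embeddings. build_vocab() creates the vocabulary for your dataset together with the corresponding embeddings, and discards the rest of the embeddings because they are unused.

The GloVe-6B embeddings come with a vocabulary of 400K words. The IMDB dataset, for example, uses only about 120K of them; the other 280K are unused.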

import torch
from torchtext import data, datasets, vocab

TEXT = data.Field(tokenize='spacy', include_lengths=True)
LABEL = data.LabelField()

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train_data, vectors='glove.6B.100d')

TEXT.vocab.vectors.size()
# => torch.Size([121417, 100])

# For comparison the full GloVe
glove = vocab.GloVe(name="6B", dim=100)
glove.vectors.size()
# => torch.Size([400000, 100])

# Embedding of the first token is not the same
torch.equal(TEXT.vocab.vectors[0], glove.vectors[0])
# => False

# Index of the word "the"
TEXT.vocab.stoi["the"]
# => 2
glove.stoi["the"]
# => 0

# Same embedding when using the respective index of the same word
torch.equal(TEXT.vocab.vectors[2], glove.vectors[0])
# => True
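The index alignment described above can be reproduced on toy data, without torchtext or a GloVe download. This is only a minimal sketch: the four-word "pretrained" table, the dataset vocabulary, and all variable names are made up for illustration, and it is not meant to mirror how build_vocab() is implemented internally.

```python
import torch

# Hypothetical stand-in for a pretrained table such as GloVe: 4 words, 3 dims.
pretrained_itos = ["the", "cat", "sat", "mat"]
pretrained_stoi = {word: i for i, word in enumerate(pretrained_itos)}
pretrained_vectors = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# Hypothetical dataset vocabulary: special tokens first, then only the
# words that actually occur in the data, so the ordering differs.
dataset_itos = ["<unk>", "<pad>", "the", "mat"]
dataset_stoi = {word: i for i, word in enumerate(dataset_itos)}

# Row i of the aligned matrix holds the pretrained vector of dataset word i.
# Words without a pretrained vector (here the special tokens) stay at zero.
aligned = torch.zeros(len(dataset_itos), pretrained_vectors.size(1))
for word, idx in dataset_stoi.items():
    if word in pretrained_stoi:
        aligned[idx] = pretrained_vectors[pretrained_stoi[word]]

# Looking up the same word through each table's own index gives the same vector.
same_word = torch.equal(
    aligned[dataset_stoi["the"]],
    pretrained_vectors[pretrained_stoi["the"]],
)
# Reusing the dataset index on the pretrained table picks a different word.
same_index = torch.equal(aligned[2], pretrained_vectors[2])

print(same_word, same_index)
```

The unused pretrained rows ("cat" and "sat" here) never make it into the aligned matrix, which corresponds to the discarded 280K GloVe entries in the IMDB case.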

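After the vocabulary has been built with its embeddings, the input sequences are given in tokenized form, where each token is represented by its index. In the model you want to use the embeddings of these tokens, so you need to create the embedding layer, but with the embeddings of your vocabulary. The easiest and recommended way is nn.Embedding.from_pretrained, which is essentially the same as the Keras version.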

embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors)

# Or if you want to make it trainable
trainable_embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)

You didn't mention how the embedding_matrix is created in the Keras version, nor how the vocabulary is built so that it can be used with the embedding_matrix. If you do that by hand (or with any other utility), you don't need torchtext at all, and you can initialize the embeddings just like in Keras. torchtext is purely a convenience for common data-related tasks.
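To make the relation between the question's two-step copy_() approach and nn.Embedding.from_pretrained concrete, here is a small sketch on a random tensor. The tensor `vectors` is an assumed stand-in for TEXT.vocab.vectors (or a hand-built embedding matrix); the sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Assumed stand-in for TEXT.vocab.vectors: 5 tokens, 4 dimensions.
vectors = torch.randn(5, 4)

# Two-step version from the question: create the layer, then copy weights in.
two_step = nn.Embedding(5, 4)
two_step.weight.data.copy_(vectors)

# One-step version: build the layer directly from the matrix.
one_step = nn.Embedding.from_pretrained(vectors)

# Both layers return the same vectors for the same token indices.
indices = torch.tensor([0, 3, 3])
same_output = torch.equal(two_step(indices), one_step(indices))

# The difference is trainability: from_pretrained freezes the weights by
# default, while the copy_() version leaves them trainable.
print(same_output, two_step.weight.requires_grad, one_step.weight.requires_grad)
```

Passing freeze=False to from_pretrained gives a trainable layer, matching the behavior of the copy_() version.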
