Embedding in PyTorch

Question (0 votes, 5 answers)

Does Embedding make similar words closer to each other? Do I just need to give it all my sentences? Or is it just a lookup table, and I need to code the model myself?

python pytorch word-embedding
5 Answers
97 votes

nn.Embedding holds a Tensor of dimension (vocab_size, vector_size), i.e. the size of the vocabulary times the dimension of each vector embedding, plus a method that performs the lookup.

When you create an embedding layer, the Tensor is initialised randomly. It is only when you train it that similarity between similar words appears. Unless you have overwritten the values of the embedding with a previously trained model, like GloVe or Word2Vec, but that's another story.

So, once you have defined the embedding layer, and defined and encoded the vocabulary (i.e. assigned a unique number to each word in the vocabulary), you can use the instance of the nn.Embedding class to get the corresponding embeddings.

For example:

import torch
from torch import nn

# lookup table for a vocabulary of 1000 words, with 128-dimensional vectors
embedding = nn.Embedding(1000, 128)
embedding(torch.LongTensor([3, 4]))

will return the embedding vectors corresponding to words 3 and 4 in your vocabulary. As no model has been trained, they will be random.
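As a quick sanity check (my own sketch, not part of the answer above), you can look at the weight tensor backing the layer to confirm the (vocab_size, vector_size) shape, and see that a lookup just reads rows out of it:

import torch
from torch import nn

embedding = nn.Embedding(1000, 128)
print(embedding.weight.shape)  # torch.Size([1000, 128])

# a lookup simply returns the corresponding (randomly initialized) rows
vecs = embedding(torch.LongTensor([3, 4]))
print(torch.equal(vecs[0], embedding.weight[3]))  # True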


48 votes

You can treat nn.Embedding as a lookup table where the key is the word index and the value is the corresponding word vector. However, before using it you should specify the size of the lookup table and initialize the word vectors yourself. The following code example demonstrates this.

import torch
import torch.nn as nn

# vocab_size is the number of words in your train, val and test set
# vector_size is the dimension of the word vectors you are using
embed = nn.Embedding(vocab_size, vector_size)

# initialize the word vectors; pretrained_weights is a
# numpy array of size (vocab_size, vector_size) and
# pretrained_weights[i] retrieves the word vector of the
# i-th word in the vocabulary
embed.weight.data.copy_(torch.from_numpy(pretrained_weights))

# Then turn word indices into actual word vectors
vocab = {"some": 0, "words": 1}
word_indexes = [vocab[w] for w in ["some", "words"]]
word_vectors = embed(torch.LongTensor(word_indexes))
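As a side note (my own addition, assuming the same hypothetical pretrained_weights array as above), nn.Embedding.from_pretrained builds the layer and copies the weights in one call; its freeze argument controls whether the vectors are updated during training:

import numpy as np
import torch
import torch.nn as nn

# hypothetical pretrained matrix standing in for real GloVe/Word2Vec vectors
pretrained_weights = np.random.rand(100, 50).astype(np.float32)

# freeze=False keeps the vectors trainable; freeze=True (the default) fixes them
embed = nn.Embedding.from_pretrained(torch.from_numpy(pretrained_weights), freeze=False)

print(embed(torch.LongTensor([0, 1])).shape)  # torch.Size([2, 50])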

28 votes

torch.nn.Embedding just creates a lookup table, to get the word embedding for a given word index.

from collections import Counter
import torch
import torch.nn as nn

# Let's say you have 2 sentences (lowercased, punctuation removed):
sentences = "i am new to pytorch i am having fun"

words = sentences.split(' ')

vocab = Counter(words)  # count word frequencies
vocab = sorted(vocab, key=vocab.get, reverse=True)
vocab_size = len(vocab)

# map words to unique indices
word2idx = {word: ind for ind, word in enumerate(vocab)}

# word2idx = {'i': 0, 'am': 1, 'new': 2, 'to': 3, 'pytorch': 4, 'having': 5, 'fun': 6}

encoded_sentences = [word2idx[word] for word in words]

# encoded_sentences = [0, 1, 2, 3, 4, 0, 1, 5, 6]

# let's say you want the embedding dimension to be 3
emb_dim = 3

Now, the embedding layer can be initialized as:

emb_layer = nn.Embedding(vocab_size, emb_dim)
word_vectors = emb_layer(torch.LongTensor(encoded_sentences))

This initializes the embeddings from a standard Normal distribution (i.e. zero mean and unit variance). Thus, these word vectors don't have any sense of "relatedness" yet.

word_vectors is a torch tensor of size (9, 3), since there are 9 words in our data.
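A quick check (my own sketch, continuing the snippet above) confirms both claims; keep in mind the statistics are noisy because there are only vocab_size * emb_dim values:

print(word_vectors.shape)              # torch.Size([9, 3])
print(emb_layer.weight.mean().item())  # roughly 0
print(emb_layer.weight.std().item())   # roughly 1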

emb_layer has one trainable parameter called weight, which is, by default, set to be trained. You can check this with:

emb_layer.weight.requires_grad

which returns True. If you don't want to train the embeddings during model training (e.g. when you are using pre-trained embeddings), you can freeze them with:

emb_layer.weight.requires_grad = False

If your vocabulary size is 10,000 and you wish to initialize the embeddings with pre-trained 300-dimensional vectors, say from Word2Vec, do it as:

emb_layer = nn.Embedding(10000, 300)
emb_layer.load_state_dict({'weight': torch.from_numpy(emb_mat)})

Here, emb_mat is a NumPy matrix of size (10,000, 300) containing the 300-dimensional Word2Vec vector for each of the 10,000 words in your vocabulary.

Now, the embedding layer is loaded with the Word2Vec word representations.
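To convince yourself that the copy worked, a lookup for any index should now match the corresponding row of the matrix (my own sketch; emb_mat here is a random placeholder standing in for real Word2Vec vectors):

import numpy as np
import torch
import torch.nn as nn

# hypothetical stand-in for a real Word2Vec matrix
emb_mat = np.random.rand(10000, 300).astype(np.float32)

emb_layer = nn.Embedding(10000, 300)
emb_layer.load_state_dict({'weight': torch.from_numpy(emb_mat)})

# the embedding of word index 42 is now exactly row 42 of the matrix
print(torch.allclose(emb_layer(torch.LongTensor([42]))[0], torch.from_numpy(emb_mat[42])))  # True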


8 votes

Agh! I think this part is still missing: showing that when you set up an embedding layer you automatically get weights, which you may later alter with nn.Embedding.from_pretrained(weight):

import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)
print(type(embedding))
print(embedding)

t1 = embedding(torch.LongTensor([0,1,2,3,4,5,6,7,8,9]))  # index 10 would be out of range (num_embeddings is 10)
print(t1.shape)
print(t1)


t2 = embedding(torch.LongTensor([1,2,3]))
print(t2.shape)
print(t2)

#predefined weights
weight = torch.FloatTensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(weight.shape)
embedding = nn.Embedding.from_pretrained(weight)
# get embeddings for ind 0 and 1
embedding(torch.LongTensor([0, 1]))

Output:

<class 'torch.nn.modules.sparse.Embedding'>
Embedding(10, 4)
torch.Size([10, 4])
tensor([[-0.7007,  0.0169, -0.9943, -0.6584],
        [-0.7390, -0.6449,  0.1481, -1.4454],
        [-0.1407, -0.1081,  0.6704, -0.9218],
        [-0.2738, -0.2832,  0.7743,  0.5836],
        [ 0.4950, -1.4879,  0.4768,  0.4148],
        [ 0.0826, -0.7024,  1.2711,  0.7964],
        [-2.0595,  2.1670, -0.1599,  2.1746],
        [-2.5193,  0.6946, -0.0624, -0.1500],
        [ 0.5307, -0.7593, -1.7844,  0.1132],
        [-0.0371, -0.5854, -1.0221,  2.3451]], grad_fn=<EmbeddingBackward>)
torch.Size([3, 4])
tensor([[-0.7390, -0.6449,  0.1481, -1.4454],
        [-0.1407, -0.1081,  0.6704, -0.9218],
        [-0.2738, -0.2832,  0.7743,  0.5836]], grad_fn=<EmbeddingBackward>)
torch.Size([2, 3])

tensor([[0.1000, 0.2000, 0.3000],
        [0.4000, 0.5000, 0.6000]])

The last part is that the embedding layer weights can be learned with gradient descent.
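To make that last point concrete, here is a minimal sketch of my own (the target and loss are arbitrary) showing one gradient-descent step changing the embedding weights:

import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

before = embedding.weight.detach().clone()

out = embedding(torch.LongTensor([1, 2, 3]))
loss = nn.functional.mse_loss(out, torch.zeros(3, 4))  # arbitrary loss
loss.backward()
optimizer.step()

# only the rows that were looked up (1, 2, 3) receive gradients and change
print(torch.equal(before, embedding.weight))  # False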


0 votes

One more trick:

import torch
import torch.nn as nn

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()



vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)


    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        return embeds

model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])
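As a quick usage note (my own sketch), the model can also be called on encoded context words, exactly like calling the bare embedding layer:

context_idxs = torch.LongTensor([word_to_ix[w] for w in ["forty", "winters"]])
print(model(context_idxs).shape)  # torch.Size([2, 10]): one EMBEDDING_DIM vector per word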