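I got this error: "KeyError: word 'restriction' not in vocabulary" when reading a text file to generate word embeddings, even though the word "restriction" is in the text file. Is there something wrong with my code for reading the text file (a simple paragraph)?
My code is below.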
from gensim.models import Word2Vec
# define training data
with open('D:\\test.txt', 'r') as file:
    sentences = ""
    # read from text file
    for line in file:
        for word in line.split(' '):
            sentences += word + ' '
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.vocab)
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
print(str(model['restriction']))
The error does not occur when I use hard-coded sentences, as in the code below.
from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
['this', 'is', 'the', 'second', 'sentence'],
['yet', 'another', 'sentence'],
['one', 'more', 'sentence', 'with', 'restriction'],
['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.vocab)
print(words)
# access vector for one word
print(model['sentence'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
print('the model prints: ')
print(model['restriction'])
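In the code that shows the problem, look closely at sentences after you've built it, and check whether it is in the format you expect (or anything like the sentences in the working case). I suspect it isn't.
Also, look at the list of words the disappointing model actually learned; your words variable should be enough for that. It is probably not what you expect either.
Specifically, this part of your code...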
sentences = ""
for line in file:
    for word in line.split(' '):
        sentences += word + ' '
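...makes sentences one long string containing many space-separated words. If you did the same thing to the sentences in your working code, you would no longer have a list in which each item is a list of tokens (which is the format Word2Vec expects). Instead, you would have one giant run-on string: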
sentences = 'this is the first sentence for word2vec this is the second sentence yet another sentence one more sentence with restriction and the final sentence'
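To see why a single string breaks the vocabulary: the corpus is iterated item by item, each item is treated as a sentence, and each element of that item is treated as a token. Iterating over a plain string yields single characters, so the "words" that get seen are letters and spaces. The sketch below imitates that iteration in plain Python; vocab_seen is a hypothetical helper written for illustration and does not call gensim.

```python
def vocab_seen(corpus):
    """Collect the tokens a sentence-by-sentence iteration would encounter.

    Hypothetical helper: it only imitates iterating a corpus as
    sentences -> tokens; it does not call gensim.
    """
    vocab = set()
    for sentence in corpus:      # each item is treated as one sentence
        for token in sentence:   # each element of it is treated as one token
            vocab.add(token)
    return vocab


as_string = "one more sentence with restriction "
as_lists = [["one", "more", "sentence", "with", "restriction"]]

string_vocab = vocab_seen(as_string)
list_vocab = vocab_seen(as_lists)

print("restriction" in string_vocab)  # False: only single characters were seen
print("restriction" in list_vocab)    # True
```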
Try this instead:
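sentences = []  # empty list
# OOPS, DON'T DO: sentences = ""
for line in file:
    # split() with no argument also drops the trailing newline,
    # so the last word of a line is not stored as 'restriction\n'
    sentences.append(line.split())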
...then your sentences will be a list of lists of strings (as in the working case), rather than just one string (as in the broken case).
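For completeness, here is a minimal self-contained sketch of reading a file into that list-of-lists shape. The function name read_sentences is hypothetical, and a temporary file stands in for the real D:\test.txt so the snippet can run anywhere. It uses split() with no argument, because split(' ') would leave the newline attached to the last word of each line and that word would then still be missing from the vocabulary.

```python
import os
import tempfile


def read_sentences(path):
    """Return one list of tokens per non-empty line of the file."""
    sentences = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            tokens = line.split()  # splits on any whitespace, drops '\n'
            if tokens:             # skip blank lines
                sentences.append(tokens)
    return sentences


# Temporary file standing in for the real text file (assumption for the demo).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("one more sentence with restriction\n\nand the final sentence\n")
    tmp_path = tmp.name

sentences = read_sentences(tmp_path)
os.remove(tmp_path)

print(sentences)

# For comparison: split(' ') keeps the newline on the last token.
print("with restriction\n".split(' '))
```

The resulting list can be passed to Word2Vec(sentences, min_count=1) exactly as in the working example. Note that in gensim 4.0 and later, vectors are looked up with model.wv['restriction'] and the vocabulary is available as model.wv.key_to_index; the model['restriction'] and model.wv.vocab forms shown above belong to older releases.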