I have a self-trained word2vec model (2 GB, ending in ".model"). I converted the model to a text file (over 50 GB, ending in ".txt") because I have to use the text file in other Python code. I am trying to reduce the size of the text file by removing the words I don't need. I have already built a vocabulary of all the words I do need. How can I filter the unnecessary words out of the model?
I tried building a dictionary from the text file, but I ran out of memory:
emb_dict = dict()
with open(emb_path, "r", encoding="utf-8") as f:
    lines = f.readlines()
    for l in lines:
        word, embedding = l.strip().split(' ', 1)
        emb_dict[word] = embedding
I am also wondering whether I could remove the words from the ".model" file directly. How would I do that? Any help would be appreciated!
It's hard to answer in more detail without seeing more of your code, but you can process the text file in batches instead of loading it all at once:
lines_to_keep = []
new_file = "some_path.txt"
words_to_keep = set(some_words)
with open(emb_path, "r", encoding="utf-8") as f:
    # Iterate line by line instead of readlines(), so the 50 GB
    # file is never held in memory all at once
    for l in f:
        word, embedding = l.strip().split(' ', 1)
        if word in words_to_keep:
            lines_to_keep.append(l.strip())
        # Flush the buffer to disk every 1000 kept lines
        if len(lines_to_keep) >= 1000:
            with open(new_file, "a", encoding="utf-8") as out:
                out.write("\n".join(lines_to_keep) + "\n")
            lines_to_keep = []
# Write out whatever remains in the buffer after the loop ends
if lines_to_keep:
    with open(new_file, "a", encoding="utf-8") as out:
        out.write("\n".join(lines_to_keep) + "\n")
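For reference, here is a minimal self-contained sketch of the same streaming filter, run end to end on a tiny synthetic embedding file. The file names, the helper name filter_embeddings, and the sample words are all illustrative, not from your setup:

```python
# Sketch of the batched streaming filter, demonstrated on a tiny
# synthetic embedding file; names and paths are hypothetical.
import os
import tempfile

def filter_embeddings(emb_path, new_file, words_to_keep, batch_size=1000):
    """Copy to new_file only the lines whose first token is in words_to_keep."""
    buffer = []
    with open(emb_path, "r", encoding="utf-8") as f:
        for line in f:  # stream: one line in memory at a time
            word, _ = line.strip().split(" ", 1)
            if word in words_to_keep:
                buffer.append(line.strip())
            if len(buffer) >= batch_size:
                with open(new_file, "a", encoding="utf-8") as out:
                    out.write("\n".join(buffer) + "\n")
                buffer = []
    if buffer:  # flush the final partial batch
        with open(new_file, "a", encoding="utf-8") as out:
            out.write("\n".join(buffer) + "\n")

# Demo on a throwaway temporary file
tmpdir = tempfile.mkdtemp()
emb_path = os.path.join(tmpdir, "full.txt")
new_file = os.path.join(tmpdir, "small.txt")
with open(emb_path, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2\ndog 0.3 0.4\nzebra 0.5 0.6\n")

filter_embeddings(emb_path, new_file, {"cat", "zebra"})
with open(new_file, encoding="utf-8") as f:
    print(f.read())  # keeps only the cat and zebra lines
```

Because only one line plus the small buffer is held in memory at any moment, this approach works regardless of how large the text file is.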