How to remove words from a self-trained word2vec model


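I have a self-trained word2vec model (2 GB, ending in ".model"). I converted the model to a text file (over 50 GB, ending in ".txt") because I have to use the text file in other Python code. I am trying to reduce the size of the text file by removing the words I do not need. I have built a vocabulary containing all the words I need. How can I filter the unnecessary words out of the model?

I tried building a dictionary from the text file, but I ran out of memory.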

emb_dict = dict()
with open(emb_path, "r", encoding="utf-8") as f:
    lines = f.readlines()
    for l in lines:
        word, embedding = l.strip().split(' ',1)
        emb_dict[word] = embedding

I am wondering whether I can delete the words from the ".model" file instead. How would I do that? Any help would be appreciated!

python word2vec
1 Answer

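Without more precise code it is hard to go further, but you can read the text file line by line and write out the lines you want to keep in batches: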

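lines_to_keep = []
new_file = "some_path.txt"
words_to_keep = set(some_words)  # a set makes each membership test O(1)
# Use separate names for the two file handles so the output file
# does not shadow the input file being iterated.
with open(emb_path, "r", encoding="utf-8") as f, \
        open(new_file, "a", encoding="utf-8") as out:
    for l in f:
        # The first token is the word, the rest of the line is its vector.
        word, embedding = l.strip().split(' ', 1)
        if word in words_to_keep:
            lines_to_keep.append(l.strip())
        # Flush every 1000 kept lines so memory use stays bounded.
        if len(lines_to_keep) >= 1000:
            out.write("\n".join(lines_to_keep) + "\n")
            lines_to_keep = []
    # Write the last, partial batch.
    if lines_to_keep:
        out.write("\n".join(lines_to_keep) + "\n")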
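
If the text file was exported in the usual word2vec text format, its first line is probably a header of the form "<vocab_size> <vector_size>", and that count has to change once words are dropped. The question does not say whether the file has such a header, so the sketch below treats it as an assumption controlled by a flag. The names `filter_embeddings` and `has_header` are made up for illustration and are not part of any library. Kept lines are streamed to a temporary file first, so the 50 GB input is never held in memory and the header can be written once the final count is known.

```python
import os
import shutil
import tempfile


def filter_embeddings(src_path, dst_path, words_to_keep, has_header=False):
    """Copy only the lines of src_path whose first token is in words_to_keep.

    Hypothetical helper. Returns the number of vectors written.
    """
    keep = set(words_to_keep)
    kept = 0
    dim = None
    out_dir = os.path.dirname(os.path.abspath(dst_path))
    with open(src_path, "r", encoding="utf-8") as src, \
            tempfile.NamedTemporaryFile("w", encoding="utf-8",
                                        delete=False, dir=out_dir) as body:
        tmp_path = body.name
        if has_header:
            # Assumed header layout: "<vocab_size> <vector_size>"
            dim = src.readline().split()[1]
        for line in src:
            # Split off the word only; the vector is never parsed.
            word, sep, _ = line.partition(" ")
            if sep and word in keep:
                body.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    with open(dst_path, "w", encoding="utf-8") as dst, \
            open(tmp_path, "r", encoding="utf-8") as body:
        if has_header:
            # Rewrite the header with the reduced vocabulary size.
            dst.write(f"{kept} {dim}\n")
        shutil.copyfileobj(body, dst)
    os.remove(tmp_path)
    return kept
```

Calling `filter_embeddings(emb_path, "some_path.txt", some_words)` would then produce a file small enough to load into a dictionary the way the question attempted. Lines without a space (for example blank lines) are skipped rather than raising an error.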