How do I save a fastText model in vec format?

Question

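I trained an unsupervised model with the fasttext.train_unsupervised() function in Python. I want to save it as a vec file, because I will pass that file as the pretrainedVectors parameter of the fasttext.train_supervised() function. pretrainedVectors only accepts a vec file, but I am having trouble creating one. Can someone help me?

P.S. I can save the model in bin format. A way to convert a bin file to a vec file would also be helpful.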

Answer 1

To obtain a VEC file that contains only the word vectors, I took inspiration from the official bin_to_vec example.

from fasttext import load_model

# Load the original BIN model (replace the placeholder with your own path)
f = load_model("YOUR-BIN-MODEL-PATH")

# Get all words from the model
words = f.get_words()

# Write as UTF-8 so that non-ASCII words do not raise an encoding error
with open("YOUR-VEC-FILE-PATH", "w", encoding="utf-8") as file_out:

    # The first line must contain the total number of words and the vector dimension
    file_out.write(str(len(words)) + " " + str(f.get_dimension()) + "\n")

    # Append one line per word: the word followed by its vector components
    for w in words:
        v = f.get_word_vector(w)
        vstr = ""
        for vi in v:
            vstr += " " + str(vi)
        file_out.write(w + vstr + "\n")

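The resulting VEC file can be large. To reduce its size, you can adjust how the vector components are formatted.

If you only want to keep 4 decimal places, replace

vstr += " " + str(vi)

with

vstr += " " + "{:.4f}".format(vi)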
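
As a rough illustration of the size trade-off, the sketch below formats a single VEC line with a configurable number of decimal places. The helper name format_vec_line and the sample vector are hypothetical and are not part of fastText.

```python
def format_vec_line(word, vector, decimals=None):
    """Return one VEC-file line: the word followed by its components.

    decimals=None keeps Python's default float formatting; an integer
    rounds each component to that many decimal places.
    """
    if decimals is None:
        parts = [str(x) for x in vector]
    else:
        parts = ["{:.{}f}".format(x, decimals) for x in vector]
    return word + " " + " ".join(parts) + "\n"


# Hypothetical sample data, only for illustration
sample = [0.123456789, -0.987654321, 0.5]
full = format_vec_line("hello", sample)
short = format_vec_line("hello", sample, decimals=4)

print(short, end="")
print(len(full), len(short))
```

Rounding loses precision, so it is worth checking that downstream results are still acceptable before settling on a number of decimals.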

Answer 2

You should put the number of words and the vector dimension on the first line of your vec file, and then use the -pretrainedVectors parameter.
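
A minimal sketch of how that header could be checked before training is shown below. The helper names read_vec_header and check_vec_file, and the tiny demo file, are assumptions made for illustration; they are not fastText functions.

```python
import os
import tempfile


def read_vec_header(path):
    """Return (word_count, dimension) from the first line of a VEC file."""
    with open(path, encoding="utf-8") as f:
        header = f.readline().split()
    if len(header) != 2:
        raise ValueError("first line must be '<word count> <dimension>'")
    return int(header[0]), int(header[1])


def check_vec_file(path):
    """Check that the header matches the number and width of the vector lines."""
    count, dim = read_vec_header(path)
    rows = 0
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            # Each line holds one word followed by `dim` components
            if len(fields) - 1 != dim:
                return False
            rows += 1
    return rows == count


# Tiny hand-written file, only for illustration
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vec")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("2 3\n")
        f.write("hello 0.1 0.2 0.3\n")
        f.write("world 0.4 0.5 0.6\n")
    header = read_vec_header(demo_path)
    is_valid = check_vec_file(demo_path)

print(header, is_valid)
```

The dimension read from the header is also the value to pass as dim to fasttext.train_supervised(), since fastText expects dim to match the dimension of the pretrained vectors.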

Answer 3

You can also try the gensim library to generate fastText embeddings. A gensim model has the wv.save_word2vec_format function, which can write a .vec file directly.

from gensim.models import FastText

# data.txt contains one sentence per line
with open("data.txt", "r", encoding="utf-8") as f:
    sentences = f.readlines()

# Apply your desired tokenisation method to the sentences;
# whitespace splitting is used here only as a simple stand-in
tokenized_sentences = [sentence.split() for sentence in sentences]

model = FastText(vector_size=300, window=5, min_count=1, sentences=tokenized_sentences, epochs=10)

# Save the vectors in a .vec file
model.wv.save_word2vec_format("embeddings.vec")
