Gensim: error when saving word vectors in txt format

Problem description (votes: 0, answers: 1)

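My problem is as follows. I have some pretrained vectors saved in txt format, which I load into a dict. But when I try to save them after training them again in gensim, I get the following error: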

UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

I create the gensim word2vec model with this code:

w2vObject = gensim.models.Word2Vec(min_count=1, sample=threshold, sg=1,size=dimension, negative=15, iter=epochsNum, window=3) # create only the shell

print('Starting vocab build')
# t = time()
w2vObject.build_vocab(sentences, progress_per=10000) #here is the vocab being built as told in google groups gensim

print(w2vObject.wv['the'], 'before train')

Then I load the vectors that will replace the current untrained ones:

f = codecs.open(f'../../../WordNetGraphHD/StorageEmbeddings/EmbeddingFormat{dimension}.txt')##os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
embeddings_index = {}
for num, line in enumerate(f):
    values = my_split(line) # line.split('\t')
    word = values[0].rstrip()
    # vector = ''.join(num for num in values[1:])
    vector = values[1]
    if len(vector) != 300:
        print(line, 'here not 300')

    else:
        coefs = np.asarray(vector)

f.close()

This code replaces the untrained random vectors with my own pretrained ones:

i = 0
for elem in w2vObject.wv.vocab:
    if elem in embeddings_index.keys():
        w2vObject.wv[elem] = embeddings_index[elem]
        i += 1
        print('Found one', i)

print(i)

Next I train them again with gensim:

w2vObject.train(sentences, total_examples=w2vObject.corpus_count, epochs=epochsNum)#w2vObject.iter)

Finally I save them:

print(w2vObject.wv, 'after train')
w2vObject.wv.save_word2vec_format('./GensimOneWNet.txt', binary=False)
print('saved')
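
Before writing the text file, it may help to confirm that the vector matrix is still numeric. A minimal sketch, assuming the matrix is reachable as `w2vObject.wv.vectors` (the stack trace further down shows `self.vectors` being passed to the writer); `assert_float_matrix` is a hypothetical helper, not part of gensim:

```python
import numpy as np

def assert_float_matrix(matrix, name="vectors"):
    """Raise TypeError if `matrix` is not a floating-point NumPy array."""
    arr = np.asarray(matrix)
    if not np.issubdtype(arr.dtype, np.floating):
        raise TypeError(f"{name} has dtype {arr.dtype}, expected a float dtype")
    return arr

# Hypothetical usage right before saving (attribute name assumed from the trace):
# assert_float_matrix(w2vObject.wv.vectors)
```

If the check fails, the matrix was turned into a non-float array somewhere between `build_vocab()` and the save call.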

If I don't replace the vectors with my own, saving works. But I need to replace them and save the result as txt. Any help?

Edit:

Here is the my_split() function:

def my_split(s):
    return list(re.split("-?\d+.?\d*(?:[Ee]-\d+)?", s))[0] ,list(re.findall("-?\d+.?\d*(?:[Ee]-\d+)?", s))
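
One thing to note in `my_split()`: the `.` in the pattern is unescaped, so it matches any character, and the exponent part only allows a minus sign. A hedged variant, assuming the number format shown in the sample data; `my_split_strict` is a hypothetical name, and like the original it would still misparse words that themselves contain digits:

```python
import re

# Escaped decimal point, optional fraction, exponent with an optional sign.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?")

def my_split_strict(s):
    """Return (text before the first number, list of number strings)."""
    numbers = NUMBER_RE.findall(s)
    first = NUMBER_RE.search(s)
    word = s[:first.start()] if first else s
    return word, numbers
```

The returned numbers are still strings, so they need a float conversion before they go into a model.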

Here is some of the embedding_index data (300 dimensions):

'hood -0.013093032778433955 -0.004199660490964164 -0.013285915004532987 0.004154925177649314 -0.004331536946207293 -0.013220217973950956 -0.004774150107654365 0.004774714449991327 0.0040749706101727646...
's gravenhage 0.01400977963089465 -0.0047073654478706935 -0.004326147699308312 0.01323622314514233 -0.004702524319745591 0.004695915697719624 0.00497792763673179 -0.004391661500805715 0.0046651111592470...
'tween 0.008467020793348493 -0.008027116343722267 0.007882368315816719 0.00754852526967863 0.008563484027417608 0.00812782576892597 0.008192394872536986 0.0075759585496093206...

Code added here: Python code runs fine without my vectors, crashes with them

To populate embedding_index, I go through every word and vector in the txt file and skip any vector that for some reason does not have 300 dimensions:

f = codecs.open(f'../../../WordNetGraphHD/StorageEmbeddings/EmbeddingFormat{dimension}.txt', encoding='utf-8')##os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
embeddings_index = {}
for num, line in enumerate(f):
    values = my_split(line) # line.split('\t')
    word = values[0].rstrip()
    vector = values[1]
    if len(vector) != 300:
        print(line, 'here not 300')
    else:
        coefs = np.asarray(vector)
        embeddings_index[word] = coefs

f.close()
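
Since the parsed values are strings, converting them to floats at load time avoids carrying text into the model. A sketch under the assumption that each line is a word (possibly containing spaces, like `'s gravenhage`) followed by exactly `dim` whitespace-separated numbers, with `dim` at least 1; `parse_embedding_line` is a hypothetical helper:

```python
import numpy as np

def parse_embedding_line(line, dim):
    """Split a line into (word, float32 vector), or return None if malformed."""
    tokens = line.split()
    if len(tokens) <= dim:
        return None  # not enough tokens for a word plus `dim` numbers
    word = " ".join(tokens[:-dim])
    try:
        vector = np.asarray(tokens[-dim:], dtype=np.float32)
    except ValueError:
        return None  # at least one trailing token is not a number
    return word, vector
```

Taking the last `dim` tokens as the vector sidesteps the regular expression entirely, so words containing digits or spaces are kept intact.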

EDIT2: Here is the full stack trace of the error:

Traceback (most recent call last):
  File "GensimTestSave.py", line 136, in <module>
    w2vObject.wv.save_word2vec_format('./GensimOneWNet.txt', binary=False) #encoding='utf-8' )
  File "/home/pedalo/anaconda3/envs/ltu/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1453, in save_word2vec_format
    fname, self.vocab, self.vectors, fvocab=fvocab, binary=binary, total_vec=total_vec)
  File "/home/pedalo/anaconda3/envs/ltu/lib/python3.7/site-packages/gensim/models/utils_any2vec.py", line 291, in _save_word2vec_format
    fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
  File "/home/pedalo/anaconda3/envs/ltu/lib/python3.7/site-packages/gensim/models/utils_any2vec.py", line 291, in <genexpr>
    fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
python gensim word2vec
1 Answer
0 votes

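I have fixed it now. Apparently I was passing strings as the coefficients to np.asarray(), whereas gensim works with arrays of floats. That was the problem: my own vectors had type str when I swapped them into .wv[], so the result ended up empty.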
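
The difference is easy to see in isolation. A small NumPy-only sketch (the values are made up for illustration): an array built from strings gets a Unicode dtype, while passing `dtype=np.float32` parses the same strings into numbers.

```python
import numpy as np

raw = ["0.0140", "-0.0047", "-0.0043"]  # illustrative values, as read from a text file

as_text = np.asarray(raw)                     # Unicode string array
as_float = np.asarray(raw, dtype=np.float32)  # parsed into 32-bit floats

print(as_text.dtype.kind)  # U
print(as_float.dtype)      # float32
```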

The vector swap now looks like this:

for elem in setIntersection:
    if len(embeddings_index[elem]) != 300:
        print('here', elem) #cast it to the fire
    w2vObject.wv[elem] = np.asarray(embeddings_index[elem], dtype=np.float32)
print('Done!!!')
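
`setIntersection` is not defined in the snippet above; presumably it is the set of words present in both the model vocabulary and `embeddings_index`. A sketch under that assumption, which also skips wrong-sized vectors instead of only printing them; `prepare_replacements` is a hypothetical helper:

```python
import numpy as np

def prepare_replacements(vocab_words, embeddings_index, dim=300):
    """Map each shared word to a float32 vector of length `dim`."""
    shared = set(vocab_words) & set(embeddings_index)
    replacements = {}
    for word in shared:
        values = embeddings_index[word]
        if len(values) != dim:
            continue  # skip malformed vectors
        replacements[word] = np.asarray(values, dtype=np.float32)
    return replacements

# Hypothetical usage with the model from the question:
# for word, vec in prepare_replacements(w2vObject.wv.vocab, embeddings_index).items():
#     w2vObject.wv[word] = vec
```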

Thanks for the comments; they helped me figure it out.
