Gensim's "model.wv.most_similar" returns phonetically similar words


gensim's wv.most_similar returns phonetically close words (similar sounds) rather than semantically similar ones. Is that normal? Why might this happen?

Here is the documentation on most_similar: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar

In [144]: len(vectors.vocab)
Out[144]: 32966

... 

In [140]: vectors.most_similar('fight')
Out[140]:
[('Night', 0.9940935373306274),
 ('knight', 0.9928507804870605),
 ('fright', 0.9925899505615234),
 ('light', 0.9919329285621643),
 ('bright', 0.9914385080337524),
 ('plight', 0.9912853240966797),
 ('Eight', 0.9912533760070801),
 ('sight', 0.9908033013343811),
 ('playwright', 0.9905624985694885),
 ('slight', 0.990411102771759)]

In [141]: vectors.most_similar('care')
Out[141]:
[('spare', 0.9710584878921509),
 ('scare', 0.9626247882843018),
 ('share', 0.9594929218292236),
 ('prepare', 0.9584596157073975),
 ('aware', 0.9551078081130981),
 ('negare', 0.9550014138221741),
 ('glassware', 0.9507938027381897),
 ('Welfare', 0.9489598274230957),
 ('warfare', 0.9487678408622742),
 ('square', 0.9473209381103516)]

The training data consists of academic papers; here is my training script.

from gensim.models.fasttext import FastText as FT_gensim
import gensim.models.keyedvectors as word2vec

# corpus_reader: an iterable of tokenized sentences, defined elsewhere
dim_size = 300
epochs = 10
model = FT_gensim(size=dim_size, window=3, min_count=1)
model.build_vocab(sentences=corpus_reader, progress_per=1000)
model.train(sentences=corpus_reader, total_examples=model.corpus_count, epochs=epochs)

# saving vectors to disk (the word2vec format keeps only full-word vectors,
# so FastText's subword n-gram weights are not saved)
path = "/home/ubuntu/volume/my_vectors.vectors"
model.wv.save_word2vec_format(path, binary=True)

# loading vectors (binary=True must match the format used when saving)
vectors = word2vec.KeyedVectors.load_word2vec_format(path, binary=True)
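
Worth noting, as an aside (a sketch assuming gensim 3.x, with a hypothetical path): the word2vec format above stores only full-word vectors, so the reloaded KeyedVectors can no longer synthesize vectors for out-of-vocabulary words. To preserve the subword information, save and reload the full model instead:

# saving the full model keeps the learned subword n-gram weights
model.save("/home/ubuntu/volume/my_model")  # hypothetical path

# reloading restores FastText's OOV-vector synthesis
model2 = FT_gensim.load("/home/ubuntu/volume/my_model")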
1 Answer

Answer (2 votes):

You chose to train your vectors with the FastText algorithm. That algorithm specifically makes use of subword fragments (like 'ight' or 'are') so that it has a chance of synthesizing good guess-vectors for out-of-vocabulary words that weren't in the training set, and that is likely one reason for the results you are seeing.
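
To make the subword mechanism concrete, here is a minimal sketch (not gensim's internal code) of the character n-grams FastText extracts, using '<' and '>' as word-boundary markers and the default n-gram lengths of 3 to 6:

def char_ngrams(word, min_n=3, max_n=6):
    # wrap the word in boundary markers, as FastText does internally
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams("fight"))
# 'fight', 'light' and 'night' all share n-grams such as 'igh', 'ight' and
# 'ght>', so their FastText vectors share those learned subword components.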

However, a word's own distinct meaning usually dominates; such subword influence mainly comes into play for unknown words. And it is not normal, in a healthy set of word-vectors, for any word's most-similar list to contain so many words at 0.99+ similarity.

So, I suspect there is something weird or deficient about your training data.

What kind of text is it, and how many total words of example usages does it contain?

Did INFO-level logging during training show anything confusing about the training progress/rates?
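
For reference, gensim reports those details through Python's standard logging module; enabling INFO-level logging before training looks like this:

import logging

# gensim emits vocabulary-scan and training-progress messages at INFO level
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)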

(300 dimensions may also be overkill for a vocabulary of only 33K unique words; that vector size is more typical of work with hundreds of thousands to millions of unique words, and plentiful training data.)


Answer (0 votes):

Good call on the dimension size; reducing that param definitely made a difference.

1. Reproduce the original behavior with a bigger corpus (33K --> 275K unique words), keeping dim_size=300.

(Note: I also adjusted a few other parameters, such as min_count and window.)

from gensim.models.fasttext import FastText as FT_gensim

fmodel0 = FT_gensim(size=300, window=5, min_count=3, workers=10)  # window: the maximum distance between the current and predicted word within a sentence
fmodel0.build_vocab(sentences=corpus)
fmodel0.train(sentences=corpus, total_examples=fmodel0.corpus_count, epochs=5)

fmodel0.wv.vocab['cancer'].count  # number of times the word occurred in the corpus
fmodel0.wv.most_similar('cancer')
fmodel0.wv.most_similar('care')
fmodel0.wv.most_similar('fight')

# -----------
# cancer 
[('breastcancer', 0.9182084798812866),
 ('noncancer', 0.9133851528167725),
 ('skincancer', 0.898530900478363),
 ('cancerous', 0.892244279384613),
 ('cancers', 0.8634265065193176),
 ('anticancer', 0.8527657985687256),
 ('Cancer', 0.8359113931655884),
 ('lancer', 0.8296531438827515),
 ('Anticancer', 0.826178252696991),
 ('precancerous', 0.8116365671157837)]

# care
[('_care', 0.9151567816734314),
 ('încălcare', 0.874087929725647),
 ('Nexcare', 0.8578598499298096),
 ('diacare', 0.8515325784683228),
 ('încercare', 0.8445525765419006),
 ('fiecare', 0.8335763812065125),
 ('Mulcare', 0.8296753168106079),
 ('Fiecare', 0.8292017579078674),
 ('homecare', 0.8251558542251587),
 ('carece', 0.8141698837280273)]

# fight
[('Ifight', 0.892048180103302),
 ('fistfight', 0.8553390502929688),
 ('dogfight', 0.8371964693069458),
 ('fighter', 0.8167843818664551),
 ('bullfight', 0.8025394678115845),
 ('gunfight', 0.7972971200942993),
 ('fights', 0.790093183517456),
 ('Gunfight', 0.7893823385238647),
 ('fighting', 0.775499701499939),
 ('Fistfight', 0.770946741104126)]

2. Reduce the dimension size to 5.

_fmodel = FT_gensim(size=5, window=5, min_count=3, workers=10)
_fmodel.build_vocab(sentences=corpus)
_fmodel.train(sentences=corpus, total_examples=_fmodel.corpus_count, epochs=5)  # workers is specified in the constructor


_fmodel.wv.vocab['cancer'].count  # number of times the word occurred in the corpus
_fmodel.wv.most_similar('cancer')
_fmodel.wv.most_similar('care')
_fmodel.wv.most_similar('fight')

# cancer 
[('nutrient', 0.999614417552948),
 ('reuptake', 0.9987781047821045),
 ('organ', 0.9987629652023315),
 ('tracheal', 0.9985960721969604),
 ('digestion', 0.9984923601150513),
 ('cortes', 0.9977986812591553),
 ('liposomes', 0.9977765679359436),
 ('adder', 0.997713565826416),
 ('adrenals', 0.9977011680603027),
 ('digestive', 0.9976763129234314)]

# care
[('lappropriate', 0.9990135431289673),
 ('coping', 0.9984776973724365),
 ('promovem', 0.9983049035072327),
 ('requièrent', 0.9982239603996277),
 ('diverso', 0.9977829456329346),
 ('feebleness', 0.9977156519889832),
 ('pathetical', 0.9975940585136414),
 ('procure', 0.997504472732544),
 ('delinking', 0.9973599910736084),
 ('entonces', 0.99733966588974)]

# fight 
[('decied', 0.9996457099914551),
 ('uprightly', 0.999250054359436),
 ('chillies', 0.9990670680999756),
 ('stuttered', 0.998710036277771),
 ('cries', 0.9985755681991577),
 ('famish', 0.998246431350708),
 ('immortalizes', 0.9981046915054321),
 ('misled', 0.9980905055999756),
 ('whore', 0.9980045557022095),
 ('chanted', 0.9978444576263428)]

Though not great, it no longer returns words that merely share subwords.
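
A quick way to see the subword machinery at work is to query an out-of-vocabulary token directly; a rough sketch (assuming gensim 3.x, where the vocabulary lives at wv.vocab, and using a made-up token):

oov = "fightxyz"  # hypothetical token, assumed absent from the corpus
print(oov in fmodel0.wv.vocab)   # False: never seen during training
vec = fmodel0.wv[oov]            # FastText still synthesizes a vector from n-grams
print(fmodel0.wv.most_similar(oov)[:3])  # neighbors driven purely by shared subwords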

3. And, for good measure, benchmark against Word2Vec.

from gensim.models.word2vec import Word2Vec

wmodel300 = Word2Vec(corpus, size=300, window=5, min_count=2, workers=10)
wmodel300.total_train_time  # 187.1828162111342
wmodel300.wv.most_similar('cancer')

[('cancers', 0.6576876640319824),
 ('melanoma', 0.6564366817474365),
 ('malignancy', 0.6342018842697144),
 ('leukemia', 0.6293295621871948),
 ('disease', 0.6270142197608948),
 ('adenocarcinoma', 0.6181445121765137),
 ('Cancer', 0.6010828614234924),
 ('tumors', 0.5926551222801208),
 ('carcinoma', 0.5917977094650269),
 ('malignant', 0.5778893828392029)]

^ Captures distributional similarity much better, with more realistic similarity measurements.

But with a smaller dim_size, the results get worse (and the similarities are again less realistic, all around 0.99):

wmodel5 = Word2Vec(corpus, size=5, window=5, min_count=2, workers=10)
wmodel5.total_train_time  # 151.4945764541626
wmodel5.wv.most_similar('cancer')

[('insulin', 0.9990534782409668),
 ('reaction', 0.9970406889915466),
 ('embryos', 0.9970351457595825),
 ('antibiotics', 0.9967449903488159),
 ('supplements', 0.9962579011917114),
 ('synthesize', 0.996055543422699),
 ('allergies', 0.9959680438041687),
 ('gadgets', 0.9957243204116821),
 ('mild', 0.9953152537345886),
 ('asthma', 0.994774580001831)]

So, increasing the dimension size seems to help Word2Vec, but not fastText...

I believe the contrast has to do with the fastText model learning subword information, and that somehow interacts with this param so that increasing its value hurts. But I don't know exactly how... I am trying to reconcile this finding with the intuition that increasing the vector size should generally help, since larger vectors can capture more information.
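
One illustration of why tiny dimensionalities crowd all similarities toward 1.0 (an assumed contributing factor, not a full explanation): even unrelated random unit vectors have a much higher expected cosine similarity in 5 dimensions than in 300:

import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n=1000):
    # mean absolute cosine similarity between random unit-vector pairs
    a = rng.normal(size=(n, dim))
    b = rng.normal(size=(n, dim))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return np.abs((a * b).sum(axis=1)).mean()

print(mean_abs_cosine(5))    # roughly 0.36
print(mean_abs_cosine(300))  # roughly 0.05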
