Skip-gram with Word2Vec is not working properly

Question · 1 vote · 1 answer

I am trying to build a word2vec similarity dictionary. I can build the dictionary, but the similarities are not populated correctly. Am I missing something in the code?

Sample input text:

TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN
- EDDY SUSANTO YAHYA ROOM 1503-05 WESTERN CENTRE 40-50 DES VOEUX W. SHEUNG WAN
DNA FINANCIAL SYSTEMS INC UNIT 10 19F WAYSON COMMERCIAL 28 CONNAUGHT RD SHEUNG WAN
G/F 60 PO HING FONG SHEUNG WAN
10B CENTRAL MANSION 270 QUEENS RD CENTRAL SHEUNG WAN
AKAMAI INTERNATIONAL BV C/O IADVANTAGE 28/F OF MEGA I-ADVANTAGE 399 CHAI WAN RD CHAI WAN HONG KO HONG KONG
VICTORIA CHAN F/5E 1-3 FLEMING RD WANCHI WAN CHAI
HISTREND 365 5/F FOO TAK BUILDING 365 HENNESSY RD WAN CHAI H WAN CHAI
ROOM 1201 12F CHINACHEM JOHNSO PLAZA 178 186 JOHNSTON RD WAN CHAI
LUEN WO BUILDING 339 HENNESSY RD 9 FLOOR WAN CHAI HONG KONG

My code:

import gensim
from gensim import corpora, similarities, models

class AccCorpus(object):

    def __init__(self):
        self.path = ''

    def __iter__(self):
        # data is a DataFrame defined elsewhere; Adj_Addr holds the address strings
        for sentence in data["Adj_Addr"]:
            yield [word.lower() for word in sentence.split()]

def build_corpus():
    model = gensim.models.word2vec.Word2Vec(alpha=0.05, min_alpha=0.05, window=2, sg=1)
    sentences = AccCorpus()
    model.build_vocab(sentences)
    for epoch in range(1):
        model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay

    model_name = "word2vec_model"
    model.save(model_name)
    return model

model = build_corpus()

My results:

model.most_similar("wan")
[('want', 0.6867533922195435),
 ('puiwan', 0.6323356032371521),
 ('wan.', 0.6132887005805969),
 ('wanstreet', 0.5945449471473694),
 ('aupuiwan', 0.594132661819458),
 ('futan', 0.5883135199546814),
 ('fotan', 0.5817855000495911),
 ('shanmei', 0.5807071924209595),
 ('30-33', 0.5789132118225098),
 ('61-63au', 0.5711270570755005)]

Here is the output I expected for the most similar words: sheung wan, wan chai, chai wan. I'm guessing my skip-grams are not working properly. How can I fix this?

scikit-learn neural-network word2vec gensim word-embedding
1 Answer

2 votes

As was already suggested in the comments, there is no need to tune alpha and the other internal parameters unless you are sure it is necessary (and in your case it most likely isn't).
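For reference, a single train() call in gensim already decays the learning rate linearly from alpha down to min_alpha over the whole run, so the manual per-epoch adjustment in the question duplicates (and fights) the built-in schedule. A minimal stdlib sketch of that linear decay, assuming the default alpha=0.025 and min_alpha=0.0001:

```python
# Linear learning-rate schedule of the kind gensim applies internally during
# one train() call (defaults assumed: alpha=0.025, min_alpha=0.0001).
alpha, min_alpha = 0.025, 0.0001
total_steps = 5  # stand-in for the total number of training batches

schedule = [alpha - (alpha - min_alpha) * step / total_steps
            for step in range(total_steps + 1)]

print(schedule[0])   # starts at alpha
print(schedule[-1])  # ends at min_alpha (up to float rounding)
```

Subtracting 0.002 per epoch on top of this, as in the question, just shifts the schedule in an uncontrolled way.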

You are getting all those extra results because they are in your data. I don't know what Adj_Addr is, but it clearly contains more than the text you posted: puiwan, futan, fotan, ... none of these appear in the text above.
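One quick way to confirm this is to tokenize the posted sample exactly as the corpus iterator does and check whether those neighbours occur in it at all (a stdlib-only sketch; the sample is abbreviated to a few lines):

```python
# Tokenize a few of the posted address lines the same way AccCorpus.__iter__ does
sample = """TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN
G/F 60 PO HING FONG SHEUNG WAN
VICTORIA CHAN F/5E 1-3 FLEMING RD WANCHI WAN CHAI
LUEN WO BUILDING 339 HENNESSY RD 9 FLOOR WAN CHAI HONG KONG"""

tokens = {word.lower() for line in sample.split("\n") for word in line.split()}

for neighbour in ["wan", "puiwan", "futan", "fotan"]:
    print(neighbour, neighbour in tokens)
# "wan" is present, but none of the unexpected neighbours are --
# they must come from other rows of Adj_Addr
```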

Here is a clean test that works the way you expect it to (I kept only the relevant parts; feel free to add sg=1, that works too):

import gensim

text = """TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN
- EDDY SUSANTO YAHYA ROOM 1503-05 WESTERN CENTRE 40-50 DES VOEUX W. SHEUNG WAN
DNA FINANCIAL SYSTEMS INC UNIT 10 19F WAYSON COMMERCIAL 28 CONNAUGHT RD SHEUNG WAN
G/F 60 PO HING FONG SHEUNG WAN
10B CENTRAL MANSION 270 QUEENS RD CENTRAL SHEUNG WAN
AKAMAI INTERNATIONAL BV C/O IADVANTAGE 28/F OF MEGA I-ADVANTAGE 399 CHAI WAN RD CHAI WAN HONG KO HONG KONG
VICTORIA CHAN F/5E 1-3 FLEMING RD WANCHI WAN CHAI
HISTREND 365 5/F FOO TAK BUILDING 365 HENNESSY RD WAN CHAI H WAN CHAI
ROOM 1201 12F CHINACHEM JOHNSO PLAZA 178 186 JOHNSTON RD WAN CHAI
LUEN WO BUILDING 339 HENNESSY RD 9 FLOOR WAN CHAI HONG KONG"""

sentences = text.split('\n')

class AccCorpus(object):
  def __init__(self):
    self.path = ''

  def __iter__(self):
    for sentence in sentences:
      yield [word.lower() for word in sentence.split()]

def build_corpus():
  model = gensim.models.word2vec.Word2Vec()
  sentences = AccCorpus()
  model.build_vocab(sentences)
  model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
  return model

model = build_corpus()
print(model.most_similar("wan"))

The result is:

[('chai', 0.04687393456697464), ('rd', -0.03181878849864006), ('sheung', -0.06769674271345139)]
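A side note on why chai still tops the list: with a corpus this small, most_similar largely mirrors raw window co-occurrence, and wan sits next to chai far more often than next to sheung. A stdlib sketch over a few of the sample lines (window=2, matching the question's setting) illustrates this:

```python
from collections import Counter

# A few of the sample address lines, lower-cased as the corpus iterator would
lines = [
    "g/f 60 po hing fong sheung wan",
    "10b central mansion 270 queens rd central sheung wan",
    "akamai international bv c/o iadvantage 28/f of mega i-advantage 399 chai wan rd chai wan hong ko hong kong",
    "victoria chan f/5e 1-3 fleming rd wanchi wan chai",
    "room 1201 12f chinachem johnso plaza 178 186 johnston rd wan chai",
]

window = 2
neighbours = Counter()
for line in lines:
    words = line.split()
    for i, w in enumerate(words):
        if w != "wan":
            continue
        # count every token within `window` positions of this "wan"
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                neighbours[words[j]] += 1

print(neighbours.most_common(3))  # "chai" co-occurs with "wan" most often
```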