Fine-tuning a custom word2vec model with gensim 4


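I am new to gensim, and especially to gensim 4. Honestly, I find it hard to work out from the documentation how to fine-tune a pretrained word2vec model. I have a pretrained model saved locally in binary format, and I want to fine-tune it on new data.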

My questions are:

  • How do I create a vocabulary that merges the two vocabularies?
  • Is this the correct way to fine-tune a word2vec model?

So far, I have written the following code:

import numpy as np
import pandas as pd
from gensim.models import KeyedVectors, Word2Vec

# path to pretrained model
pretrained_path = '../models/german.model'

# new data
sentences = df.stem_token_wo_sw.to_list() # Pandas column containing text data

# Create new model
w2v_de = Word2Vec(
    min_count = min_count,
    vector_size = vector_size,
    window = window,
    workers = workers,
)

# Build vocab
w2v_de.build_vocab(sentences)

# Extract number of examples
total_examples = w2v_de.corpus_count

# Load pretrained model
model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)

# Add previous words from pretrained model
w2v_de.build_vocab([list(model.key_to_index.keys())], update=True)

# Train model
w2v_de.train(sentences, total_examples=total_examples, epochs=2)

# create array of vectors
vectors = np.asarray(w2v_de.wv.vectors)
# create array of labels
labels = np.asarray(w2v_de.wv.index_to_key) 

# create dataframe of vectors for each word
w_emb = pd.DataFrame(
    index = labels,
    columns = [f'X{n}' for n in range(1, vectors.shape[1] + 1)],
    data = vectors,
)
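
For reference, the merged vocabulary this code is aiming for can be described without gensim: the union of both word lists, where words known to the pretrained model start from their pretrained vectors and words seen only in the new data start from a small random vector. The sketch below is a plain-Python illustration under that assumption; merge_vocab and the toy words and vectors are hypothetical and are not part of the gensim API.

```python
import random


def merge_vocab(pretrained, new_words, vector_size, seed=0):
    """Return {word: vector} covering both pretrained and new words.

    Hypothetical helper for illustration only, not a gensim function.
    """
    rng = random.Random(seed)
    # Pretrained words keep their existing vectors (copied, not shared).
    merged = {word: list(vec) for word, vec in pretrained.items()}
    for word in new_words:
        if word not in merged:
            # Words unseen by the pretrained model get a small random start.
            merged[word] = [
                rng.uniform(-0.5, 0.5) / vector_size for _ in range(vector_size)
            ]
    return merged


# Toy data (assumed for the example): two pretrained words, one unseen word.
pretrained = {"haus": [0.1, 0.2, 0.3], "auto": [0.4, 0.5, 0.6]}
new_words = ["auto", "fahrrad"]

merged = merge_vocab(pretrained, new_words, vector_size=3)
print(sorted(merged))
```

Whatever API is used to build it, the fine-tuned model's vocabulary would need to look like merged here: every pretrained word present and starting from its pretrained vector.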

After training, I use PCA to reduce the dimensionality from 300 to two so that I can plot the word embedding space.

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# create pipeline
pipeline = Pipeline(
    steps = [
        # ('scaler', StandardScaler()),
        ('pca', PCA(n_components=2)),
    ]
)

# fit pipeline
pipeline.fit(w_emb)

# Transform vectors
vectors_transformed = pipeline.transform(w_emb)

w_emb_transformed = (
    pd.DataFrame(
        index = labels,
        columns = ['PC1', 'PC2'],
        data = vectors_transformed,
    )
)

labels and vectors contain only the new words, not the old and new words combined, and the same is true of my plot and PCA values.
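
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the update call passes every pretrained word inside a single sentence, so each word has a count of 1 in that pass. If min_count is greater than 1 (gensim's default is 5), a frequency filter would discard all of them. The plain-Python sketch below mimics such a filter on toy data; filter_by_min_count is a hypothetical helper, not gensim code.

```python
from collections import Counter


def filter_by_min_count(sentences, min_count):
    """Keep only words that occur at least min_count times in sentences.

    Hypothetical stand-in for a frequency-based vocabulary filter.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# Toy stand-in for list(model.key_to_index.keys()).
pretrained_keys = ["haus", "auto", "baum"]

# Same shape as the update call: one "sentence" holding every key once.
update_corpus = [pretrained_keys]

kept_with_default = filter_by_min_count(update_corpus, min_count=5)
kept_with_one = filter_by_min_count(update_corpus, min_count=1)

print(kept_with_default)  # empty: every word occurs only once
print(sorted(kept_with_one))
```

Note also that the update call only passes the word strings, so even words that survive the filter would not carry their pretrained vectors into the new model.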

python scikit-learn nlp gensim word2vec