Word2Vec: computing a movie's similarity to top-performing movies

Question · votes: 0 · answers: 1

I have a dataset containing user ratings for movies along with the movie descriptions:

import pandas as pd

df = pd.DataFrame({
    'description': [
        'Two imprisoned men bond over a number of years',
        'A family heads to an isolated hotel for the winter',
        'In a future where technology controls everything',
        'A young lion prince flees his kingdom only to learn the true meaning of responsibility',
        'A group of intergalactic criminals are forced to work together to stop a fanatical warrior'
    ],
    'ratings': [8.7, 9.3, 7.9, 8.5, 8.1]
})
df

I want to use the descriptions (along with other features) to predict the movie ratings.

I am trying to use Word2Vec to compute a similarity score that measures how similar a new movie is to movies that have performed well in the past. My plan is to define the top-performing movies and compute a similarity score for every movie in the dataset before feeding the dataset to another machine learning algorithm.

However, I am having trouble computing the similarity scores (I have never used this approach before).

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Create Tokens
df['tokenized_description'] = df['description'].apply(lambda x: word_tokenize(x.lower()))

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=df['tokenized_description'], vector_size=100, window=5, min_count=1, workers=4)

# define top performing movies
threshold = df['ratings'].quantile(0.75)
highest_grossing_movies = df[df['ratings'] >= threshold]

# Tokenize descriptions of highest-grossing movies
highest_grossing_movies['tokenized_description'] = highest_grossing_movies['description'].apply(lambda x: word_tokenize(x.lower()))

# Convert the tokenized descriptions to embeddings
embeddings_high_grossing = highest_grossing_movies['description'].apply(lambda desc: word2vec_model.wv[word_tokenize(desc)]).tolist()

# Assess similarity for each movie description in the entire DataFrame
df['similarity_score'] = [word2vec_model.wv.similarity(df['description'])

When I run the code, I get the error:

KeyError: "Key 'Two' not present"

I am sure the last line of the code is wrong, but I don't know how to fix it.

python nlp word2vec
1 Answer
0 votes

Make sure to lowercase the input descriptions before looking up their tokens, since the model was trained on lowercased text:

# Convert the tokenized descriptions to embeddings
embeddings_high_grossing = highest_grossing_movies['description'].apply(
    lambda desc: word2vec_model.wv[word_tokenize(desc.lower())]
).tolist()