Cosine similarity of text


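I have a dataset in which one column contains course names. I need to write code that takes a search query and returns the 10 courses most similar to it.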

Below is the code I tried. However, the results include the query itself, with a score of 1.0. Please help me solve this!

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# List of documents to search; list_of_names holds the course names from the dataset
documents = list_of_names

# Fit the CountVectorizer and keep the document vectors so they are computed only once
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Define the search query
query = 'aws'

# Transform the query into a vector using the fitted vectorizer
query_vector = vectorizer.transform([query])

# Calculate the cosine similarity between the query vector and all document vectors
# (the result has shape 1 x number of documents, so take row 0)
similarity_scores = cosine_similarity(query_vector, doc_vectors)[0]

# Find the indices of the top 10 most similar documents, highest score first
most_similar_indices = np.argsort(similarity_scores)[::-1][:10]

# Print the top 10 most similar documents and their similarity scores
for rank, idx in enumerate(most_similar_indices, start=1):
    print('Rank:', rank)
    print('Document:', documents[idx])
    print('Similarity score:', similarity_scores[idx])
    print('---')
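
A score of 1.0 is what cosine similarity gives when two count vectors point in the same direction. This would happen if the course list contains an entry whose only token is the query, for example a course named just "aws" or "AWS", since CountVectorizer lowercases by default. Whether that is the case here is an assumption; the question does not show the data. Under that assumption, one possible sketch is to skip entries that equal the query after normalization and to stop once the scores reach zero. The helper name top_similar and the sample course names are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_similar(query, documents, k=10):
    """Return up to k (document, score) pairs, skipping exact matches of the query."""
    vectorizer = CountVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]

    normalized_query = query.strip().lower()
    results = []
    # Walk the documents from the highest score to the lowest
    for idx in np.argsort(scores)[::-1]:
        if scores[idx] <= 0:
            break  # the remaining documents share no tokens with the query
        if documents[idx].strip().lower() == normalized_query:
            continue  # skip an entry that is the query itself
        results.append((documents[idx], float(scores[idx])))
        if len(results) == k:
            break
    return results


# Hypothetical sample data standing in for list_of_names
sample_names = [
    "aws",
    "AWS Cloud Practitioner",
    "Introduction to AWS Lambda",
    "Python for Data Science",
    "Azure Fundamentals",
]
results = top_similar("aws", sample_names, k=10)
for rank, (name, score) in enumerate(results, start=1):
    print(rank, name, round(score, 3))
```

On this sample data the bare "aws" entry is dropped and only the two courses that mention AWS are returned, with the shorter title ranked first. If the 1.0 score comes from something else in the real dataset, the filter condition would need to be adapted.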
Tags: nlp, cosine-similarity