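I have a dataset in which one column contains course names. I need to write code so that a search query returns the 10 courses most similar to the given query.
Below is the code I tried. However, it returns the query itself with a score of 1.0. Please help me fix this!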
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# List of documents to search
documents = list_of_names
# Transform documents into vectors using the CountVectorizer
vectorizer = CountVectorizer()
document_vectors = vectorizer.fit_transform(documents)
# Define the search query
query = 'aws'
# Transform the query into a vector using the fitted vectorizer
query_vector = vectorizer.transform([query])
# Calculate the cosine similarity between the query vector and all document vectors
similarity_scores = cosine_similarity(query_vector, document_vectors)
# Find the indices of the top 10 most similar documents
most_similar_indices = np.argsort(similarity_scores)[:, :-11:-1]
# Print the top 10 most similar documents and their similarity scores
for i, idx in enumerate(most_similar_indices[0]):
    print('Rank:', i+1)
    print('Document:', documents[idx])
    print('Similarity score:', similarity_scores[0][idx])
    print('---')
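
One possible reading of the 1.0 score is that the list contains a course whose name is exactly the query (for example a course simply called "aws"), so its vector is identical to the query vector and the cosine similarity is 1.0. The sketch below assumes that is the cause and skips such exact matches before collecting the top results. The function name `top_similar`, the parameter `k`, and the sample course names are hypothetical and only serve as illustration. It also stops at documents with a score of 0, since those share no tokens with the query.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_similar(query, documents, k=10):
    """Return up to k (document, score) pairs, skipping exact matches of the query."""
    vectorizer = CountVectorizer()
    document_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    # One row of scores: similarity between the query and every document
    scores = cosine_similarity(query_vector, document_vectors)[0]

    normalized_query = query.strip().lower()
    results = []
    # Walk the documents from highest to lowest score
    for idx in np.argsort(scores)[::-1]:
        if scores[idx] <= 0:
            break  # remaining documents share no tokens with the query
        if documents[idx].strip().lower() == normalized_query:
            continue  # assumption: an identical name is "the query itself"
        results.append((documents[idx], float(scores[idx])))
        if len(results) == k:
            break
    return results


# Hypothetical sample data, standing in for list_of_names
sample_courses = ["aws", "AWS Cloud Practitioner", "Intro to AWS Lambda", "Python Basics"]
for rank, (name, score) in enumerate(top_similar("aws", sample_courses), start=1):
    print(rank, name, round(score, 3))
```

If the exact match should be kept but ranked separately, the `continue` branch could instead record it in its own list. If the 1.0 score comes from something else (for instance duplicated rows in the dataset), this filter would not be enough on its own.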