Map the top cosine-ranked most similar document back to each respective document in the original list


I can't figure out how to map the top (#1) most similar document in my list back to each document item in the original list.

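I go through some preprocessing: n-grams, lemmatization, and TF-IDF. Then I use scikit-learn's linear kernel. I tried using extract features, but I don't know how to use it with a csr matrix...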

I have tried various things (e.g. "Using csr_matrix of items similarities to get most similar items to item X without having to transform csr_matrix to dense matrix").

import string, nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer 
from sklearn.metrics.pairwise import cosine_similarity
import sparse_dot_topn.sparse_dot_topn as ct
import re

documents = ['the cat in the hat', 'the catty ate the hat', 'the cat wants the cats hat']

def ngrams(text, n=2):
    # Strip some punctuation (and the token " BD"), then build character n-grams.
    text = re.sub(r'[,-./]|\sBD', r'', text)
    grams = zip(*[text[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]

lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

# Note: when analyzer is a callable, scikit-learn ignores tokenizer and stop_words,
# so only the character n-grams from ngrams() are actually used here.
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, analyzer=ngrams, stop_words='english')
tfidf_matrix = TfidfVec.fit_transform(documents)

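from sklearn.metrics.pairwise import linear_kernel

# Similarity of the first document (row 0) against every document, itself included.
cosine_similarities = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten()

# Indices of up to four documents, most similar first (index 0 itself comes out on top).
related_docs_indices = cosine_similarities.argsort()[:-5:-1]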

cosine_similarities

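My current example only compares the first row against all documents. How can I get output that looks like the following into a dataframe (note that the original documents come from a dataframe)?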

original df col               most similar doc          similarity%
'the cat in the hat'          'the catty ate the hat'   80%
'the catty ate the hat'       'the cat in the hat'      80%
'the cat wants the cats hat'  'the catty ate the hat'   20%
python pandas scikit-learn nlp cosine-similarity
1 Answer
# Reuses documents, tfidf_matrix and linear_kernel from the question's code.
import pandas as pd

df = pd.DataFrame(columns=["original df col", "most similar doc", "similarity%"])
for i in range(len(documents)):
    cosine_similarities = linear_kernel(tfidf_matrix[i:i+1], tfidf_matrix).flatten()
    # make pairs of (index, similarity)
    cosine_similarities = list(enumerate(cosine_similarities))
    # delete the cosine similarity with itself
    cosine_similarities.pop(i)
    # get the tuple with max similarity
    most_similar, similarity = max(cosine_similarities, key=lambda t:t[1])
    df.loc[len(df)] = [documents[i], documents[most_similar], similarity]

Result:

              original df col       most similar doc  similarity%
0          the cat in the hat  the catty ate the hat     0.664119
1       the catty ate the hat     the cat in the hat     0.664119
2  the cat wants the cats hat     the cat in the hat     0.577967
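
The loop above calls linear_kernel once per document. As a possible alternative, the whole similarity matrix can be computed in one call and the best match per row picked with NumPy. This is only a sketch under some assumptions: it uses the built-in character-bigram analyzer (analyzer='char', ngram_range=(2, 2)) in place of the question's custom ngrams function, so the scores will not match the table above exactly, and it builds a dense n x n matrix, which is fine for small lists but may not fit in memory for very large ones.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

documents = ['the cat in the hat', 'the catty ate the hat', 'the cat wants the cats hat']

# Assumption: built-in character bigrams stand in for the question's ngrams() analyzer.
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 2))
tfidf_matrix = vectorizer.fit_transform(documents)

# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel equals the cosine similarity.
similarity = linear_kernel(tfidf_matrix, tfidf_matrix)  # dense (n_docs, n_docs) array

# Rule out each document matching itself before taking the row-wise maximum.
np.fill_diagonal(similarity, -np.inf)

best = similarity.argmax(axis=1)                          # index of the best match per row
best_score = similarity[np.arange(len(documents)), best]  # similarity of that match

df = pd.DataFrame({
    'original df col': documents,
    'most similar doc': [documents[j] for j in best],
    'similarity%': best_score,
})
print(df)
```

If the documents already live in a dataframe column, passing that column to fit_transform and assigning the two result columns back to the same dataframe should work the same way.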