Computing cosine similarity between strings does not give the expected result

Question

I want to compute the pairwise cosine similarity between the two strings in each row of a pandas DataFrame.

I used the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


pd.set_option('display.float_format', '{:.4f}'.format)


df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog', 'The red apple', 'The big blue sky'],
                   'text2': ['The lazy cat jumps over the brown dog', 'The red apple', 'The big yellow sun']})


vectorizer = CountVectorizer().fit_transform(df['text1'] + ' ' + df['text2'])


cosine_similarities = cosine_similarity(vectorizer)[:, 0:1]


df['cosine_similarity'] = cosine_similarities


print(df)

It gave me the following output, which does not look right:

Can anyone help me figure out what I am doing wrong?

Thanks.

Answer 1

I'm not an expert, but here is one way to do it.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pd.set_option('display.float_format', '{:.4f}'.format)

df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog',
                             'The red apple',
                             'The big blue sky'],
                   'text2': ['The lazy cat jumps over the brown dog',
                             'The red apple',
                             'The big yellow sun']})

vectorizer = CountVectorizer()

# np.hstack([df["text1"], df["text2"]]) puts all "text2" after "text1"
X = vectorizer.fit_transform(np.hstack([df["text1"], df["text2"]]))

cs = cosine_similarity(X)  # full symmetric numpy.ndarray

# The values you want are on an offset diagonal of cs since
# "text2" strings were stacked at the end of "text1" strings

pairwise_cs = cs.diagonal(offset=len(df))
df["cosine_similarity"] = pairwise_cs

print(df)

This prints:

                                         text1                                  text2  cosine_similarity
0  The quick brown fox jumps over the lazy dog  The lazy cat jumps over the brown dog             0.8581
1                                The red apple                          The red apple             1.0000
2                             The big blue sky                     The big yellow sun             0.5000
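For context, the code in the question joins each row's `text1` and `text2` into a single document, so `cosine_similarity` ends up comparing rows with each other, and the `[:, 0:1]` slice keeps only each row's similarity to row 0. The answer above works, but it builds the full 2n x 2n similarity matrix just to read one diagonal. A possible alternative, offered here as a minimal sketch and not as part of the original answer, is to fit one shared vocabulary, vectorize the two columns separately, L2-normalize the rows, and take row-wise dot products. The names `A`, `B`, and `rowwise_cs` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog',
                             'The red apple',
                             'The big blue sky'],
                   'text2': ['The lazy cat jumps over the brown dog',
                             'The red apple',
                             'The big yellow sun']})

# Fit a single vocabulary on both columns so the two sets of
# count vectors share the same feature space.
vectorizer = CountVectorizer().fit(pd.concat([df['text1'], df['text2']]))

# L2-normalize each row; the dot product of two unit vectors
# is their cosine similarity.
A = normalize(vectorizer.transform(df['text1']))
B = normalize(vectorizer.transform(df['text2']))

# Element-wise product summed across features gives one
# dot product per row, i.e. text1[i] against text2[i] only.
rowwise_cs = np.asarray(A.multiply(B).sum(axis=1)).ravel()

df['cosine_similarity'] = rowwise_cs
print(df)
```

Under these assumptions the values should agree with the offset-diagonal approach, while memory use grows with the number of rows instead of its square.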
