I want to compute the pairwise cosine similarity between the two strings in each row of a pandas DataFrame.
I used the following code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
pd.set_option('display.float_format', '{:.4f}'.format)
df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog', 'The red apple', 'The big blue sky'],
                   'text2': ['The lazy cat jumps over the brown dog', 'The red apple', 'The big yellow sun']})
vectorizer = CountVectorizer().fit_transform(df['text1'] + ' ' + df['text2'])
cosine_similarities = cosine_similarity(vectorizer)[:, 0:1]
df['cosine_similarity'] = cosine_similarities
print(df)
It gave me the following output, which seems incorrect:
Can anyone help me figure out what I'm doing wrong?
Thanks.
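I'm not an expert, but here is one approach. In your code, df['text1'] + ' ' + df['text2'] joins the two strings of each row into a single document, so cosine_similarity compares whole rows with each other, and [:, 0:1] keeps each row's similarity to row 0 rather than the similarity between text1 and text2. Vectorizing the two columns as separate documents avoids that: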
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
pd.set_option('display.float_format', '{:.4f}'.format)
df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog',
                             'The red apple',
                             'The big blue sky'],
                   'text2': ['The lazy cat jumps over the brown dog',
                             'The red apple',
                             'The big yellow sun']})
vectorizer = CountVectorizer()
# np.hstack([df["text1"], df["text2"]]) puts all "text2" after "text1"
X = vectorizer.fit_transform(np.hstack([df["text1"], df["text2"]]))
cs = cosine_similarity(X) # full symmetric numpy.ndarray
# The values you want are on an offset diagonal of cs since
# "text2" strings were stacked at the end of "text1" strings
pairwise_cs = cs.diagonal(offset=len(df))
df["cosine_similarity"] = pairwise_cs
print(df)
This prints:
                                         text1                                  text2  cosine_similarity
0  The quick brown fox jumps over the lazy dog  The lazy cat jumps over the brown dog             0.8581
1                                The red apple                          The red apple             1.0000
2                             The big blue sky                     The big yellow sun             0.5000
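
The approach above builds the full 2n x 2n similarity matrix and then reads only one offset diagonal, which could get expensive for a large frame. A possible alternative, sketched below under the assumption that you only need the row-by-row values, is to L2-normalize the two count matrices and take their row-wise dot product. The helper name rowwise_cosine is made up for this example and is not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize


def rowwise_cosine(frame, col_a, col_b):
    """Hypothetical helper: cosine similarity between col_a and col_b, row by row."""
    vectorizer = CountVectorizer()
    # Fit on both columns so they share a single vocabulary
    vectorizer.fit(pd.concat([frame[col_a], frame[col_b]]))
    # L2-normalize each row; rows with no tokens stay all-zero
    a = normalize(vectorizer.transform(frame[col_a]))
    b = normalize(vectorizer.transform(frame[col_b]))
    # The dot product of unit vectors is their cosine similarity
    return np.asarray(a.multiply(b).sum(axis=1)).ravel()


df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog',
                             'The red apple',
                             'The big blue sky'],
                   'text2': ['The lazy cat jumps over the brown dog',
                             'The red apple',
                             'The big yellow sun']})
df["cosine_similarity"] = rowwise_cosine(df, "text1", "text2")
print(df)
```

On the sample frame this should give the same three values as the output above, since both versions use the same shared vocabulary and the same word counts.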