I extracted this code from https://spacy.io/universe/project/spacy-sentence-bert:

import spacy_sentence_bert
# load one of the models listed at https://github.com/MartinoMensio/spacy-sentence-bert/
nlp = spacy_sentence_bert.load_model('en_roberta_large_nli_stsb_mean_tokens')
# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))
I have a dataframe with two columns containing sentences like the ones below. I am trying to compute the similarity between the two sentences in each row. I have tried several different approaches without luck, so I thought I would ask here. Thanks, everyone.
Current df:
Sentence1 | Sentence2
Another-Sentence1 | Another-Sentence2
Yet-Another-Sentence1 | Yet-Another-Sentence2
Desired output:
Sentence1 | Sentence2 | Similarity-Score-Sentence1-Sentence2
Another-Sentence1 | Another-Sentence2 | Similarity-Score-Another-Sentence1-Another-Sentence2
Yet-Another-Sentence1 | Yet-Another-Sentence2 | Similarity-Score-Yet-Another-Sentence1-Yet-Another-Sentence2
I assume your first row consists of headers and the data starts on the row after them, and that you are using pandas to read the CSV into a dataframe. The code below works in my environment.
import spacy_sentence_bert
import pandas as pd

nlp = spacy_sentence_bert.load_model('en_roberta_large_nli_stsb_mean_tokens')
df = pd.read_csv('testing.csv')

similarityValue = []
for i in range(len(df)):  # one iteration per row
    sentence_1 = nlp(df.iloc[i, 0])
    sentence_2 = nlp(df.iloc[i, 1])
    similarityValue.append(sentence_1.similarity(sentence_2))
    print(sentence_1, '|', sentence_2, '|', sentence_1.similarity(sentence_2))
df['Similarity'] = similarityValue
print(df)
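The same row-wise pattern can also be written with `DataFrame.apply` instead of an index loop. The sketch below is illustrative: the `embeddings` dict and `cosine` function are stand-ins for the spaCy model (in real use you would compute `nlp(sentence)` vectors), since `Doc.similarity` is a cosine similarity over the document vectors.

```python
import numpy as np
import pandas as pd

# Toy embeddings standing in for nlp(...).vector; in practice you would
# run each sentence through the loaded spaCy model instead.
embeddings = {
    'Hi there, how are you?': np.array([0.9, 0.1, 0.0]),
    'Hello there, how are you doing today?': np.array([0.8, 0.2, 0.1]),
    'The sky is blue.': np.array([0.0, 0.1, 0.9]),
    'Bananas are yellow.': np.array([0.1, 0.0, 0.8]),
}

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

df = pd.DataFrame({
    'Sentence1': ['Hi there, how are you?', 'The sky is blue.'],
    'Sentence2': ['Hello there, how are you doing today?', 'Bananas are yellow.'],
})

# axis=1 passes each row to the lambda, so no explicit index loop is needed
df['Similarity'] = df.apply(
    lambda row: cosine(embeddings[row['Sentence1']], embeddings[row['Sentence2']]),
    axis=1,
)
print(df)
```

With the real model you would replace the dict lookup with `nlp(row['Sentence1']).similarity(nlp(row['Sentence2']))`, at the cost of re-running the model per row.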
Input CSV:

Output: