How to find the similarity of sentences in two columns of a dataframe using spaCy

Question — votes: 0, answers: 1

I took this code from https://spacy.io/universe/project/spacy-sentence-bert:
import spacy_sentence_bert
# load one of the models listed at https://github.com/MartinoMensio/spacy-sentence-bert/
nlp = spacy_sentence_bert.load_model('en_roberta_large_nli_stsb_mean_tokens')
# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))
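
For reference, the similarity call above is just the cosine similarity of the two document vectors produced by the sentence-BERT model. A minimal manual equivalent (a sketch, assuming numpy is installed) would be:

import numpy as np

def cosine_similarity(doc_a, doc_b):
    # cosine similarity between the two document vectors
    a, b = doc_a.vector, doc_b.vector
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(doc_1, doc_2))  # should roughly match doc_1.similarity(doc_2)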

I have a dataframe with two columns containing sentences like the ones below. I'm trying to find the similarity between the two sentences in each row. I've tried a few different approaches without much luck, so I'm asking here. Thanks, everyone.

Current df:

Sentence1 | Sentence2

Another-Sentence1 | Another-Sentence2

Yet-Another-Sentence1 | Yet-Another-Sentence2

Desired output:

Sentence1 | Sentence2 | Similarity-Score-Sentence1-Sentence2

Another-Sentence1 | Another-Sentence2 | Similarity-Score-Another-Sentence1-Another-Sentence2

Yet-Another-Sentence1 | Yet-Another-Sentence2 | Similarity-Score-Yet-Another-Sentence1-Yet-Another-Sentence2
python pandas spacy similarity bert-language-model
1 Answer (score: 2)

I'm assuming that your first row is a header, that the data starts on the row after it, and that you're using pandas to read the CSV into a dataframe. The code below works in my environment.

import spacy_sentence_bert
import pandas as pd

# load the sentence-BERT pipeline and read the CSV into a dataframe
nlp = spacy_sentence_bert.load_model('en_roberta_large_nli_stsb_mean_tokens')
df = pd.read_csv('testing.csv')
similarity_values = []

# for each row, run both sentences through the model and compare them
for i in range(len(df)):
    sentence_1 = nlp(df.iloc[i, 0])
    sentence_2 = nlp(df.iloc[i, 1])
    score = sentence_1.similarity(sentence_2)
    similarity_values.append(score)
    print(sentence_1, '|', sentence_2, '|', score)

# attach the scores as a new column
df['Similarity'] = similarity_values
print(df)

Input CSV:

Output:
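
If the dataframe is large, a possible variation (not part of the original answer) is to batch each column through the pipeline with nlp.pipe instead of calling nlp row by row; the column names Sentence1 and Sentence2 are assumed here:

import pandas as pd
import spacy_sentence_bert

nlp = spacy_sentence_bert.load_model('en_roberta_large_nli_stsb_mean_tokens')
df = pd.read_csv('testing.csv')  # assumed to contain 'Sentence1' and 'Sentence2' columns

# run each column through the pipeline in batches, then pair the docs row by row
docs_1 = list(nlp.pipe(df['Sentence1'].astype(str)))
docs_2 = list(nlp.pipe(df['Sentence2'].astype(str)))
df['Similarity'] = [d1.similarity(d2) for d1, d2 in zip(docs_1, docs_2)]
print(df)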
