How can I detect whether two sentences are similar, not in meaning but in syllables/words?


Here are some examples of the kinds of sentences that need to be considered "similar":

there was a most extraordinary noise going on shrinking rapidly she soon made out
there was a most extraordinary noise going on shrinking rapid
that will be a very little alice knew it was just possible it had
thou wilt be very little alice i knew it was possible to add
however at last it sat down and looked very anxiously into her face and
however that lives in sadtown and look very anxiously into him facing it
she went in search of her or of anything to say she simply bowed
she went in the search of her own or of anything to say
and she squeezed herself up on tiptoe and peeped over the wig he did
and she squeezed herself up on the tiptoe and peeped over her wig he did
she had not noticed before and behind it was very glad to find that
she had not noticed before and behind it it was very glad to find that
as soon as the soldiers had to fall a long hookah and taking not
soon as the soldiers have to fall along huka and taking knots

Here are some examples of harder edge cases that I would like to catch, although this is not required:

so she tucked it under her arm with its head it would not join
she tucked it under her arm with its head
let me see four times five is twelve and four times five is twelve 
let me see  times  is  and  times  is
let me see four times seven is oh dear run home this moment and 
times  is o dear run home this moment and
in a minute or two she walked sadly down the middle being held up 
and then well see you sidely down the middle in health often

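Sentences that are different and have nothing in common need to be flagged as not similar. If there is an algorithm that outputs a "score" rather than a similar/not-similar boolean, I can determine the threshold I need through my own testing.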

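The top sentence in each example is randomly generated; the bottom sentence is the output of a speech-to-text neural network, produced from an audio file of someone reading the top line aloud. If some kind of syllable comparison would be more accurate, I could use that instead of this word comparison technique, since I have both the original source text and the audio.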

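My current approach indexes each word, once forward and once in reverse, and then checks how many words line up. If at least 10 words match in either index order, I treat the sentences as similar. However, all of the examples provided are cases where this strategy does not work.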
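A minimal sketch of how the positional approach described above might look, to make the failure mode concrete. This is only one reading of the description: the function names and the whitespace tokenization are assumptions, not the asker's actual code.

```python
def positional_matches(words1, words2):
    """Count positions where both word lists hold the same word."""
    return sum(w1 == w2 for w1, w2 in zip(words1, words2))


def aligned_word_count(sentence1, sentence2):
    """Best positional match count, aligning from the start or from the end."""
    words1, words2 = sentence1.split(), sentence2.split()
    forward = positional_matches(words1, words2)
    backward = positional_matches(words1[::-1], words2[::-1])
    return max(forward, backward)


def is_similar(sentence1, sentence2, min_matches=10):
    """Threshold rule from the description: at least 10 aligned words."""
    return aligned_word_count(sentence1, sentence2) >= min_matches


source = "she went in search of her or of anything to say she simply bowed"
transcript = "she went in the search of her own or of anything to say"
print(aligned_word_count(source, transcript))
print(is_similar(source, transcript))
```

Under this reading, a single inserted word ("the") shifts every later word by one position, so only the first few words still line up even though most of the transcript is correct.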

search nlp full-text-search similarity sentence-similarity
1 Answer

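One way to approach this (though possibly not the best way) is to first vectorize the words in the two sentences, that is, turn each sentence into a vector of numbers based on the words and short word sequences it contains. You then compare the two vectors for similarity.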

In terms of code, you can do the following in Python.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sentence_similarity(sentence1, sentence2):
    """Return a cosine similarity score between 0 and 1 for two sentences."""
    # Count word n-grams, from single words up to 3-word sequences (range can be adjusted).
    # Note: the default tokenizer lowercases the text and drops one-letter words such as "a" and "i".
    ngram_range = (1, 3)
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectors = vectorizer.fit_transform([sentence1, sentence2])

    # Cosine similarity is the dot product of the two count vectors
    # divided by the product of their lengths.
    similarity_matrix = cosine_similarity(vectors)
    similarity_score = similarity_matrix[0, 1]

    return similarity_score

Note that you need scikit-learn installed for the imports above to work. You can install it by running the following command in cmd or a terminal.

   pip install scikit-learn
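
To show what the vectorize-and-compare idea is doing, here is a rough standard-library sketch of the same technique: count word n-grams in each sentence and take the cosine of the two count vectors. The helper names are hypothetical, and the tokenization is a plain whitespace split, so scores will not exactly match scikit-learn's (which, by default, also drops one-letter words).

```python
import math
from collections import Counter


def ngram_counts(sentence, max_n=3):
    """Count all word n-grams of length 1..max_n in a sentence."""
    words = sentence.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def cosine(counts1, counts2):
    """Cosine similarity between two sparse count vectors."""
    shared = counts1.keys() & counts2.keys()
    dot = sum(counts1[key] * counts2[key] for key in shared)
    norm1 = math.sqrt(sum(v * v for v in counts1.values()))
    norm2 = math.sqrt(sum(v * v for v in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty sentence has nothing in common with anything
    return dot / (norm1 * norm2)


def ngram_similarity(sentence1, sentence2, max_n=3):
    """Score in [0, 1]; higher means more shared words and word sequences."""
    return cosine(ngram_counts(sentence1, max_n), ngram_counts(sentence2, max_n))


source = "she went in search of her or of anything to say she simply bowed"
transcript = "she went in the search of her own or of anything to say"
unrelated = "there was a most extraordinary noise going on shrinking rapidly she soon made out"

print(ngram_similarity(source, transcript))
print(ngram_similarity(unrelated, transcript))
```

Because n-grams are matched regardless of where they occur, an inserted or dropped word only removes the few n-grams that span it instead of breaking the whole alignment. The cutoff between "similar" and "not similar" still has to be chosen by testing on your own data.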