How to find similar substrings in a large string with a similarity score in Python?

Question

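I'm looking for more than a simple similarity score between two texts: I want a similarity score for a substring within a string. Say:

text1 = 'cat is sleeping on the mat'

text2 = 'The cat is sleeping on the red mat in the living room'

In the example above, all the words of text1 appear in text2, so the similarity should be 100%.

If some words of text1 are missing, the score should be lower.

I'm working with a large dataset of paragraphs of varying sizes, so finding a smaller paragraph inside a larger one with this kind of similarity score is crucial.

I have only found string similarity measures, such as cosine similarity and difflib similarity, that compare two whole strings, but nothing that scores a substring within another string.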

Answer 1

Based on your description, how about:

>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> score = 0.0
>>> for word in a:  # for every word in the smaller string
...     if word in b:  # b is still a string here, so this is a substring test
...         score += 1
...
>>> score / len(a)  # fraction of words found
1.0

If a word is missing, for example:

>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
...     if w in b:
...         score += 1
...
>>> score / len(c)
0.875

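Additionally, you can do as @roadrunner suggested: split b and store it as a set, using b = set(b.split(" ")), to speed things up. This reduces each membership check to O(1) on average and brings the overall algorithm to O(n).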
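As a minimal sketch of the set-based variant, the loop can be wrapped in a function. The name word_coverage and its parameter names are illustrative only and not from the original answer. Unlike a substring test against the raw string, a set of words only matches whole words, so "at" does not match "cat".

```python
def word_coverage(small, large):
    """Return the fraction of words in `small` that also occur as words in `large`."""
    small_words = small.split()
    large_words = set(large.split())  # set membership is O(1) on average
    if not small_words:
        return 0.0  # avoid dividing by zero for an empty query
    matched = sum(1 for word in small_words if word in large_words)
    return matched / len(small_words)


b = "the cat is sleeping on the red mat in the living room"
print(word_coverage("cat is sleeping on the mat", b))
print(word_coverage("the cat is not sleeping on the mat", b))
```

The comparison is case-sensitive as written; lowercasing both inputs first is one option if "The" and "the" should count as the same word.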

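Edit: You said you have already tried some metrics such as cosine similarity. However, I suspect you may benefit from looking at Levenshtein distance, which could be useful here as a complement to the solution above.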
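One way to combine fuzzy matching with the word-count score is sketched below. It is an assumption-laden illustration: it uses difflib.get_close_matches from the standard library, which ranks candidates by SequenceMatcher ratio and is a stand-in here rather than true Levenshtein distance. The function name fuzzy_word_coverage and the cutoff of 0.8 are hypothetical choices.

```python
from difflib import get_close_matches


def fuzzy_word_coverage(small, large, cutoff=0.8):
    """Fraction of words in `small` with a close match among the words of `large`.

    Closeness is difflib's SequenceMatcher ratio (not Levenshtein distance);
    `cutoff` is an illustrative threshold between 0 and 1.
    """
    small_words = small.lower().split()
    large_words = list(set(large.lower().split()))
    if not small_words:
        return 0.0
    hits = sum(
        1
        for word in small_words
        if get_close_matches(word, large_words, n=1, cutoff=cutoff)
    )
    return hits / len(small_words)


text2 = "The cat is sleeping on the red mat in the living room"
print(fuzzy_word_coverage("cat is sleping on the mat", text2))  # typo still counts
print(fuzzy_word_coverage("dog is sleeping", text2))
```

Each query word is compared against every distinct word of the larger text, so this costs more than the exact set lookup and is probably best reserved for inputs where typos are expected.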

Answer 2

You can also use collections.defaultdict to store the counts of the words in word_a that exist in word_b, then sum() the counts at the end and divide by the length of word_a:

from collections import defaultdict

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

d = defaultdict(int)
for word in word_a:
    if word in word_b:
        d[word] += 1

print(sum(d.values()) / len(word_a))

Which outputs:

0.875

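Note: Since we only care whether the words in word_a exist in word_b, converting word_b to a set() allows O(1) lookups, whereas keeping it as a list would make each lookup O(n). This makes the overall time complexity of the code above O(n).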

Answer 3

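Similar to DarkCygbus's answer, but the similarity is based on total character count rather than word count. On the other hand, this script only counts matches against complete words (text_2.split()).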

from __future__ import division  # only needed on Python 2

text_1 = 'cat is sleeping on the mat'
text_2 = 'The cat is sleeping on the red mat in the living room'
words_2 = set(text_2.split())  # split once; whole-word, case-sensitive lookup
no_match = 0
match = 0

for word in text_1.split():
    if word not in words_2:
        no_match += len(word)
    else:
        match += len(word)

similarity = match / (match + no_match)  # share of matched characters
print('{0:.0%}'.format(similarity))

Answer 4

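I think this can be achieved with Levenshtein distance combined with substring matching. What you can do is split a sentence into smaller words (using space as the separator) and then run the Levenshtein matching algorithm to match the individual words against your search string. Something like: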

def similar_word(string, substring):
    threshold=2

    def levenshtein_distance(s1, s2):
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]

        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0: dp[i][j] = j
                elif j == 0: dp[i][j] = i
                elif s1[i - 1] == s2[j - 1]: dp[i][j] = dp[i - 1][j - 1]
                else: dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
        return dp[m][n]

    for i in range(len(string) - len(substring) + 1):
        distance = levenshtein_distance(string[i:i + len(substring)], substring)
        if distance <= threshold: return True
    
    return False

https://gist.github.com/4f77616973/66a784c4c5921359299d603419a8f01b

Since you want a score, you can adapt the code above to return the distance instead of True/False.
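A possible adaptation along those lines is sketched here. It slides a window of len(substring) characters over the string, keeps the smallest edit distance, and turns it into a score between 0 and 1. The name best_window_score and the normalization 1 - distance / len(substring) are assumptions for illustration, not part of the original answer.

```python
def levenshtein_distance(s1, s2):
    """Classic edit distance, keeping only two rows of the DP table."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_window_score(string, substring):
    """Return (score, start) for the window of `string` closest to `substring`.

    score is 1 - distance / len(substring), so 1.0 means an exact match.
    Returns (0.0, -1) when no window of that size exists.
    """
    size = len(substring)
    if size == 0 or size > len(string):
        return 0.0, -1
    best_distance, best_start = size + 1, -1
    for start in range(len(string) - size + 1):
        distance = levenshtein_distance(string[start:start + size], substring)
        if distance < best_distance:
            best_distance, best_start = distance, start
    return 1 - best_distance / size, best_start


score, start = best_window_score("The cat is sleeping on the red mat", "cat is sleeping")
print(score, start)
```

Every window triggers a full distance computation, so the cost grows with both the text length and the square of the substring length; for a large dataset a dedicated fuzzy-matching library may be worth evaluating.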

Hope that helps! :)
