How do I highlight the differences between two strings in Python?


I want to use Python code to highlight, in color, the differences between two strings.

Example 1:

sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
sentence2 = "I am enjoying the summer breeze on the beach while I am doing some pilates."

Expected result (the parts between asterisks should be red):

 I *am* enjoying the summer breeze on the beach while I *am doing* some pilates.

Example 2:

sentence1: "My favourite season is Autumn while my sister's favourite season is Winter."
sentence2: "My favourite season is Autumn, while my sister's favourite season is Winter."

Expected result (only the comma differs):

"My favourite season is Autumn*,* while my sister's favourite season is Winter." 

I tried this:

sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
sentence2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."

# Split the sentences into words
words1 = sentence1.split()
words2 = sentence2.split()

# Find the index where the sentences differ
index_of_difference = next((i for i, (word1, word2) in enumerate(zip(words1, words2)) if word1 != word2), None)

# Highlight differing part "am doing" in red
highlighted_words = []
for i, (word1, word2) in enumerate(zip(words1, words2)):
    if i == index_of_difference:
        highlighted_words.append('\033[91m' + word2 + '\033[0m')
    else:
        highlighted_words.append(word2)

highlighted_sentence = ' '.join(highlighted_words)
print(highlighted_sentence)

I got this:

I'm enjoying the summer breeze on the beach while I *am* doing some

Instead of this:

I'm enjoying the summer breeze on the beach while I *am doing* some pilates.

How can I fix this?

python string colors nlp difference
2 Answers
0 votes

I suggest using the get_matching_blocks() method of difflib's SequenceMatcher to separate the matching substrings from the rest.

Here is an example that rebuilds each string from scratch and wraps a substring in * whenever it falls outside a match.

Code:

import difflib as dl


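def line_builder(blocks, sentence, a_or_b):
    # Rebuild `sentence`, wrapping every span outside a matching block in asterisks.
    new_sentence = ""
    position = 0
    for block in blocks:
        match_start = getattr(block, a_or_b)
        # The last block is a zero-size sentinel at the end of the string.
        # It is processed like any other so an unmatched tail is not dropped.
        if match_start > position:
            new_sentence += f"*{sentence[position:match_start]}*"
        new_sentence += sentence[match_start: match_start + block.size]
        position = match_start + block.size
    return new_sentence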


def print_diffs(a, b):
    s = dl.SequenceMatcher(a=a, b=b)
    m = s.get_matching_blocks()
    new_a = line_builder(m, a, "a")
    new_b = line_builder(m, b, "b")
    print(f"Here are the differences\n\t{new_a}\n\t{new_b}")

Here it is in action:

$ python -i text_diffs.py 
>>> print_diffs("This is the most fun I have ever had", "This was the most fun I could have ever had")
Here are the differences
    This *i*s the most fun I have ever had
    This *wa*s the most fun I *could *have ever had
>>> 

Keep in mind that this may not work for every example. It was written quickly and may still need some tuning for edge cases.
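Since the question splits the sentences into words, the same idea can be sketched at word level. This is a minimal sketch, not the answer's own code: it runs SequenceMatcher on the two word lists and uses get_opcodes() instead of get_matching_blocks(). The function name mark_word_diffs and its start/end parameters are made up for illustration.

```python
from difflib import SequenceMatcher


def mark_word_diffs(a, b, start="*", end="*"):
    """Return b with every run of words that differs from a wrapped in start/end."""
    words_a, words_b = a.split(), b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    out = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        chunk = " ".join(words_b[j1:j2])
        if not chunk:
            continue  # "delete" opcode: nothing from b to show
        out.append(chunk if tag == "equal" else f"{start}{chunk}{end}")
    return " ".join(out)


sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
sentence2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."

print(mark_word_diffs(sentence1, sentence2))
# I'm enjoying the summer breeze on the beach while I *am doing* some pilates.

# Same call with ANSI escape codes, so the differing words print in red:
print(mark_word_diffs(sentence1, sentence2, start="\033[91m", end="\033[0m"))
```

Because the comparison is per word, a change such as "Autumn" to "Autumn," marks the whole word rather than just the comma. Comparing the raw strings character by character gives the finer result.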


0 votes

Use difflib to get the matching blocks:

from difflib import SequenceMatcher

s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."

x = SequenceMatcher(None, s1, s2)
m = x.get_matching_blocks()

[out]:

[Match(a=0, b=0, size=52),
 Match(a=52, b=55, size=2),
 Match(a=54, b=60, size=14),
 Match(a=68, b=74, size=0)]

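Then use ANSI color codes to color the matching substrings red (the text that differs stays uncolored):

i = 0  # end of the previous matching block in s2
s2_new = ""
for m in x.get_matching_blocks():
    if m.b > i:
        s2_new += s2[i:m.b]
    s2_new += f"\033[91m{s2[m.b:m.b+m.size]}\033[0m"
    i = m.b + m.size

print(s2_new)

[out] (escape codes shown literally):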

\x1b[91mI'm enjoying the summer breeze on the beach while I \x1b[0mam \x1b[91mdo\x1b[0ming\x1b[91m some pilates.\x1b[0m\x1b[91m\x1b[0m
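The loop above colors the parts that match. If the goal is the reverse, as in the question, one possible sketch is to walk get_opcodes() and color only the text of s2 that was inserted or replaced. The helper name highlight_new_text is hypothetical, and autojunk=False is my own choice to keep the comparison predictable on longer inputs.

```python
from difflib import SequenceMatcher

RED, RESET = "\033[91m", "\033[0m"


def highlight_new_text(s1, s2):
    """Return s2 with the characters that are new relative to s1 wrapped in red."""
    parts = []
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        segment = s2[j1:j2]
        if tag in ("insert", "replace") and segment:
            parts.append(f"{RED}{segment}{RESET}")
        else:
            # "equal" text is copied as-is; "delete" contributes an empty slice
            parts.append(segment)
    return "".join(parts)


s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."

print(highlight_new_text(s1, s2))
```

At character level this colors "am " and "ing" separately, because the "do" inside "doing" still matches the "do" in s1.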

Or, if you want fewer, larger blocks than get_matching_blocks() returns, try:

from difflib import SequenceMatcher

s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."

x = SequenceMatcher(None, s1, s2)

matches = []
a, b = 0, 0
while True:
    m = x.find_longest_match(alo=a, ahi=len(s1), blo=b, bhi=len(s2))
    a, b = m.a + m.size, m.b + m.size
    if m.size == 0:
        break
    else:
        matches.append(m)
        
print(matches)

[out]:

[Match(a=0, b=0, size=52), Match(a=54, b=60, size=14)]
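To turn that list of matches into a highlighted string, everything in s2 that no match covers can be wrapped in red. The sketch below assumes the same greedy loop as above. The names greedy_matches and highlight_gaps are made up for illustration. One caveat: the loop only searches to the right of each match, so text before the first longest match is always treated as a difference.

```python
from difflib import SequenceMatcher

RED, RESET = "\033[91m", "\033[0m"


def greedy_matches(s1, s2):
    """Collect longest matches from left to right, as in the loop above."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    matches, a, b = [], 0, 0
    while True:
        m = matcher.find_longest_match(a, len(s1), b, len(s2))
        if m.size == 0:
            return matches
        matches.append(m)
        a, b = m.a + m.size, m.b + m.size


def highlight_gaps(matches, s2):
    """Return s2 with every span not covered by a match wrapped in red."""
    out, position = [], 0
    for m in matches:
        if m.b > position:
            out.append(f"{RED}{s2[position:m.b]}{RESET}")
        out.append(s2[m.b:m.b + m.size])
        position = m.b + m.size
    if position < len(s2):
        out.append(f"{RED}{s2[position:]}{RESET}")  # unmatched tail
    return "".join(out)


s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."

print(highlight_gaps(greedy_matches(s1, s2), s2))
```

For these two sentences the only gap is s2[52:60], so "am doing" is colored as a single span.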