I want to use Python code to highlight the differences between two strings in color.
Example 1:
sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
sentence2 = "I am enjoying the summer breeze on the beach while I am doing some pilates."
Expected result (the parts between asterisks are shown in red):
I *am* enjoying the summer breeze on the beach while I *am doing* some pilates.
Example 2:
sentence1: "My favourite season is Autumn while my sister's favourite season is Winter."
sentence2: "My favourite season is Autumn, while my sister's favourite season is Winter."
Expected result (the comma is the difference):
"My favourite season is Autumn*,* while my sister's favourite season is Winter."
I tried this:
sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
sentence2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."
# Split the sentences into words
words1 = sentence1.split()
words2 = sentence2.split()
# Find the index where the sentences differ
index_of_difference = next((i for i, (word1, word2) in enumerate(zip(words1, words2)) if word1 != word2), None)
# Highlight differing part "am doing" in red
highlighted_words = []
for i, (word1, word2) in enumerate(zip(words1, words2)):
    if i == index_of_difference:
        highlighted_words.append('\033[91m' + word2 + '\033[0m')
    else:
        highlighted_words.append(word2)
highlighted_sentence = ' '.join(highlighted_words)
print(highlighted_sentence)
I got this:
I'm enjoying the summer breeze on the beach while I *am* doing some
Instead of this:
I'm enjoying the summer breeze on the beach while I *am doing* some pilates.
How can I fix this?
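I suggest using the get_matching_blocks() method of difflib's SequenceMatcher to isolate the matching substrings from the rest.
Here is an example I put together: it rebuilds each sentence from scratch, but wraps any substring that falls outside a match in *.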
import difflib as dl

def line_builder(blocks, sentence, a_or_b):
    # Rebuild `sentence`, wrapping every unmatched stretch in asterisks.
    new_sentence = ""
    position = 0
    for block in blocks:
        # Start of this matching block in sentence "a" or sentence "b"
        match_start = getattr(block, a_or_b)
        # The final zero-size block points at the end of the sentence, so it
        # is not skipped: it flushes any unmatched trailing text.
        if match_start > position:
            new_sentence += f"*{sentence[position:match_start]}*"
        new_sentence += sentence[match_start: match_start + block.size]
        position = match_start + block.size
    return new_sentence

def print_diffs(a, b):
    s = dl.SequenceMatcher(a=a, b=b)
    m = s.get_matching_blocks()
    new_a = line_builder(m, a, "a")
    new_b = line_builder(m, b, "b")
    print(f"Here are the differences\n\t{new_a}\n\t{new_b}")
Here it is in action:
$ python -i text_diffs.py
>>> print_diffs("This is the most fun I have ever had", "This was the most fun I could have ever had")
Here are the differences
This *i*s the most fun I have ever had
This *wa*s the most fun I *could *have ever had
>>>
Keep in mind that this may not work for every example. It was written quickly and may still need some tweaking for edge cases.
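Since the question asks for whole words such as "am doing" to be highlighted, one possible variation is to run SequenceMatcher over lists of words instead of characters and read the result with get_opcodes(). This is only a sketch under that assumption. The names highlight_words, RED and RESET are made up for illustration, and the start/end parameters exist only so the markers can be swapped for asterisks.

```python
from difflib import SequenceMatcher

RED = "\033[91m"    # ANSI escape: bright red foreground
RESET = "\033[0m"   # ANSI escape: reset attributes


def highlight_words(old, new, start=RED, end=RESET):
    """Return `new` with the words that differ from `old` wrapped in start/end."""
    old_words, new_words = old.split(), new.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    out = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        chunk = " ".join(new_words[j1:j2])
        if not chunk:
            continue  # 'delete' opcodes contribute no words to `new`
        # 'equal' runs are copied as-is; 'replace'/'insert' runs are marked
        out.append(chunk if tag == "equal" else f"{start}{chunk}{end}")
    return " ".join(out)


s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."
print(highlight_words(s1, s2))
print(highlight_words(s1, s2, "*", "*"))
# I'm enjoying the summer breeze on the beach while I *am doing* some pilates.
```

Because the comparison is per word, a change inside a word marks the whole word: for Example 2 this would highlight "Autumn," rather than only the comma. Splitting on whitespace also normalizes repeated spaces, which may or may not matter for your input.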
Use difflib to get the matching blocks:
from difflib import SequenceMatcher
s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."
x = SequenceMatcher(None, s1, s2)
m = x.get_matching_blocks()
[out]:
[Match(a=0, b=0, size=52),
Match(a=52, b=55, size=2),
Match(a=54, b=60, size=14),
Match(a=68, b=74, size=0)]
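To make the list above easier to read: each Match(a, b, size) says that s1[a:a+size] equals s2[b:b+size], and the last entry is always a zero-size marker pointing at the ends of both strings. A small sketch that prints the shared text for each block (the variable name shared is just for illustration):

```python
from difflib import SequenceMatcher

s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."

blocks = SequenceMatcher(None, s1, s2).get_matching_blocks()

for m in blocks:
    # The same text sits at offset m.a in s1 and at offset m.b in s2
    assert s1[m.a:m.a + m.size] == s2[m.b:m.b + m.size]
    print(m, repr(s1[m.a:m.a + m.size]))

# Text shared by both strings, one entry per block
shared = [s1[m.a:m.a + m.size] for m in blocks]
```

Everything in s2 that falls between two consecutive blocks ("am " and "ing" here) is what differs from s1.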
Then use the ANSI color strings to put color on the substrings:
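s2_new = ""
i = 0  # end of the previous matching block in s2
for m in x.get_matching_blocks():
    if m.b > i:
        # Text between two matching blocks (the differing part) is copied as-is
        s2_new += s2[i:m.b]
    # The matching block itself is wrapped in the red escape codes
    s2_new += f"\033[91m{s2[m.b:m.b+m.size]}\033[0m"
    i = m.b + m.size
print(s2_new)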
[out]:
\x1b[91mI'm enjoying the summer breeze on the beach while I \x1b[0mam \x1b[91mdo\x1b[0ming\x1b[91m some pilates.\x1b[0m\x1b[91m\x1b[0m
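In the output above, the red escape codes end up around the matching blocks, while the question asks for the differing text to be red. A minimal sketch of the inverted version follows, assuming the same character-level matching. The function name highlight_diff and its start/end parameters are illustrative, not part of difflib.

```python
from difflib import SequenceMatcher


def highlight_diff(old, new, start="\033[91m", end="\033[0m"):
    """Return `new` with the text that has no match in `old` wrapped in start/end."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    out = []
    pos = 0  # end of the previous matching block in `new`
    for block in matcher.get_matching_blocks():
        if block.b > pos:
            # Gap before this block: text that only exists in `new`
            out.append(f"{start}{new[pos:block.b]}{end}")
        out.append(new[block.b:block.b + block.size])
        pos = block.b + block.size
    # The final zero-size block sits at len(new), so trailing text is covered
    return "".join(out)


s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."
print(highlight_diff(s1, s2))
print(highlight_diff(s1, s2, "*", "*"))
# I'm enjoying the summer breeze on the beach while I *am *do*ing* some pilates.
```

Note that character-level matching splits "am doing" around the shared "do", so it does not give the word-level result the question expects.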
Or, if you want coarser granularity than get_matching_blocks() (fewer, larger matches), try:
from difflib import SequenceMatcher
s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."
x = SequenceMatcher(None, s1, s2)
matches = []
a, b = 0, 0
while True:
    m = x.find_longest_match(alo=a, ahi=len(s1), blo=b, bhi=len(s2))
    a, b = m.a + m.size, m.b + m.size
    if m.size == 0:
        break
    else:
        matches.append(m)
print(matches)
[out]:
[Match(a=0, b=0, size=52), Match(a=54, b=60, size=14)]
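The matches list above can be turned into highlighted output by coloring whatever in s2 falls outside those matches. The following is a rough sketch under that assumption. The helper names greedy_matches and highlight_outside are hypothetical, and the loop simply mirrors the one above.

```python
from difflib import SequenceMatcher


def greedy_matches(a, b):
    """Collect longest matches left to right, mirroring the loop above."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matches, i, j = [], 0, 0
    while True:
        m = matcher.find_longest_match(i, len(a), j, len(b))
        if m.size == 0:
            return matches
        matches.append(m)
        # Continue searching only after the end of this match
        i, j = m.a + m.size, m.b + m.size


def highlight_outside(b, matches, start="\033[91m", end="\033[0m"):
    """Wrap every part of `b` not covered by `matches` in start/end."""
    out, pos = [], 0
    for m in matches:
        if m.b > pos:
            out.append(f"{start}{b[pos:m.b]}{end}")
        out.append(b[m.b:m.b + m.size])
        pos = m.b + m.size
    if pos < len(b):
        # No sentinel block here, so flush trailing text explicitly
        out.append(f"{start}{b[pos:]}{end}")
    return "".join(out)


s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."
print(highlight_outside(s2, greedy_matches(s1, s2), "*", "*"))
# I'm enjoying the summer breeze on the beach while I *am doing* some pilates.
```

For this pair the result matches what the question expects. One caveat: each search starts after the previous match, so anything that precedes the first longest match is treated as a difference even if a shorter match exists there.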