Python difflib.SequenceMatcher comparison problem


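I am trying to compare two large text strings, each of which can contain about 15,000 characters. By comparing the two strings I need to find the replace, insert, delete, and equal operations, together with their start and end character positions, so that further processing can be done correctly.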

I have tried the difflib library, but it does not give good results.

import difflib
para1 = "In the year 2045, the world experienced a technological revolution like never before. The advances in artificial intelligence, robotics, and biotechnology had transformed every aspect of our lives. From healthcare to transportation, from education to entertainment, the impact of these innovations was profound. One of the most remarkable achievements of this era was the development of AI-powered personal assistants. These intelligent beings were capable of understanding and responding to human language with unparalleled accuracy. They could perform tasks, answer questions, and even engage in meaningful conversations. As AI continued to evolve, it became an integral part of our daily routines. People relied on AI for managing their schedules, making decisions, and even providing emotional support. It seemed that there was no limit to what these machines could do. The ethical implications of AI's increasing dominance over human affairs became a topic of heated debate. While some celebrated the convenience and progress it brought, others raised concerns about privacy, security, and the potential loss of human jobs. Despite the ongoing discussions and debates, AI's influence in our lives kept growing. It was a world where humans and machines coexisted, sometimes harmoniously and sometimes with friction. The future was uncertain, but one thing was clear: technology had forever changed the course of human history. This is the end of Text 1."
para2 = "In the year 2045, the world went through a technological revolution of unprecedented proportions. The leaps in artificial intelligence, robotics, and biotechnology had completely reshaped all aspects of our existence. From medical care to transportation, from learning to entertainment, the impact of these innovations was profound. One of the most extraordinary accomplishments of this era was the emergence of AI-driven personal assistants. These intelligent entities had the capability to comprehend and react to human language with remarkable precision. They could execute tasks, provide answers to queries, and even engage in substantial conversations. As AI kept progressing, it turned into an essential element of our everyday lives. People depended on AI for handling their schedules, making choices, and even offering emotional support. It seemed as if there were no boundaries to the potential of these machines. The moral questions surrounding AI's growing authority over human affairs became a subject of fervent discussion. While some celebrated the convenience and advancement it brought, others expressed worries about privacy, security, and the possible loss of human employment. Despite the continuous conversations and debates, AI's sway over our lives continued to expand. It was a world where humans and machines coexisted, sometimes peacefully and at times with friction. The future remained uncertain, but one thing was evident: technology had permanently altered the trajectory of human history. This is the end of Text 2."

op = difflib.SequenceMatcher(None, para1, para2)
op.get_opcodes()

Output:

[('equal', 0, 28, 0, 28),
 ('insert', 28, 28, 28, 207),
 ('equal', 28, 30, 207, 209),
 ('replace', 30, 83, 209, 215),
 ('equal', 83, 86, 215, 218),
 ('replace', 86, 125, 218, 253),
 ('equal', 125, 127, 253, 255),
 ('replace', 127, 135, 255, 285),
 ('equal', 135, 137, 285, 287),
 ('replace', 137, 196, 287, 331),
 ('equal', 196, 198, 331, 333),
 ('replace', 198, 297, 333, 390),
 ('equal', 297, 302, 390, 395),
 ('replace', 302, 383, 395, 408),
 ('equal', 383, 390, 408, 415),
 ('replace', 390, 392, 415, 531),
 ('equal', 392, 393, 531, 532),
 ('replace', 393, 417, 532, 556),
 ('equal', 417, 422, 556, 561),
 ('replace', 422, 437, 561, 633),
 ('equal', 437, 438, 633, 634),
 ('replace', 438, 607, 634, 641),
 ('equal', 607, 630, 641, 664),
 ('replace', 630, 649, 664, 680),
 ('equal', 649, 654, 680, 685),
 ('replace', 654, 697, 685, 737),
 ('equal', 697, 708, 737, 748),
 ('replace', 708, 712, 748, 754),
 ('equal', 712, 725, 754, 767),
 ('replace', 725, 730, 767, 772),
 ('equal', 730, 758, 772, 800),
 ('replace', 758, 766, 800, 806),
 ('equal', 766, 778, 806, 818),
 ('replace', 778, 784, 818, 823),
 ('equal', 784, 817, 823, 856),
 ('replace', 817, 821, 856, 861),
 ('equal', 821, 829, 861, 869),
 ('replace', 829, 872, 869, 921),
 ('equal', 872, 878, 921, 927),
 ('replace', 878, 901, 927, 954),
 ('equal', 901, 907, 954, 960),
 ('replace', 907, 927, 960, 977),
 ('equal', 927, 956, 977, 1006),
 ('replace', 956, 974, 1006, 1008),
 ('equal', 974, 975, 1008, 1009),
 ('replace', 975, 978, 1009, 1035),
 ('equal', 978, 1022, 1035, 1079),
 ('replace', 1022, 1030, 1079, 1090),
 ('equal', 1030, 1050, 1090, 1110),
 ('replace', 1050, 1064, 1110, 1126),
 ('equal', 1064, 1101, 1126, 1163),
 ('replace', 1101, 1125, 1163, 1166),
 ('equal', 1125, 1126, 1166, 1167),
 ('replace', 1126, 1127, 1167, 1194),
 ('equal', 1127, 1141, 1194, 1208),
 ('replace', 1141, 1156, 1208, 1228),
 ('equal', 1156, 1179, 1228, 1251),
 ('replace', 1179, 1210, 1251, 1252),
 ('equal', 1210, 1211, 1252, 1253),
 ('replace', 1211, 1214, 1253, 1290),
 ('equal', 1214, 1278, 1290, 1354),
 ('replace', 1278, 1299, 1354, 1372),
 ('equal', 1299, 1331, 1372, 1404),
 ('insert', 1331, 1331, 1404, 1438),
 ('equal', 1331, 1335, 1438, 1442),
 ('replace', 1335, 1369, 1442, 1449),
 ('equal', 1369, 1386, 1449, 1466),
 ('replace', 1386, 1412, 1466, 1500),
 ('equal', 1412, 1455, 1500, 1543),
 ('replace', 1455, 1456, 1543, 1544),
 ('equal', 1456, 1457, 1544, 1545)]

The output above contains ('insert', 28, 28, 28, 207), which means that after character position 28 the following string was added:

para2[28:207]

'went through a technological revolution of unprecedented proportions. The leaps in artificial intelligence, robotics, and biotechnology had completely reshaped all aspects of our ' 

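In reality, though, "experienced" was changed to "went through", and from character position 41 onward "a technological revolution" is the same in both strings, yet that match is not captured.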

I have also tried converting the strings to lists and comparing those, but that did not work very well either.

get_opcodes has all the information I need, but the results are very inaccurate. Is there a workaround, another library, or any NLP approach that gives good results?
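One documented behaviour that may contribute to the result above (this is an assumption, not something the post confirms): SequenceMatcher has an autojunk heuristic that is on by default. When the second sequence has at least 200 items, any item whose duplicates make up more than 1% of it is marked as "popular" and treated as junk for matching. In ordinary prose that covers most letters and the space character. The sketch below uses short stand-in strings and a hypothetical helper, describe_opcodes, to show which characters are marked popular and how to read opcodes with the heuristic turned off.

```python
import difflib

def describe_opcodes(a, b, autojunk=True):
    """Return (tag, a_text, b_text) triples so each opcode can be read directly."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=autojunk)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

# Stand-in strings (not the paragraphs from the question), repeated so that
# len(b) >= 200, the length at which the heuristic becomes active.
a = "the world experienced a technological revolution like never before. " * 4
b = "the world went through a technological revolution of unprecedented proportions. " * 4

default = difflib.SequenceMatcher(None, a, b)                 # autojunk=True
exact = difflib.SequenceMatcher(None, a, b, autojunk=False)   # heuristic disabled

# Characters of b that the heuristic marked as "popular".
print("popular with autojunk=True: ", sorted(default.bpopular))
print("popular with autojunk=False:", sorted(exact.bpopular))

# First few opcodes with the heuristic disabled, shown with their text.
for tag, old, new in describe_opcodes(a, b, autojunk=False)[:6]:
    print(f"{tag:8} {old!r} -> {new!r}")
```

Passing autojunk=False to the matcher in the question is a cheap thing to try. It can make matching slower on long inputs, and character-level opcodes may still split words in ways that look odd to a human reader.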

python diff
1 Answer

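One workaround is to convert the strings into sequences of words, which produces far fewer surprises for a human reader than character-based edits: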

import difflib

para1 = "In the year 2045, the world experienced a technological revolution like never before. The advances in artificial intelligence, robotics, and biotechnology had transformed every aspect of our lives. From healthcare to transportation, from education to entertainment, the impact of these innovations was profound. One of the most remarkable achievements of this era was the development of AI-powered personal assistants. These intelligent beings were capable of understanding and responding to human language with unparalleled accuracy. They could perform tasks, answer questions, and even engage in meaningful conversations. As AI continued to evolve, it became an integral part of our daily routines. People relied on AI for managing their schedules, making decisions, and even providing emotional support. It seemed that there was no limit to what these machines could do. The ethical implications of AI's increasing dominance over human affairs became a topic of heated debate. While some celebrated the convenience and progress it brought, others raised concerns about privacy, security, and the potential loss of human jobs. Despite the ongoing discussions and debates, AI's influence in our lives kept growing. It was a world where humans and machines coexisted, sometimes harmoniously and sometimes with friction. The future was uncertain, but one thing was clear: technology had forever changed the course of human history. This is the end of Text 1."
para2 = "In the year 2045, the world went through a technological revolution of unprecedented proportions. The leaps in artificial intelligence, robotics, and biotechnology had completely reshaped all aspects of our existence. From medical care to transportation, from learning to entertainment, the impact of these innovations was profound. One of the most extraordinary accomplishments of this era was the emergence of AI-driven personal assistants. These intelligent entities had the capability to comprehend and react to human language with remarkable precision. They could execute tasks, provide answers to queries, and even engage in substantial conversations. As AI kept progressing, it turned into an essential element of our everyday lives. People depended on AI for handling their schedules, making choices, and even offering emotional support. It seemed as if there were no boundaries to the potential of these machines. The moral questions surrounding AI's growing authority over human affairs became a subject of fervent discussion. While some celebrated the convenience and advancement it brought, others expressed worries about privacy, security, and the possible loss of human employment. Despite the continuous conversations and debates, AI's sway over our lives continued to expand. It was a world where humans and machines coexisted, sometimes peacefully and at times with friction. The future remained uncertain, but one thing was evident: technology had permanently altered the trajectory of human history. This is the end of Text 2."
para1 = para1.split()
para2 = para2.split()
for m in difflib.SequenceMatcher(None, para1, para2).get_matching_blocks():
    print(' '.join(para1[m.a:m.a + m.size]))

Output:

In the year 2045, the world
a technological revolution
The
in artificial intelligence, robotics, and biotechnology had
...

If needed, you can convert the word-based edits back into character-based edits.
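A minimal sketch of that conversion, assuming words are separated by whitespace. The helper names tokenize and word_opcodes_as_char_spans are made up for illustration. The idea is to record each word's start and end offsets while tokenizing, diff the word lists, and then translate every opcode's word indices back into character positions in the original strings.

```python
import difflib
import re

def tokenize(text):
    """Split on whitespace, keeping each word's (start, end) character span."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def word_opcodes_as_char_spans(text1, text2):
    """Diff by words, then report each opcode in character offsets."""
    toks1, toks2 = tokenize(text1), tokenize(text2)
    words1 = [w for w, _, _ in toks1]
    words2 = [w for w, _, _ in toks2]
    matcher = difflib.SequenceMatcher(None, words1, words2, autojunk=False)

    def span(toks, lo, hi, text_len):
        # Non-empty word range: first word's start to last word's end.
        if lo < hi:
            return toks[lo][1], toks[hi - 1][2]
        # Empty range (pure insert or delete): zero-width span where the
        # next word starts, or at the end of the text.
        pos = toks[lo][1] if lo < len(toks) else text_len
        return pos, pos

    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a_start, a_end = span(toks1, i1, i2, len(text1))
        b_start, b_end = span(toks2, j1, j2, len(text2))
        result.append((tag, a_start, a_end, b_start, b_end))
    return result

text1 = "In the year 2045, the world experienced a technological revolution like never before."
text2 = "In the year 2045, the world went through a technological revolution of unprecedented proportions."

for tag, a1, a2, b1, b2 in word_opcodes_as_char_spans(text1, text2):
    print(f"{tag:8} {text1[a1:a2]!r} -> {text2[b1:b2]!r}")
```

The spans returned here cover the words only, so the whitespace between two neighbouring opcodes belongs to neither of them. Punctuation stays attached to its word ("2045," is one token); a different regular expression would be needed to treat punctuation as separate tokens.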
