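I'm programming in Python and would like to know whether there is a simple way to iterate over (and parse) the individual chunks of a text diff as produced by the diff utility.
Pseudo-Python code: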
textdiff = diff('file1', 'file2')  # Pretend this is a way of producing a diff
for chunk in textdiff:
    print('These lines are deleted: ', chunk.deletions)
    ...  # And do some other interesting stuff with the extracted contents of the chunk.
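Use the SequenceMatcher class from the difflib library. This is the class that the diff functions in difflib use. Using this class directly lets you iterate over the "chunks" of a diff without having to parse the output of any diff function.
Here is an example script that illustrates the idea: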
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Iterate over the chunks of a diff between two files."""

from difflib import SequenceMatcher


def diff(a, b):
    """Iterate over the chunks of a diff between two files.

    Examples:

    Replace a single line:

    >>> for chunk in diff(["1\\n"], ["2\\n"]):
    ...     print(chunk)
    {'tag': 'replace', 'i1': 0, 'i2': 1, 'j1': 0, 'j2': 1, 'a': ['1\\n'], 'b': ['2\\n']}

    Delete a single line:

    >>> for chunk in diff(["1\\n"], []):
    ...     print(chunk)
    {'tag': 'delete', 'i1': 0, 'i2': 1, 'j1': 0, 'j2': 0, 'a': ['1\\n'], 'b': []}

    Insert a single line:

    >>> for chunk in diff([], ["2\\n"]):
    ...     print(chunk)
    {'tag': 'insert', 'i1': 0, 'i2': 0, 'j1': 0, 'j2': 1, 'a': [], 'b': ['2\\n']}
    """
    # Iterate over the groups of opcodes and yield the chunks as dicts
    for group in SequenceMatcher(a=a, b=b).get_grouped_opcodes(n=0):
        for tag, i1, i2, j1, j2 in group:
            # Skip 'equal' opcodes: only changed regions are reported
            if tag in ('replace', 'delete', 'insert'):
                patch = {
                    "tag": tag,
                    "i1": i1,
                    "i2": i2,
                    "j1": j1,
                    "j2": j2,
                    "a": a[i1:i2],
                    "b": b[j1:j2],
                }
                yield patch


if __name__ == "__main__":
    import sys

    # Get the contents of the first file as a list of lines
    with open(sys.argv[1]) as f:
        a = f.readlines()
    # Get the contents of the second file as a list of lines
    with open(sys.argv[2]) as f:
        b = f.readlines()
    # Iterate over the chunks of the diff and print them
    for chunk in diff(a, b):
        print(chunk)
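Here is the source code of the difflib library: https://github.com/python/cpython/blob/3.12/Lib/difflib.py
Looking at the difflib source, its diff functions appear to rely on the SequenceMatcher class. This class provides a way to iterate over the "chunks" of a diff.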
For example, consider the unified_diff function, which can be found here: https://github.com/python/cpython/blob/12a30bc1aa0586308bf3fe12c915bcc5e54a032f/Lib/difflib.py#L1095-L1161
Here is the implementation of unified_diff:
def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
                 tofiledate='', n=3, lineterm='\n'):
    _check_types(a, b, fromfile, tofile, fromfiledate, tofiledate, lineterm)
    started = False
    for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
        if not started:
            started = True
            fromdate = '\t{}'.format(fromfiledate) if fromfiledate else ''
            todate = '\t{}'.format(tofiledate) if tofiledate else ''
            yield '--- {}{}{}'.format(fromfile, fromdate, lineterm)
            yield '+++ {}{}{}'.format(tofile, todate, lineterm)

        first, last = group[0], group[-1]
        file1_range = _format_range_unified(first[1], last[2])
        file2_range = _format_range_unified(first[3], last[4])
        yield '@@ -{} +{} @@{}'.format(file1_range, file2_range, lineterm)

        for tag, i1, i2, j1, j2 in group:
            if tag == 'equal':
                for line in a[i1:i2]:
                    yield ' ' + line
                continue
            if tag in {'replace', 'delete'}:
                for line in a[i1:i2]:
                    yield '-' + line
            if tag in {'replace', 'insert'}:
                for line in b[j1:j2]:
                    yield '+' + line
As you can see, it is the get_grouped_opcodes method of the SequenceMatcher class that iterates over the chunks of the diff.
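To make the link between groups and hunks concrete, here is a small sketch that compares the groups yielded by get_grouped_opcodes with the "@@" headers emitted by unified_diff. The sample lines in a and b are invented for illustration; with two changes far enough apart, both calls should report two chunks.

```python
from difflib import SequenceMatcher, unified_diff

# Made-up sample data: 20 lines with one change near each end.
a = [f"line {i}\n" for i in range(20)]
b = list(a)
b[2] = "changed near the top\n"
b[17] = "changed near the bottom\n"

# One group of opcodes per chunk, with up to n lines of context on each side.
groups = list(SequenceMatcher(None, a, b).get_grouped_opcodes(n=3))

# unified_diff emits exactly one "@@ ... @@" header per group.
hunk_headers = [line for line in unified_diff(a, b, n=3) if line.startswith("@@")]

for group, header in zip(groups, hunk_headers):
    first, last = group[0], group[-1]
    print(header.rstrip(), f"covers a[{first[1]}:{last[2]}] and b[{first[3]}:{last[4]}]")
```

The header shows 1-based line numbers and lengths, while the opcodes carry 0-based slice indices.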
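More precisely, the get_grouped_opcodes method returns a generator that yields lists of tuples, each list describing one chunk of the diff between the two files. Each tuple has the following form:
(tag, i1, i2, j1, j2)
where tag identifies the relationship between the two blocks (tag is one of replace, delete, insert, or equal), i1 and i2 are indices delimiting the block a[i1:i2] in file a, and j1 and j2 are indices delimiting the corresponding block b[j1:j2] in file b.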
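As a quick illustration of the tuple format, the sketch below flattens the groups for two short, made-up line lists and prints the slice each opcode refers to. The names a, b and opcodes are only for this example.

```python
from difflib import SequenceMatcher

# Made-up sample data: one line replaced, one line appended.
a = ["one\n", "two\n", "three\n"]
b = ["one\n", "2\n", "three\n", "four\n"]

matcher = SequenceMatcher(None, a, b)

# Flatten the groups into a single list of (tag, i1, i2, j1, j2) tuples.
opcodes = [op for group in matcher.get_grouped_opcodes() for op in group]

for tag, i1, i2, j1, j2 in opcodes:
    # a[i1:i2] is the block in the first file, b[j1:j2] the block in the second.
    print(f"{tag:8} a[{i1}:{i2}]={a[i1:i2]!r}  b[{j1}:{j2}]={b[j1:j2]!r}")
```

For an insert opcode, i1 equals i2 (nothing is taken from a). For a delete opcode, j1 equals j2.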
The unified_diff function produces its output in the unified diff format (https://en.wikipedia.org/wiki/Diff#Unified_format), but we can adapt this function to do other things.
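One possible adaptation, sketched here under my own naming, mirrors the chunk.deletions idea from the question: instead of formatting text, it collects the deleted and inserted lines of each group into a small object. Hunk and iter_hunks are hypothetical names for this example and are not part of difflib.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Hunk(NamedTuple):
    """Hypothetical container for one chunk of a diff (not a difflib type)."""
    deletions: list   # lines present only in a
    insertions: list  # lines present only in b


def iter_hunks(a, b, n=3):
    """Yield one Hunk per group of opcodes, using n lines of context to group."""
    for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n):
        deletions, insertions = [], []
        for tag, i1, i2, j1, j2 in group:
            # Same tag handling as unified_diff, minus the text formatting.
            if tag in {"replace", "delete"}:
                deletions.extend(a[i1:i2])
            if tag in {"replace", "insert"}:
                insertions.extend(b[j1:j2])
        yield Hunk(deletions, insertions)


# Made-up sample data for a quick check.
old = ["one\n", "two\n", "three\n"]
new = ["one\n", "2\n", "three\n", "four\n"]
hunks = list(iter_hunks(old, new))
for hunk in hunks:
    print("These lines are deleted: ", hunk.deletions)
    print("These lines are inserted:", hunk.insertions)
```

Changes that sit within 2*n lines of each other end up in the same Hunk, just as they share a hunk in unified_diff output. Passing n=0 keeps changes separated by at least one unchanged line in separate Hunks.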