Programmatically iterating over the chunks of a text diff

Question

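I am programming in Python, and I would like to know whether there is a simple way to iterate over (and parse) the individual chunks of a text diff produced by the diff utility.

Pseudo-Python code: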

textdiff = diff('file1', 'file2') # Pretend this is a way of producing a diff
for chunk in textdiff:
    print('These lines are deleted: ', chunk.deletions)
    ... # And do some other interesting stuff with the extracted contents of the chunk.

Answer 1

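Short answer

Use the SequenceMatcher class from the difflib library. This is the class that the difflib diff functions themselves use. Using it directly lets you iterate over the "chunks" of a diff without having to parse the output of any diff function.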

Example solution

Here is a sample script that illustrates the idea:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Iterate over the chunks of a diff between two files."""

from difflib import SequenceMatcher

def diff(a, b):
    """Iterate over the chunks of a diff between two files.
    
    Examples:

    Replace a single line:

    >>> for chunk in diff(["1\\n"], ["2\\n"]):
    ...     print(chunk)
    {'tag': 'replace', 'i1': 0, 'i2': 1, 'j1': 0, 'j2': 1, 'a': ['1\\n'], 'b': ['2\\n']}

    Delete a single line:

    >>> for chunk in diff(["1\\n"], []):
    ...     print(chunk)
    {'tag': 'delete', 'i1': 0, 'i2': 1, 'j1': 0, 'j2': 0, 'a': ['1\\n'], 'b': []}

    Insert a single line:

    >>> for chunk in diff([], ["2\\n"]):
    ...     print(chunk)
    {'tag': 'insert', 'i1': 0, 'i2': 0, 'j1': 0, 'j2': 1, 'a': [], 'b': ['2\\n']}
    """

    # Iterate over the groups of opcodes and yield the chunks as dicts
    for group in SequenceMatcher(a=a, b=b).get_grouped_opcodes(n=0):
        for tag, i1, i2, j1, j2 in group:
            if tag in ('replace', 'delete', 'insert'):
                patch = {
                    "tag": tag,
                    "i1": i1,
                    "i2": i2,
                    "j1": j1,
                    "j2": j2,
                    "a": a[i1:i2],
                    "b": b[j1:j2],
                }
                yield patch

if __name__ == "__main__":

    import sys

    # Get the contents of both files as lists of lines,
    # closing the file handles once they have been read
    with open(sys.argv[1]) as file_a, open(sys.argv[2]) as file_b:
        a = file_a.readlines()
        b = file_b.readlines()

    # Iterate over the chunks of the diff and print them
    for chunk in diff(a, b):
        print(chunk)
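
The pseudocode in the question accesses chunk.deletions. A minimal sketch of that interface could wrap the opcodes in a small class, as below. The Chunk dataclass and the iter_chunks helper are hypothetical names chosen for illustration and are not part of difflib; only SequenceMatcher.get_opcodes comes from the library.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterator, List


@dataclass
class Chunk:
    """Hypothetical container for one changed region (not a difflib type)."""
    tag: str               # 'replace', 'delete' or 'insert'
    deletions: List[str]   # lines present only in the first sequence
    insertions: List[str]  # lines present only in the second sequence


def iter_chunks(a: List[str], b: List[str]) -> Iterator[Chunk]:
    """Yield one Chunk per non-equal opcode reported by SequenceMatcher."""
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if tag != "equal":
            yield Chunk(tag, a[i1:i2], b[j1:j2])


if __name__ == "__main__":
    # Made-up sample data standing in for the lines of two files
    old = ["one\n", "two\n", "three\n"]
    new = ["one\n", "2\n", "three\n", "four\n"]
    for chunk in iter_chunks(old, new):
        print("These lines are deleted: ", chunk.deletions)
        print("These lines are inserted:", chunk.insertions)
```

A replace opcode fills both lists, while a pure delete or insert leaves one of them empty, so the caller can treat all three cases uniformly.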

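Additional notes

Here is the source code of the difflib library: https://github.com/python/cpython/blob/3.12/Lib/difflib.py

Looking at the difflib source code, its diff functions appear to rely on the SequenceMatcher class. This class provides a way to iterate over the "chunks" of a diff.

For example, consider the unified_diff function, which can be found here: https://github.com/python/cpython/blob/12a30bc1aa0586308bf3fe12c915bcc5e54a032f/Lib/difflib.py#L1095-L1161

Here is the implementation of unified_diff: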

def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
                 tofiledate='', n=3, lineterm='\n'):

    _check_types(a, b, fromfile, tofile, fromfiledate, tofiledate, lineterm)
    started = False
    for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
        if not started:
            started = True
            fromdate = '\t{}'.format(fromfiledate) if fromfiledate else ''
            todate = '\t{}'.format(tofiledate) if tofiledate else ''
            yield '--- {}{}{}'.format(fromfile, fromdate, lineterm)
            yield '+++ {}{}{}'.format(tofile, todate, lineterm)

        first, last = group[0], group[-1]
        file1_range = _format_range_unified(first[1], last[2])
        file2_range = _format_range_unified(first[3], last[4])
        yield '@@ -{} +{} @@{}'.format(file1_range, file2_range, lineterm)

        for tag, i1, i2, j1, j2 in group:
            if tag == 'equal':
                for line in a[i1:i2]:
                    yield ' ' + line
                continue
            if tag in {'replace', 'delete'}:
                for line in a[i1:i2]:
                    yield '-' + line
            if tag in {'replace', 'insert'}:
                for line in b[j1:j2]:
                    yield '+' + line

As you can see, it is the get_grouped_opcodes method of the SequenceMatcher class that iterates over the chunks of the diff.

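More precisely, the get_grouped_opcodes method returns a generator that yields groups of opcodes. Each group is a list of tuples, and each tuple describes how a block of one file relates to a block of the other. Each tuple has the following format: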

(tag, i1, i2, j1, j2)

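where tag identifies the relationship between the two blocks (tag is one of replace, delete, insert, or equal), i1 and i2 are indices that delimit the block in file a (that is, a[i1:i2]), and j1 and j2 are indices that delimit the corresponding block in file b (that is, b[j1:j2]).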
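
To make the grouping concrete, the following small sketch prints the groups that get_grouped_opcodes produces for two short lists of lines, using one line of context (n=1). The sample data is made up for illustration.

```python
from difflib import SequenceMatcher

# Made-up sample data: ten lines, then one line changed and one line removed
a = [f"line {k}\n" for k in range(10)]
b = list(a)
b[2] = "changed\n"  # becomes a 'replace' opcode
del b[8]            # becomes a 'delete' opcode

# Each group is a list of (tag, i1, i2, j1, j2) tuples; 'equal' tuples
# inside a group are the context lines surrounding the changes.
groups = list(SequenceMatcher(None, a, b).get_grouped_opcodes(n=1))

for number, group in enumerate(groups):
    print(f"group {number}:")
    for tag, i1, i2, j1, j2 in group:
        print(f"  {tag:7} a[{i1}:{i2}]={a[i1:i2]!r}  b[{j1}:{j2}]={b[j1:j2]!r}")
```

Because the two changes are separated by more than 2 * n unchanged lines, they end up in two separate groups, each with its own leading and trailing equal tuple.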

The unified_diff function produces its output in the unified diff format (https://en.wikipedia.org/wiki/Diff#Unified_format), but we can adapt this function to do other things.
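
As one possible adaptation, the sketch below keeps the loop structure of unified_diff but yields one structured record per hunk instead of formatted text. The iter_hunks name and the dictionary keys are hypothetical choices made for this example, not part of difflib.

```python
from difflib import SequenceMatcher


def iter_hunks(a, b, n=3):
    """Yield one dict per hunk, with n lines of context around the changes.

    Hypothetical helper: mirrors the loop structure of difflib.unified_diff
    but returns data instead of formatted text.
    """
    for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n):
        first, last = group[0], group[-1]
        hunk = {
            "a_range": (first[1], last[2]),  # slice of a covered by the hunk
            "b_range": (first[3], last[4]),  # slice of b covered by the hunk
            "context": [],
            "deleted": [],
            "inserted": [],
        }
        for tag, i1, i2, j1, j2 in group:
            if tag == "equal":
                hunk["context"].extend(a[i1:i2])
                continue
            if tag in {"replace", "delete"}:
                hunk["deleted"].extend(a[i1:i2])
            if tag in {"replace", "insert"}:
                hunk["inserted"].extend(b[j1:j2])
        yield hunk


if __name__ == "__main__":
    # Made-up sample data: one line replaced, one line inserted further down
    old = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n", "g\n", "h\n"]
    new = ["a\n", "B\n", "c\n", "d\n", "e\n", "f\n", "g\n", "x\n", "h\n"]
    for hunk in iter_hunks(old, new, n=1):
        print(hunk)
```

Keeping the ranges alongside the lines means a caller can still rebuild the "@@ ... @@" header later if textual output is needed.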
