Programmatically iterating over the chunks in a text diff


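I'm programming in Python, and I would like to know whether there is a simple way to iterate over (and parse) the individual chunks of a text diff as output by the diff utility.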

Pseudo-Python code:

textdiff = diff('file1', 'file2') # Pretend this is a way of producing a diff
for chunk in textdiff:
    print('These lines are deleted: ', chunk.deletions)
    ... # And do some other interesting stuff with the extracted contents of the chunk.

Tags: python, diff, patch

1 answer

Short answer

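Use the SequenceMatcher class from the difflib library. This is the class that the difflib diff functions themselves use. Working with it directly lets you iterate over the "chunks" of a diff without having to parse the output of any of the diff functions.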

Example solution

Here is an example script that illustrates the idea:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Iterate over the chunks of a diff between two files."""

from difflib import SequenceMatcher

def diff(a, b):
    """Iterate over the chunks of a diff between two files.
    
    Examples:

    Replace a single line:

    >>> for chunk in diff(["1\\n"], ["2\\n"]):
    ...     print(chunk)
    {'tag': 'replace', 'i1': 0, 'i2': 1, 'j1': 0, 'j2': 1, 'a': ['1\\n'], 'b': ['2\\n']}

    Delete a single line:

    >>> for chunk in diff(["1\\n"], []):
    ...     print(chunk)
    {'tag': 'delete', 'i1': 0, 'i2': 1, 'j1': 0, 'j2': 0, 'a': ['1\\n'], 'b': []}

    Insert a single line:

    >>> for chunk in diff([], ["2\\n"]):
    ...     print(chunk)
    {'tag': 'insert', 'i1': 0, 'i2': 0, 'j1': 0, 'j2': 1, 'a': [], 'b': ['2\\n']}
    """

    # Iterate over the groups of opcodes and yield the chunks as dicts
    for group in SequenceMatcher(a=a, b=b).get_grouped_opcodes(n=0):
        for tag, i1, i2, j1, j2 in group:
            if tag in ('replace', 'delete', 'insert'):
                patch = {
                    "tag": tag,
                    "i1": i1,
                    "i2": i2,
                    "j1": j1,
                    "j2": j2,
                    "a": a[i1:i2],
                    "b": b[j1:j2],
                }
                yield patch

if __name__ == "__main__":

    import sys

    # Read the first file as a list of lines; "with" closes the file afterwards
    with open(sys.argv[1]) as file_a:
        a = file_a.readlines()

    # Read the second file as a list of lines
    with open(sys.argv[2]) as file_b:
        b = file_b.readlines()

    # Iterate over the chunks of the diff and print them
    for chunk in diff(a, b):
        print(chunk)
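
The pseudocode in the question asks for chunk objects with a deletions attribute. The sketch below shows one way to get that shape. The Chunk class and the iter_chunks function are hypothetical names introduced here for illustration, not part of difflib; only SequenceMatcher and get_opcodes come from the library.

```python
from difflib import SequenceMatcher
from typing import Iterator, List, NamedTuple


class Chunk(NamedTuple):
    """One changed region of a diff (hypothetical helper, not a difflib type)."""
    tag: str               # 'replace', 'delete' or 'insert'
    deletions: List[str]   # lines removed from the first sequence
    insertions: List[str]  # lines added in the second sequence


def iter_chunks(a: List[str], b: List[str]) -> Iterator[Chunk]:
    """Yield a Chunk for every non-equal opcode between a and b."""
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if tag != "equal":
            yield Chunk(tag, a[i1:i2], b[j1:j2])


old = ["one\n", "two\n", "three\n"]
new = ["one\n", "2\n", "three\n", "four\n"]

chunks = list(iter_chunks(old, new))
for chunk in chunks:
    print("These lines are deleted: ", chunk.deletions)
    print("These lines are inserted:", chunk.insertions)
```

For these inputs the loop sees a replace chunk ("two" becomes "2") followed by an insert chunk ("four" is appended).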

Additional notes

Here is the source code of the difflib library: https://github.com/python/cpython/blob/3.12/Lib/difflib.py

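Looking at the difflib source code, its diff functions appear to rely on the SequenceMatcher class. This class provides methods for iterating over the "chunks" of a diff.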

For example, consider the unified_diff function, which can be found here: https://github.com/python/cpython/blob/12a30bc1aa0586308bf3fe12c915bcc5e54a032f/Lib/difflib.py#L1095-L1161

Here is the implementation of unified_diff (docstring omitted):

def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
                 tofiledate='', n=3, lineterm='\n'):

    _check_types(a, b, fromfile, tofile, fromfiledate, tofiledate, lineterm)
    started = False
    for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
        if not started:
            started = True
            fromdate = '\t{}'.format(fromfiledate) if fromfiledate else ''
            todate = '\t{}'.format(tofiledate) if tofiledate else ''
            yield '--- {}{}{}'.format(fromfile, fromdate, lineterm)
            yield '+++ {}{}{}'.format(tofile, todate, lineterm)

        first, last = group[0], group[-1]
        file1_range = _format_range_unified(first[1], last[2])
        file2_range = _format_range_unified(first[3], last[4])
        yield '@@ -{} +{} @@{}'.format(file1_range, file2_range, lineterm)

        for tag, i1, i2, j1, j2 in group:
            if tag == 'equal':
                for line in a[i1:i2]:
                    yield ' ' + line
                continue
            if tag in {'replace', 'delete'}:
                for line in a[i1:i2]:
                    yield '-' + line
            if tag in {'replace', 'insert'}:
                for line in b[j1:j2]:
                    yield '+' + line

As you can see, it is the get_grouped_opcodes method of the SequenceMatcher class that iterates over the chunks of the diff.

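More precisely, the get_grouped_opcodes method returns a generator of groups. Each group is a list of tuples, and each tuple relates a block of the first file to the corresponding block of the second file. Each tuple has the following format:

(tag, i1, i2, j1, j2)

where tag identifies the relationship between the two blocks (one of replace, delete, insert, or equal), i1 and i2 are indices identifying the block in file a (the slice a[i1:i2]), and j1 and j2 are indices identifying the corresponding block in file b (the slice b[j1:j2]).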
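
To make the tuple format concrete, here is a small sketch that prints the groups for two short line lists with one line of context (n=1). The sample data is made up for illustration; the calls are the ones from difflib discussed above.

```python
from difflib import SequenceMatcher

# Two made-up 8-line "files" that differ on the 2nd and 7th lines
a = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n", "g\n", "h\n"]
b = ["a\n", "B\n", "c\n", "d\n", "e\n", "f\n", "G\n", "h\n"]

matcher = SequenceMatcher(None, a, b)

# n=1 keeps at most one unchanged line of context around each change
groups = list(matcher.get_grouped_opcodes(n=1))

for number, group in enumerate(groups, start=1):
    print(f"group {number}")
    for tag, i1, i2, j1, j2 in group:
        print(f"  {tag:<7} a[{i1}:{i2}]={a[i1:i2]!r}  b[{j1}:{j2}]={b[j1:j2]!r}")
```

The two changes are far enough apart that they land in separate groups, each consisting of an equal tuple, a replace tuple, and another equal tuple. With a larger n the context ranges would meet and the two changes would merge into a single group.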

The unified_diff function produces its output in the unified diff format (https://en.wikipedia.org/wiki/Diff#Unified_format), but we can adapt this function to do other things.
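
As one possible adaptation, the sketch below keeps the loop structure of unified_diff but yields a dict per hunk instead of formatted text lines. The function name iter_hunks and the dict keys are assumptions made for this example, not a difflib API.

```python
from difflib import SequenceMatcher


def iter_hunks(a, b, n=3):
    """Yield one dict per hunk, mirroring the loop in unified_diff.

    Hypothetical helper: the key names are our own choice.
    """
    for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n):
        first, last = group[0], group[-1]
        hunk = {
            "a_range": (first[1], last[2]),  # slice of a covered by the hunk
            "b_range": (first[3], last[4]),  # slice of b covered by the hunk
            "context": [],     # unchanged lines inside the hunk
            "deletions": [],   # lines present only in a
            "insertions": [],  # lines present only in b
        }
        for tag, i1, i2, j1, j2 in group:
            if tag == "equal":
                hunk["context"].extend(a[i1:i2])
                continue
            if tag in {"replace", "delete"}:
                hunk["deletions"].extend(a[i1:i2])
            if tag in {"replace", "insert"}:
                hunk["insertions"].extend(b[j1:j2])
        yield hunk


old = ["a\n", "b\n", "c\n"]
new = ["a\n", "x\n", "c\n"]

hunks = list(iter_hunks(old, new, n=1))
for hunk in hunks:
    print(hunk["a_range"], hunk["deletions"], hunk["insertions"])
```

Because the context lines are collected separately, a caller can ignore them or use them to locate the hunk, depending on what it needs.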
