In Python, is there a concise way to check whether two text files have the same contents?

Question (votes: 0, answers: 10)

I don't care what the differences are. I just want to know whether the contents differ.

python file compare
10 answers
95
votes

The low-level way:

from __future__ import with_statement
with open(filename1) as f1:
   with open(filename2) as f2:
      if f1.read() == f2.read():
         ...

The high-level way:

import filecmp
if filecmp.cmp(filename1, filename2, shallow=False):
   ...
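
The `shallow=False` argument is what forces a content comparison. With the default `shallow=True`, `filecmp.cmp` treats two files as equal when their `os.stat()` signatures (file type, size, modification time) match, without reading them. A minimal sketch, assuming three throwaway files in a temporary directory (the file names and the `write_file` helper are made up for illustration):

```python
import filecmp
import os
import tempfile


def write_file(path, data):
    """Hypothetical helper: write raw bytes to a file."""
    with open(path, "wb") as fh:
        fh.write(data)


with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    c = os.path.join(tmp, "c.txt")
    write_file(a, b"hello\n")
    write_file(b, b"hello\n")
    write_file(c, b"jello\n")  # same size as a.txt, different content

    # shallow=False makes filecmp read and compare the actual bytes
    same_ab = filecmp.cmp(a, b, shallow=False)
    same_ac = filecmp.cmp(a, c, shallow=False)
    print(same_ab, same_ac)
```

Here `a.txt` and `b.txt` compare equal, while `a.txt` and `c.txt` do not, even though their sizes match.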

35
votes

If you care about even basic efficiency, you probably want to check the file sizes first:

import os
if os.path.getsize(filename1) == os.path.getsize(filename2):
  with open(filename1, 'r') as f1, open(filename2, 'r') as f2:
    if f1.read() == f2.read():
      ...  # Files are the same.

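This saves you from reading every line of two files that are not even the same size and therefore cannot be identical.

(Going even further, you could call out to a fast MD5sum of each file and compare the results, but that is not "in Python", so I'll stop here.)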


14
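votes

Here is a functional-style file comparison function. It returns False immediately if the files differ in size; otherwise, it reads the files in 4 KiB chunks and returns False as soon as it finds the first difference: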

from __future__ import with_statement
import os
import itertools, functools, operator
try:
    izip= itertools.izip  # Python 2
except AttributeError:
    izip= zip  # Python 3

def filecmp(filename1, filename2):
    "Do the two files have exactly the same contents?"
    with open(filename1, "rb") as fp1, open(filename2, "rb") as fp2:
        if os.fstat(fp1.fileno()).st_size != os.fstat(fp2.fileno()).st_size:
            return False # different sizes ∴ not equal

        # set up one 4k-reader for each file
        fp1_reader= functools.partial(fp1.read, 4096)
        fp2_reader= functools.partial(fp2.read, 4096)

        # pair each 4k-chunk from the two readers while they do not return b'' (EOF)
        cmp_pairs= izip(iter(fp1_reader, b''), iter(fp2_reader, b''))

        # return True for all pairs that are not equal
        inequalities= itertools.starmap(operator.ne, cmp_pairs)

        # voilà; any() stops at first True value
        return not any(inequalities)

if __name__ == "__main__":
    import sys
    print(filecmp(sys.argv[1], sys.argv[2]))

Just a different take :)


6
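votes

Since I can't comment on other people's answers, I'll write my own.

If you use md5, you definitely should not just call md5.update(f.read()), because that reads the whole file into memory at once. Read it in chunks instead: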

import hashlib

# f is a file object opened in binary mode ("rb")
def get_file_md5(f, chunk_size=8192):
    h = hashlib.md5()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()

4
votes

I would use MD5 to hash the files' contents.

import hashlib

def checksum(f):
    md5 = hashlib.md5()
    with open(f, "rb") as fh:  # hashlib needs bytes, so read in binary mode
        md5.update(fh.read())
    return md5.hexdigest()

def is_contents_same(f1, f2):
    return checksum(f1) == checksum(f2)

if not is_contents_same('foo.txt', 'bar.txt'):
    print('The contents are not the same!')

2
votes

with open(filename1, "r") as f1, open(filename2, "r") as f2:
    print(f1.read() == f2.read())



1
vote

For larger files, you could compute an MD5 or SHA hash of each file.
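
As a rough illustration of that idea, the sketch below hashes each file in fixed-size chunks so that memory use stays flat regardless of file size. The names `hash_file` and `same_digest`, the choice of SHA-256, and the 64 KiB chunk size are assumptions made for this example, not part of the answer above.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read() until it returns b"" (EOF)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_digest(path1, path2, algorithm="sha256"):
    """True if both files produce the same digest."""
    return hash_file(path1, algorithm) == hash_file(path2, algorithm)
```

Note that hashing always reads both files to the end, so for a one-off comparison of two local files a direct chunk-by-chunk comparison can stop earlier. Digests are more useful when they can be stored and reused, or when only one of the files is at hand.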


1
vote

filename1 = "G:\\test1.TXT"
filename2 = "G:\\test2.TXT"

with open(filename1) as f1, open(filename2) as f2:
    file1list = f1.read().splitlines()
    file2list = f2.read().splitlines()

if len(file1list) == len(file2list):
    # Same number of lines: report each pair of lines as equal or not
    for line1, line2 in zip(file1list, file2list):
        if line1 == line2:
            print(line1 + "==" + line2)
        else:
            print(line1 + "!=" + line2 + " Not-Equal")
else:
    print("The files have a different number of lines")

0
votes

A simple and efficient solution:

import os


def is_file_content_equal(
    file_path_1: str, file_path_2: str, buffer_size: int = 1024 * 8
) -> bool:
    """Checks if two files content is equal
    Arguments:
        file_path_1 (str): Path to the first file
        file_path_2 (str): Path to the second file
        buffer_size (int): Size of the buffer to read the file
    Returns:
        bool that indicates if the file contents are equal
    Example:
        >>> is_file_content_equal("filecomp.py", "filecomp copy.py")
            True
        >>> is_file_content_equal("filecomp.py", "diagram.dio")
            False
    """
    # First check sizes
    s1, s2 = os.path.getsize(file_path_1), os.path.getsize(file_path_2)
    if s1 != s2:
        return False
    # If the sizes are the same check the content
    with open(file_path_1, "rb") as fp1, open(file_path_2, "rb") as fp2:
        while True:
            b1 = fp1.read(buffer_size)
            b2 = fp2.read(buffer_size)
            if b1 != b2:
                return False
            # if the content is the same and they are both empty bytes
            # the file is the same
            if not b1:
                return True

0
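votes

filecmp is great as a simple solution, but it does not let you print the line numbers or the differences between the files.

Here is a simple yet effective solution that is more flexible: it can print the comparison status, the line number, and the contents of the lines that differ: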

def compare_with_line_diff(filename1, filename2):
    with open(filename1, "r") as file1, open(filename2, "r") as file2:

        # Loop for all lines in first file (keep only 2 lines in memory)
        for line_num, f1_line in enumerate(file1, start=1):

            # Only print status for range of lines
            if (line_num == 1 or line_num % 1000 == 0):
                print(f"comparing lines {line_num} to {line_num + 1000}")

            # Compare with next line of file2
            f2_line = file2.readline()
            if (f1_line != f2_line):
                print(f"Difference on line: {line_num}")
                print(f"f1_line: '{f1_line}'")
                print(f"f2_line: '{f2_line}'")
                return False

        # Check if file2 has more lines than file1
        for extra_line in file2:
            print(f"Difference on file2: {extra_line}")
            return False

    # Files are equal
    return True

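You could add checks for file size, hashes, and so on. However, the approach above is useful when you want to find the first differing line (regardless of file size).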
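
If only the location of the first difference is needed, without the progress output, the same idea can be sketched with `itertools.zip_longest`, which also handles one file having extra lines. The function name `first_difference` and its return shape are hypothetical choices for this example.

```python
from itertools import zip_longest


def first_difference(filename1, filename2):
    """Return (line_number, line1, line2) for the first differing line, or None.

    A missing line (one file is shorter) is reported as None.
    """
    with open(filename1, "r") as file1, open(filename2, "r") as file2:
        # zip_longest pads the shorter file with None instead of stopping early
        pairs = zip_longest(file1, file2)
        for line_num, (line1, line2) in enumerate(pairs, start=1):
            if line1 != line2:
                return line_num, line1, line2
    return None
```

Like the function above, this keeps only one line from each file in memory at a time, so it should remain usable on large files.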
