Compute MD5 while writing a million-row CSV file, without reading it back into memory

Question

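Computing an MD5 requires the bytes to stream through the hash. I assume it is possible to intercept the output of csv.writer as a stream of bytes while a million rows are being written. In the Python code below, a million rows are written; how can I compute the MD5 without reading the file back into memory just to hash it?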

def query2csv(connection, fileUri, sqlQuery, args):
    import csv
    tocsvfile = open(fileUri, 'w+')
    writer = csv.writer(tocsvfile, delimiter=',', quotechar='"') # , quoting=csv.QUOTE_MINIMAL
    #As a huge blob goes into writer, pass through, md5 how?
    # I do not want to read the huge file through memory just to compute md5
    with connection.cursor() as cur:
        cur.execute(sqlQuery, args)
        column_names = list(map(lambda x: x[0], cur.description))
        writer.writerow(column_names)
        writer.writerows(__batch_rows(cur))

Answer 1

From the documentation for csv.writer (emphasis mine):

csv.writer(csvfile, dialect='excel', **fmtparams)
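Return a writer object responsible for converting the user's data into delimited strings on the given file-like object. csvfile can be any object with a write() method.
If csvfile is a file object, it should be opened with newline=''.
So we can intercept the calls to .write() and feed the data into the MD5 stream while also passing it on to the real file. The cleanest way is to define a class whose write method simply calls a number of functions (here, one for the MD5 stream and one for the file object):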

import csv
import hashlib

class WriterTee:
    def __init__(self, *outs):
        self.outs = outs

    def write(self, s):
        for f in self.outs:
            f(s)

def query2csv(connection, fileUri, sqlQuery, args):
    md5 = hashlib.md5()

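    # Use an explicit encoding so the bytes written to disk match the bytes hashed below.
    with open(fileUri, 'w+', newline='', encoding='utf-8') as tocsvfile, connection.cursor() as cur:
        tee = WriterTee(
            tocsvfile.write,
            lambda s: md5.update(s.encode('utf-8'))
        )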

        writer = csv.writer(tee, delimiter=',', quotechar='"')

        cur.execute(sqlQuery, args)
        column_names = list(map(lambda x: x[0], cur.description))
        writer.writerow(column_names)
        writer.writerows(__batch_rows(cur))

    return md5.hexdigest()
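I made a few other changes: both resources are now managed in the with block, and the file is opened with newline='' as the documentation recommends.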
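
To check that the streamed digest really matches the file on disk, here is a minimal self-contained sketch that needs no database. It reuses the WriterTee idea from above; write_rows_with_digest, file_digest_in_chunks, and demo are hypothetical helper names introduced only for this illustration, and the sample rows are made up. It assumes the file is opened with the same encoding that is used when feeding the hash.

```python
import csv
import hashlib
import os
import tempfile


class WriterTee:
    """File-like object that forwards every write() to several callables."""

    def __init__(self, *outs):
        self.outs = outs

    def write(self, s):
        for f in self.outs:
            f(s)


def write_rows_with_digest(path, rows, encoding="utf-8"):
    """Write rows to a CSV file and return the MD5 hex digest of its bytes."""
    md5 = hashlib.md5()
    with open(path, "w", newline="", encoding=encoding) as fh:
        # The hash sees the same text as the file, encoded the same way.
        tee = WriterTee(fh.write, lambda s: md5.update(s.encode(encoding)))
        csv.writer(tee).writerows(rows)
    return md5.hexdigest()


def file_digest_in_chunks(path, chunk_size=65536):
    """Hash an existing file chunk by chunk (used here only as a cross-check)."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def demo():
    """Return True if the streamed digest equals the digest of the file on disk."""
    rows = [["id", "name"], [1, "Zoë"], [2, 'quote "me"']]
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        streamed = write_rows_with_digest(path, rows)
        return streamed == file_digest_in_chunks(path)
    finally:
        os.remove(path)
```

The cross-check reads the file back only to validate the approach; in production code the streamed digest alone is enough.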

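By the way, I recommend against choosing MD5 for any purpose. MD5 is insecure, and cryptographers have been recommending against its use since 1996. Even if you don't think security properties are relevant to your application, there is no downside to using a secure hash algorithm, and the hashlib API is the same for whichever algorithm you choose.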
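
As a rough illustration of how little changes when the algorithm is swapped, the sketch below takes the algorithm name as a parameter and defaults to SHA-256. HashingWriter and csv_digest are hypothetical names for this example only, and an in-memory io.StringIO stands in for the real file so the snippet runs on its own.

```python
import csv
import hashlib
import io


class HashingWriter:
    """Hypothetical wrapper: forwards text to a file-like object and hashes it."""

    def __init__(self, target, algorithm="sha256", encoding="utf-8"):
        self.target = target
        self.encoding = encoding
        # hashlib.new() accepts the algorithm name, e.g. "sha256" or "md5".
        self.hasher = hashlib.new(algorithm)

    def write(self, s):
        self.hasher.update(s.encode(self.encoding))
        return self.target.write(s)

    def hexdigest(self):
        return self.hasher.hexdigest()


def csv_digest(rows, algorithm="sha256"):
    """Write rows as CSV into a buffer; return (hex digest, CSV text)."""
    buffer = io.StringIO(newline="")
    out = HashingWriter(buffer, algorithm)
    csv.writer(out).writerows(rows)
    return out.hexdigest(), buffer.getvalue()
```

Calling csv_digest(rows, "md5") instead would produce an MD5 digest with no other code changes, which is the point made above about the hashlib API being uniform.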
