Python enumerate() tqdm progress bar when reading a file?

Question

I can't see the tqdm progress bar when I iterate over an opened file with this code:

with open(file_path, 'r') as f:
    for i, line in enumerate(tqdm(f)):
        if i >= start and i <= end:
            print("line #: %s" % i)
            for i in tqdm(range(0, line_size, batch_size)):
                # pause if a file named pause is found in the current dir
                re_batch = {}
                for j in range(batch_size):
                    re_batch[j] = re.search(line, last_span)

What is the right way to use tqdm here?

Answer 1

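You're on the right track. You are using tqdm correctly, but you should stop printing every line inside the loop while tqdm is active. You also need to use tqdm on the first for loop only, not on the others, like so: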

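import re
from tqdm import tqdm

with open(file_path, 'r') as f:
    # Wrap only the outer loop; nested bars and prints break the display.
    for i, line in enumerate(tqdm(f)):
        if start <= i <= end:
            # Use a separate name so the inner loop does not overwrite i.
            for offset in range(0, line_size, batch_size):
                # pause if a file named pause is found in the current dir
                re_batch = {}
                for j in range(batch_size):
                    re_batch[j] = re.search(line, last_span)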

The original answer linked to some notes on using enumerate together with tqdm.
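
If some per-line output is still needed, the advice above can be sketched roughly as follows. This is only an illustration: `process_file` is a hypothetical helper, not something from the question, and `start` and `end` are assumed to be the line bounds the asker uses. It relies on `tqdm.write`, which prints a message above the bar instead of interleaving with it.

```python
from tqdm import tqdm


def process_file(file_path, start, end):
    """Iterate over a file with one outer progress bar (hypothetical helper)."""
    selected = []
    with open(file_path, 'r') as f:
        # Without total=, tqdm shows a running count and rate, not a percentage.
        for i, line in enumerate(tqdm(f, unit='line')):
            if start <= i <= end:
                # tqdm.write prints above the bar instead of corrupting it.
                tqdm.write("line #: %s" % i)
                selected.append(line)
    return selected
```

Wrapping the file object (rather than the enumerate object) keeps tqdm in charge of the iteration, and enumerate simply numbers whatever tqdm yields.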

Answer 2

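I ran into this problem as well: tqdm does not show a progress bar because the number of lines in the file object has not been provided. The for loop iterates over the lines, reading until it hits the next newline character.

To get a progress bar from tqdm, you first need to scan the file and count the lines, then pass that count to tqdm as total: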

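from tqdm import tqdm

# First pass: count the lines so tqdm knows the total.
with open('myfile.txt', 'r') as f:
    num_lines = sum(1 for _ in f)

# Second pass: iterate with a progress bar.
with open('myfile.txt', 'r') as f:
    for line in tqdm(f, total=num_lines):
        print(line)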
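
The counting pass does not need to decode text at all. A minimal sketch, assuming a hypothetical helper named `count_lines` and an arbitrary 1 MiB chunk size, counts newline bytes in binary mode. It has not been benchmarked here, and a final line without a trailing newline is not counted.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline characters by reading the file in binary chunks."""
    count = 0
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when read() returns b'' at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            count += chunk.count(b'\n')
    return count
```

The result can then be passed along as `tqdm(f, total=count_lines('myfile.txt'))`.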

Answer 3

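I'm trying to do the same thing with a file containing all Wikipedia articles, so I don't want to count the total number of lines before I start processing. It is also a bz2-compressed file, so the len of a decompressed line overestimates the number of bytes read in that iteration, so...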

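import bz2
from pathlib import Path
from tqdm import tqdm

with tqdm(total=Path(filepath).stat().st_size) as pbar:
    # Open the raw file ourselves so we can ask for the compressed offset;
    # fin.tell() would report the position in the decompressed stream.
    with open(filepath, 'rb') as raw, bz2.open(raw) as fin:
        for i, line in enumerate(fin):
            if not i % 1000:
                pbar.update(raw.tell() - pbar.n)
            # do something with the decompressed line
        pbar.update(raw.tell() - pbar.n)  # account for the last lines
    # Debug-by-print to see the attributes of `pbar`:
    # print(vars(pbar))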

Thanks to Yohan Kuanke for the now-deleted answer. If a moderator undeletes it, feel free to crib from mine.
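
One detail worth checking when adapting this approach: as far as I know, tell() on a bz2 file object reports the offset in the decompressed stream, while the progress total here is the compressed size on disk. The sketch below builds a throwaway file (the name `demo.bz2` and the payload are made up for the demonstration) and compares the two offsets after reading everything.

```python
import bz2
import os
import tempfile

# Build a small, highly compressible bz2 file for the demonstration.
payload = b"some repetitive line of text\n" * 10000
path = os.path.join(tempfile.mkdtemp(), "demo.bz2")
with bz2.open(path, "wb") as out:
    out.write(payload)

with open(path, "rb") as raw, bz2.open(raw) as fin:
    for line in fin:
        pass  # consume every decompressed line
    decompressed_pos = fin.tell()  # offset in the decompressed stream
    compressed_pos = raw.tell()    # offset in the compressed file on disk

compressed_size = os.path.getsize(path)
```

Only the raw file's offset is comparable to the size returned by stat(), which is why a byte-based bar for a compressed file should follow the underlying file object.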

Answer 4

If you are reading a very large file, try this approach:

from tqdm import tqdm
import os
import sys

file_size = os.path.getsize(filename)  # total is measured in bytes
lines_read = []
pbar = tqdm(total=file_size, unit="B", unit_scale=True)
with open(filename, 'r', encoding='UTF-8') as file:
    while (line := file.readline()):
        lines_read.append(line)
        # Approximate the bytes consumed; see the edit note below.
        pbar.update(sys.getsizeof(line) - sys.getsizeof('\n'))
pbar.close()

I have left out whatever processing you might want to do before append(line).

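Edit:

I changed len(line) to sys.getsizeof(line)-sys.getsizeof('\n'), because len(line) does not accurately represent how many bytes were actually read (see the other posts about this). Even this is not 100% accurate, since sys.getsizeof(line) is not the actual number of bytes read, but for a very large file it is a "close enough" hack.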

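I did try using f.tell() instead and subtracting the file position delta inside the while loop, but f.tell() on a file opened in non-binary mode is very slow in Python 3.8.10.

Following the link below, I also tried f.tell() with Python 3.10, but it was still slow.

If anyone has a better strategy, feel free to edit this answer, but please provide some performance numbers before editing. Keep in mind that for very large files, counting the lines before running the loop is not acceptable and completely defeats the purpose of showing a progress bar (try, for example, a 30 GB file with 300 million lines).

Why f.tell() is slow in Python when reading a file in non-binary mode: https://bugs.python.org/issue11114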
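
One possible alternative, offered only as an unbenchmarked sketch: open the file in binary mode, where the length of each line is the exact number of bytes read, and decode each line yourself. `read_lines_with_progress` is a hypothetical helper name, and the UTF-8 default mirrors the encoding used above. Note that binary mode does no newline translation, so '\r\n' endings stay in the decoded lines.

```python
import os
from tqdm import tqdm


def read_lines_with_progress(filename, encoding='UTF-8'):
    """Yield decoded lines while advancing a byte-based progress bar."""
    file_size = os.path.getsize(filename)
    with tqdm(total=file_size, unit='B', unit_scale=True) as pbar:
        with open(filename, 'rb') as f:
            for raw_line in f:
                # len() of a bytes object is the exact number of bytes read.
                pbar.update(len(raw_line))
                yield raw_line.decode(encoding)
```

Usage would look like `for line in read_lines_with_progress(filename): ...`, with the per-line processing placed in the loop body. This avoids both tell() on a text file and the sys.getsizeof approximation.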

Answer 5

If you are reading the file with readlines(), you can use the following:

from tqdm import tqdm
with open(filename) as f:
    sentences = tqdm(f.readlines(),unit='MB')

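unit='MB' can be changed to 'B', 'KB' or 'GB' as appropriate. Note that unit is only a display label: tqdm counts one iteration per line here, not bytes.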
