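I'm using a Python-based AI to convert scanned documents to text. I have to process 200k files, but after about 25k files my OS killed the Python script because it ran out of memory (OOM). Now I want to run the script again, but exclude all the files I have already processed. Here is a sample of the code I wrote to find the files: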
import os
import sys
from pathlib import Path
import itertools

companyfolder = sys.argv[1]
companypath = "/home/user/download/" + companyfolder
outputpath = "/home/user/output/" + companyfolder + "/OCR"
errorpath = "/home/user/output/" + companyfolder

# run OCR loop
for file in itertools.chain(
    Path(companypath).rglob("*.jpeg"),
    Path(companypath).rglob("*.JPEG"),
    Path(companypath).rglob("*.jpg"),
    Path(companypath).rglob("*.JPG"),
    Path(companypath).rglob("*.png"),
    Path(companypath).rglob("*.PNG"),
    Path(companypath).rglob("*.tif"),
    Path(companypath).rglob("*.TIF"),
    Path(companypath).rglob("*.tiff"),
    Path(companypath).rglob("*.TIFF"),
    Path(companypath).rglob("*.bmp"),
    Path(companypath).rglob("*.BMP"),
    Path(companypath).rglob("*.pdf"),
    Path(companypath).rglob("*.PDF"),
):
    try:
        # make dirs and file path
        print(file)
        # x more commands here and below
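I have a list of the files that were already processed. Now I want to exclude this list from the glob results so that I don't process those files again. To make the names match, I removed the suffixes, because my input and output suffixes differ. Here are a few entries from the list of files I want to exclude: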
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/processed-450ce329-1f42-4e13-bf0f-db9e2ee33103_WGNv7mVl
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/processed-db89fa50-7bba-4cf4-b898-c6839e2294be_vbsnLO4H
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/Screenshot_20220826-142620_Office
/home/user/output/Google/OCR/Desktop/Desktop/Scans/2022-04-25 11_07_09-Window
/home/user/output/Google/OCR/Desktop/Desktop/SCANS SPENDINGS - INCOMING INVOICES/Q2 2022/january/list [Q2 2022]
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5736
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5957
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5761
I hope someone can show me how to do this.
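Given that you already know which files were processed, you are practically there.
All you really need to do is put that list of files into an iterable (a list, dict, or set) and then check, for each file, whether it has already been processed. A set is the natural fit here, because membership tests on a set stay fast even with tens of thousands of entries.
For example: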
# with a set (for the remaining files)
processed_files = {'file1.ext', 'file2.ext'}
set_of_files = {'file1.ext', 'file2.ext', 'file3.ext'}
for file in set_of_files:
    if file in processed_files:
        print(f' {file} does not need to be processed')
    else:
        print(f' {file} needs to be processed')
This produces the following (the order can vary, since sets are unordered):
file2.ext does not need to be processed
file1.ext does not need to be processed
file3.ext needs to be processed
Or this:
# with a dict (if you know which are processed)
dict_of_files = {
    'file1.ext': 'processed',
    'file2.ext': 'processed',
    'file3.ext': 'failed',
}
for k, v in dict_of_files.items():
    if v == 'failed':
        print(f'process file {k}')
    else:
        print(f'{k} was already processed')
This produces:
file1.ext was already processed
file2.ext was already processed
process file file3.ext
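
Applied to the paths in the question, the same set-based check could look roughly like the sketch below. It assumes that the output tree under .../OCR mirrors the input tree under /home/user/download/<company>, which is what the sample paths suggest but is not confirmed. The function names build_processed_keys and needs_processing are made up for illustration. Both sides are reduced to a path relative to their root with no suffix, so an input such as IMG_5736.JPG matches the processed entry IMG_5736.

```python
from pathlib import Path


def build_processed_keys(lines, output_root):
    """Turn suffix-less output paths into keys relative to output_root.

    `lines` is any iterable of strings, e.g. the lines of the list of
    processed files. Blank lines are skipped.
    """
    keys = set()
    for line in lines:
        line = line.strip()
        if line:
            # Assumption: every entry lives under output_root.
            keys.add(Path(line).relative_to(output_root).as_posix())
    return keys


def needs_processing(file, input_root, processed_keys):
    """Return True if `file` has no matching entry in processed_keys."""
    # Drop the input suffix so the key matches the suffix-less output entry.
    key = file.relative_to(input_root).with_suffix("").as_posix()
    return key not in processed_keys


# Hypothetical roots for the "Google" company folder from the question.
input_root = Path("/home/user/download/Google")
output_root = Path("/home/user/output/Google/OCR")

processed_keys = build_processed_keys(
    ["/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5736"],
    output_root,
)

candidate = input_root / "Desktop/Desktop/Fotoshoot office/IMG_5736.JPG"
if needs_processing(candidate, input_root, processed_keys):
    print(f"process file {candidate}")
else:
    print(f"{candidate} was already processed")
```

Inside the existing loop, the check would go at the top of the body, followed by `continue` when the file is already done. One caveat: because suffixes are dropped, two inputs that differ only by extension (say IMG_1.jpg and IMG_1.png in the same folder) map to the same key, so the second would be skipped once the first is processed.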