Exclude a list of files/patterns from a pathlib glob

Question

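I am using a Python-based AI to convert scanned documents to text. I have to process 200k files, but after about 25k files my operating system killed the Python script because of OOM (out of memory). Now I want to run the script again, but exclude all the files I have already processed. Here is a sample of the code I use to find the files: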

import os
import sys
from pathlib import Path
import itertools

companyfolder = sys.argv[1]
companypath = ("/home/user/download/" + companyfolder)
outputpath = ("/home/user/output/" + companyfolder + "/OCR")
errorpath = ("/home/user/output/" + companyfolder)

# run OCR loop
for file in itertools.chain(
    Path(companypath).rglob("*.jpeg"),
    Path(companypath).rglob("*.JPEG"),
    Path(companypath).rglob("*.jpg"),
    Path(companypath).rglob("*.JPG"),
    Path(companypath).rglob("*.png"),
    Path(companypath).rglob("*.PNG"),
    Path(companypath).rglob("*.tif"),
    Path(companypath).rglob("*.TIF"),
    Path(companypath).rglob("*.tiff"),
    Path(companypath).rglob("*.TIFF"),
    Path(companypath).rglob("*.bmp"),
    Path(companypath).rglob("*.BMP"),
    Path(companypath).rglob("*.pdf"),
    Path(companypath).rglob("*.PDF"),
):
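    try:
        # make dirs and file path
        print(file)
        # ... more commands here and below (elided in the original post)
    except Exception as exc:  # placeholder handler; the real one is not shown
        print(f"failed on {file}: {exc}")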

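I have a list of the files that were processed. Now I want to exclude this list from the glob so that I don't process files I have already handled. To make matching possible, I removed the suffixes, because my input and output suffixes differ. Below are a few examples from the list of files I want to exclude: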

/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/processed-450ce329-1f42-4e13-bf0f-db9e2ee33103_WGNv7mVl
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/processed-db89fa50-7bba-4cf4-b898-c6839e2294be_vbsnLO4H
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/Screenshot_20220826-142620_Office
/home/user/output/Google/OCR/Desktop/Desktop/Scans/2022-04-25 11_07_09-Window
/home/user/output/Google/OCR/Desktop/Desktop/SCANS SPENDINGS - INCOMING INVOICES/Q2 2022/january/list [Q2 2022]
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5736
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5957
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5761

I hope someone can show me how to do this.

Answer 1

Given that you already know which files were processed, you are actually almost there.

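All you really need to do is put the list of processed files into an iterable (a list, dict, or set) and then check, for each file, whether it has already been processed.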

An example looks like this:

# with a set (for the remaining files)
processed_files = {'file1.ext', 'file2.ext'}
set_of_files = {'file1.ext', 'file2.ext', 'file3.ext'}

for file in set_of_files:
    if file in processed_files:
        print(f' {file} does not need to be processed')
    else:
        print(f' {file} needs to be processed')

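This produces the following (set iteration order is not guaranteed, so the lines may appear in a different order):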

 file2.ext does not need to be processed
 file1.ext does not need to be processed
 file3.ext needs to be processed

Or this:

# with a dict (if you know which are processed)
dict_of_files = {
    'file1.ext':'processed', 
    'file2.ext':'processed', 
    'file3.ext':'failed'
    }

for k, v in dict_of_files.items():
    if v == 'failed': print(f'process file {k}')
    else: print(f'{k} was already processed')

This produces:

file1.ext was already processed
file2.ext was already processed
process file file3.ext
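Applied to the paths in the question, the same membership check could look roughly like the sketch below. It rests on assumptions that are not confirmed by the original post: the output tree under `.../OCR` mirrors the input tree, and the processed list is stored one suffix-less path per line in a text file. The helper names `load_processed` and `pending_files` are hypothetical.

```python
from pathlib import Path

# Suffixes from the question's glob list, compared in lowercase.
# Note: unlike the original chain of rglob() calls, this also matches
# mixed-case suffixes such as ".Jpg".
SUFFIXES = {".jpeg", ".jpg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}


def load_processed(list_file: Path, output_root: Path) -> set:
    """Read one suffix-less output path per line; return keys relative to output_root."""
    keys = set()
    for line in list_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            # Assumption: every listed path lives under output_root.
            keys.add(Path(line).relative_to(output_root).as_posix())
    return keys


def pending_files(input_root: Path, processed: set):
    """Yield input files with a wanted suffix that are not in the processed set."""
    for path in input_root.rglob("*"):
        if not path.is_file() or path.suffix.lower() not in SUFFIXES:
            continue
        # Assumption: the output tree mirrors the input tree, so the
        # suffix-less relative path is a usable key on both sides.
        key = path.relative_to(input_root).with_suffix("").as_posix()
        if key not in processed:
            yield path
```

In the question's script this might be used as `done = load_processed(Path("processed.txt"), Path(outputpath))` followed by `for file in pending_files(Path(companypath), done):`, where `processed.txt` is a made-up file name. A set is used because each membership test takes constant time on average, which matters with 200k files to check.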
