ThreadPoolExecutor task for bulk-downloading files only downloads the last file in the list


I have a file named images.txt that is simply a list of image URLs, one per line:

https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/Glenn_Jacobs_%2853122237030%29_-_Cropped.jpg/440px-Glenn_Jacobs_%2853122237030%29_-_Cropped.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Kane2003.jpg/340px-Kane2003.jpg
https://upload.wikimedia.org/wikipedia/commons/7/7a/Steel_Cage.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Kane_2008.JPG/340px-Kane_2008.JPG
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Brothers_of_Destruction.jpg/440px-Brothers_of_Destruction.jpg

I want to start a thread pool executor with 20 worker threads that downloads each image, each in its own thread, into a local subdirectory named "images". Below is the code I am trying. The problem: it prints that every image is being downloaded, but in the end only one image, the last one in the list, actually gets downloaded; the rest never do.

from os import makedirs
from os.path import basename
from os.path import join
import shutil
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

import requests

# download a file from a URL and save it into the given directory
def download_url(urlpath, dir):
    # Set the headers for the request
    headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9"
    }

    r = requests.get(urlpath, headers=headers, stream=True)
    filename = basename(urlpath)
    outpath = join(dir, filename)
    if r.status_code == 200:
        with open(outpath, 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)  
 
# download one file to a local directory
def download_url_to_file(link, path):
    download_url(link, path)
    return link
 
# download all files listed in the provided text file to the provided path
def getInBulk(filePath, path):
    # create a local directory to save files
    makedirs(path, exist_ok=True)
    # read the list of URLs, one per line
    links = open(filePath).readlines()
    # report progress
    print(f'Found {len(links)} links')
    # create the pool of worker threads
    with ThreadPoolExecutor(max_workers=20) as exe:
        # dispatch all download tasks to worker threads
        futures = [exe.submit(download_url_to_file, link, path) for link in links]
        # report results as they become available
        for future in as_completed(futures):
            # retrieve result
            link = future.result()
            # check for a link that was skipped
            print(f'Downloaded {link} to directory')
 
PATH = 'images'
filePath = "images.txt"
getInBulk(filePath, PATH) 

Any idea what I am doing wrong and how to fix it?

python multithreading python-requests urllib threadpoolexecutor
1 Answer
for link in open('images.txt').readlines():
    print([link])

Run this and you will see that every line ends with \n (a newline) except the last one. The newline is sent as part of the URL, so the first four requests come back with a status code that is not 200 (r.status_code), and download_url silently writes nothing for them.

Just change this line:

links = [link.strip() for link in open(filePath).readlines()]

and your code works for me.
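
As a side note, an equivalent fix (my suggestion, not part of the original answer) is read().splitlines(), which drops the line endings for you:

links = open(filePath).read().splitlines()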

As for why

"it prints out that all the images are being downloaded"

that happens because the target function you submit (download_url_to_file) just returns a string, which could represent anything, rather than a meaningful value such as the response status code. It returns the link whether or not the download succeeded, so the "Downloaded ..." message prints for every URL.
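
As an illustration, here is a minimal sketch of that idea (my own variation, not the answer's code): download_url returns the HTTP status code, download_url_to_file passes it along with the link, and the as_completed loop reports success or failure accordingly. The headers are trimmed to just a User-Agent for brevity:

from concurrent.futures import ThreadPoolExecutor, as_completed
from os import makedirs
from os.path import basename, join
import shutil

import requests

def download_url(urlpath, dir):
    headers = {"User-Agent": "Mozilla/5.0"}
    r = requests.get(urlpath, headers=headers, stream=True)
    if r.status_code == 200:
        with open(join(dir, basename(urlpath)), 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
    # return the status code either way so the caller can tell what happened
    return r.status_code

def download_url_to_file(link, path):
    # return the link together with its status code
    return link, download_url(link, path)

def getInBulk(filePath, path):
    makedirs(path, exist_ok=True)
    # strip() removes the trailing newline from each line
    links = [link.strip() for link in open(filePath).readlines()]
    print(f'Found {len(links)} links')
    with ThreadPoolExecutor(max_workers=20) as exe:
        futures = [exe.submit(download_url_to_file, link, path) for link in links]
        for future in as_completed(futures):
            link, status = future.result()
            # the progress report now reflects what actually happened
            if status == 200:
                print(f'Downloaded {link}')
            else:
                print(f'Failed to download {link} (HTTP {status})')

getInBulk('images.txt', 'images')

With this change, a failed download prints a "Failed" line instead of a misleading "Downloaded" one.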
