I have a file named images.txt that is simply a list of image URLs, one per line:
https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/Glenn_Jacobs_%2853122237030%29_-_Cropped.jpg/440px-Glenn_Jacobs_%2853122237030%29_-_Cropped.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Kane2003.jpg/340px-Kane2003.jpg
https://upload.wikimedia.org/wikipedia/commons/7/7a/Steel_Cage.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Kane_2008.JPG/340px-Kane_2008.JPG
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Brothers_of_Destruction.jpg/440px-Brothers_of_Destruction.jpg
I want to start a thread pool executor with 20 worker threads that downloads each image, in a separate thread, into a local subdirectory named "images". Here is the code I am trying. The problem is that it prints that every image is being downloaded, but in the end only one image, the last one in the list, is actually saved; the rest never get downloaded.
from os import makedirs
from os.path import basename
from os.path import join
import shutil
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
import requests

# download the file at a URL and save it into the given directory
def download_url(urlpath, dir):
    # set browser-like headers for the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9"
    }
    r = requests.get(urlpath, headers=headers, stream=True)
    filename = basename(urlpath)
    outpath = join(dir, filename)
    if r.status_code == 200:
        with open(outpath, 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)

# download one file to a local directory
def download_url_to_file(link, path):
    download_url(link, path)
    return link

# download all files listed in filePath to the provided path
def getInBulk(filePath, path):
    # create a local directory to save files
    makedirs(path, exist_ok=True)
    # read the list of URLs, one per line
    links = open(filePath).readlines()
    # report progress
    print(f'Found {len(links)} links')
    # create the pool of worker threads
    with ThreadPoolExecutor(max_workers=20) as exe:
        # dispatch all download tasks to worker threads
        futures = [exe.submit(download_url_to_file, link, path) for link in links]
        # report results as they become available
        for future in as_completed(futures):
            # retrieve the result
            link = future.result()
            print(f'Downloaded {link} to directory')

PATH = 'images'
filePath = "images.txt"
getInBulk(filePath, PATH)
Any idea what I am doing wrong and how to fix it?
Run this:

for link in open('images.txt').readlines():
    print([link])

and you will see that every line ends with \n (a newline character) except the last one. Because of that trailing newline, the first four requests do not get a 200 response (r.status_code), so nothing is written for them.
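To see concretely why the trailing newline matters: it ends up in both the URL sent to the server and the filename derived with basename(). A small illustration, using made-up example.com URLs and no network access:

```python
from os.path import basename

# raw readlines() output: all but the last line keep their newline
lines = ["https://example.com/a.jpg\n", "https://example.com/b.jpg"]

# the newline survives into the derived filename (and the request URL)
print([basename(line) for line in lines])   # ['a.jpg\n', 'b.jpg']

# stripping each line first yields clean URLs and filenames
links = [line.strip() for line in lines]
print([basename(link) for link in links])   # ['a.jpg', 'b.jpg']
```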
Just change this line:

links = [link.strip() for link in open(filePath).readlines()]

and your code works for me.
It prints that every image is being downloaded because the "target function" you call (download_url_to_file) just returns the link string, which can represent anything, rather than a meaningful value such as the response status code. So the progress loop reports "Downloaded" even for requests that failed.
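One way to make those progress messages honest (a sketch, not the only way): have the target function return the HTTP status alongside the link, and only report success on 200. The fake_fetch stub below is a hypothetical stand-in for the real requests.get call, so the idea can be shown without touching the network:

```python
# return (link, status) so the caller can tell success from failure;
# `fetch` is a stand-in for the real HTTP call, e.g. requests.get(...).status_code
def download_url_to_file(link, path, fetch):
    status = fetch(link)
    return link, status

# turn (link, status) pairs into honest progress messages
def report(results):
    return [f"Downloaded {link}" if status == 200 else f"Failed {link} ({status})"
            for link, status in results]

# stub: URLs with a trailing newline come back 404, clean ones 200
def fake_fetch(url):
    return 404 if url.endswith("\n") else 200

results = [download_url_to_file(u, "images", fake_fetch)
           for u in ["https://example.com/a.jpg\n", "https://example.com/b.jpg"]]
print(report(results))
```

With this shape, the as_completed loop in getInBulk can distinguish a download that actually succeeded from one that silently got a non-200 response.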