Difficulty retrieving the highest-quality image from Wikimedia with the get_main_image function


I'm having trouble with the get_main_image function in a Python script that scrapes images from Wikimedia. The problem is that the function downloads a smaller image rather than the highest-quality version available.

Here is a brief overview of the problem:

  • The get_main_image function is responsible for retrieving and saving images from Wikimedia.
  • However, it consistently downloads smaller or lower-quality versions of the images.
  • My goal is to modify the function so that it retrieves the largest, sharpest version of each image available on Wikimedia.

I suspect there is a flaw either in how the function identifies and fetches the image URL, or in how it selects among the available image qualities.

Below is a simplified version of the get_main_image function:

import requests
from io import BytesIO

from bs4 import BeautifulSoup as bs
from PIL import Image

# access_token is defined elsewhere in the script

def get_main_image(wiki_link, article, save_dir, IMAGE_NUM):
    headers = {
        "Authorization": f"Bearer {access_token}",
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    }
    image_url = wiki_link + article.replace(" ", "_")
    image_name = str(IMAGE_NUM + 1)
    response = requests.get(image_url)
    soup = bs(response.text, 'html.parser')

    try:
        # Try the srcset attribute first, hoping it holds a higher-quality URL
        main_image_url = soup.find('img', alt=article).get('srcset')
        main_image_response = requests.get(url=main_image_url, headers=headers, stream=True)
    except Exception:
        # Fall back to the plain src attribute
        try:
            main_image_url = soup.find('img', alt=article).get('src')
            main_image_response = requests.get(url=main_image_url, headers=headers, stream=True)
        except Exception:
            return image_url, None

    if article[-4:] == ".svg":
        # Save SVG sources as PNG
        image = Image.open(BytesIO(main_image_response.content))
        image_name = image_name + ".png"
        image.save(save_dir + "//" + image_name)
    elif article[-5:] == ".djvu":
        # Save DjVu sources as JPEG
        image = Image.open(BytesIO(main_image_response.content))
        image_name = image_name + ".jpg"
        image.save(save_dir + "//" + image_name)
    else:
        # Keep the original file extension
        image = Image.open(BytesIO(main_image_response.content))
        image_name = image_name + article[-4:]
        try:
            image.save(save_dir + "//" + image_name)
        except Exception:
            image_name = None
    return image_url, image_name

Edit: For example, this image (i.e. the main image at the URL below) is 4.33 MB but downloads as 1.09 MB: https://commons.wikimedia.org/wiki/File:Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png
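While debugging I noticed that the src URL the fallback branch grabs is a scaled /thumb/ URL rather than the original file. If I understand the Wikimedia upload URL layout correctly (this is an assumption on my part, not something from the docs), the original might be recoverable by stripping the thumb segments. A minimal sketch:

```python
import re

def thumb_to_original(thumb_url):
    """Turn a Wikimedia /thumb/ URL into the original-file URL, assuming the
    standard layout .../thumb/<hash>/<file>/<size>px-<file>."""
    m = re.match(r"(.*)/thumb/(.+)/[^/]+$", thumb_url)
    # Non-thumb URLs pass through unchanged
    return f"{m.group(1)}/{m.group(2)}" if m else thumb_url

print(thumb_to_original(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Map.png/1280px-Map.png"
))
# https://upload.wikimedia.org/wikipedia/commons/7/7e/Map.png
```

I'm not sure this holds for every file type, which is why I'd prefer a supported way to get the full-size image.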

python web-scraping python-requests wikipedia
1 Answer

If I understand you correctly, you can use the Wikimedia Commons API to get the URL of the full-size image, for example:

import requests
from bs4 import BeautifulSoup

api_url = "https://magnus-toolserver.toolforge.org/commonsapi.php"
image_name = "Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png"

soup = BeautifulSoup(requests.get(api_url, params={"image": image_name}).content, "xml")
# print(soup.prettify())

print(soup.urls.file.text)

Prints:

https://upload.wikimedia.org/wikipedia/commons/7/7e/Map_of_Potential_Nuclear_Strike_Targets_%28c._2015%29%2C_FEMA.png
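As for why your scraping version keeps falling through to the low-resolution image: srcset is not a single URL but a comma-separated list of candidates, each with a density or width descriptor, and on Wikimedia the URLs are protocol-relative, so passing the raw attribute value to requests fails and you end up on the src fallback. A minimal sketch of picking the highest-density candidate, assuming the usual "url descriptor" format (the example URLs are made up):

```python
def largest_srcset_candidate(srcset):
    """Return the URL with the highest density/width descriptor from a srcset
    string such as '//a.png 1.5x, //b.png 2x'."""
    best_url, best_scale = None, 0.0
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        url, scale = parts[0], 1.0  # no descriptor means 1x
        if len(parts) > 1:
            try:
                # Strip the trailing 'x' (density) or 'w' (width) unit
                scale = float(parts[1].rstrip("xw"))
            except ValueError:
                pass
        if scale > best_scale:
            best_url, best_scale = url, scale
    return best_url

url = largest_srcset_candidate("//upload.example/a.png 1.5x, //upload.example/b.png 2x")
print("https:" + url)
# https://upload.example/b.png
```

Note that this still only yields the largest thumbnail the page happens to offer, not necessarily the original file, so the API approach above is the more reliable route to full quality.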