Python image scraper does not work properly on Bing


I am trying to build an image scraper. I first tried Google, but no images were scraped, so I switched to Bing. That worked, but there are some problems:

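  1. The scraped image links are only a small fraction of what the search engine shows.
  2. The scraped images come from an unknown page of the previews shown.
  3. By default, the images are scraped with the SafeSearch filter applied.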

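I want to scrape all of the images shown at bing.com/images/search (or at least those on certain pages), but in practice the script retrieves only a few.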
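Problems 1 and 3 both depend on the query string sent with the request. Here is a minimal sketch of how a result offset and a SafeSearch level could be passed as parameters. The names `first`, `count`, and `adlt` are assumptions based on what Bing's result URLs appear to use, not a documented API, and `build_search_params` is a hypothetical helper.

```python
def build_search_params(query, page=0, per_page=35, safe_search="off"):
    """Return query parameters for one page of image results.

    ASSUMPTION: `first`, `count` and `adlt` are the names Bing's image
    search appears to accept; they are not taken from official docs.
    """
    if page < 0 or per_page < 1:
        raise ValueError("page must be >= 0 and per_page >= 1")
    return {
        "q": query,
        "first": page * per_page + 1,  # 1-based index of the first result
        "count": per_page,
        "adlt": safe_search,
    }


# Parameters for the first three pages of a search for "cats"
pages = [build_search_params("cats", page=p) for p in range(3)]
```

Each dictionary can be passed as `params=` to `requests.get` in place of `{"q": search}`. Whether the server honors these values has to be checked against the actual responses.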

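After inspecting the page, I found that the image links are stored in elements with Bing's 'thumb' class, so I scraped every link with the thumb class, but that does not seem to be enough.

After viewing the page source, I found that only the thumb-class links actually end in .jpg.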
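The observation about the .jpg endings can be turned into a filter. This sketch keeps only links whose URL path ends in an image extension. `keep_image_links` and the extension list are illustrative assumptions, and the sample URLs are made up.

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def keep_image_links(hrefs):
    """Return only the hrefs whose URL path ends in an image extension."""
    kept = []
    for href in hrefs:
        if not href:  # l.get("href") returns None when the attribute is missing
            continue
        # Compare the path only, so query strings such as ?w=100 are ignored
        path = urlparse(href).path.lower()
        if path.endswith(IMAGE_EXTENSIONS):
            kept.append(href)
    return kept


sample = [
    "https://example.com/a/photo.JPG?w=100",
    "/images/search?q=cats",
    None,
    "https://example.com/b/picture.png",
]
print(keep_image_links(sample))
```

Checking the parsed path rather than the raw string avoids missing links that carry a query string after the file name.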

import os
import random
import string

import requests
from bs4 import BeautifulSoup


# Defined before the download loop; calling it before its definition
# raises NameError when the script runs top to bottom.
def generateRandomSequence():
    # Build a random file name from five (letter, number) pairs
    seq = ""
    for i in range(0, 5):
        seq = seq + random.choice(string.ascii_letters) + str(random.randrange(1, 1000))
    return seq


url = "https://www.bing.com"

search = input("enter the search term: ")
r = requests.get(url + "/images/search", params={"q": search})

soup = BeautifulSoup(r.content, "html.parser")

li = soup.find_all("a", class_="thumb")

# getting links from thumb class
links = [l.get("href") for l in li]

print("{0} results found with the search term: {1}".format(len(links), search))
choice = input("Do You Want To Extract The Images? Y or N ")
dir_name = "Result"

# Create the Result directory if it does not exist yet
if not os.path.isdir(dir_name):
    print("[+] Creating Directory Named '{0}'".format(dir_name))
    os.mkdir(dir_name)

n = 1
if choice == 'Y' or choice == 'y':
    for i in links:
        req = requests.get(i)

        # title = links[z].split("/")[-1]
        # there were some issues with the default titles, so names generated
        # from a random sequence are used instead

        print("[+] Extracting Image #", n)
        with open("{0}/{1}.jpg".format(dir_name, generateRandomSequence()), "wb") as img:
            img.write(req.content)
        n += 1
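The comments in the code above mention that file names taken from the link caused problems. One possible alternative to fully random names is to sanitize the last path segment of the URL and fall back to a random name only when nothing usable remains. `filename_for` is a hypothetical helper, and the set of allowed characters is an assumption.

```python
import os
import uuid
from urllib.parse import urlparse, unquote


def filename_for(url, default_ext=".jpg"):
    """Build a file name from the last URL path segment, or a random one."""
    name = os.path.basename(unquote(urlparse(url).path))
    # Keep only characters that are safe on common file systems
    name = "".join(c for c in name if c.isalnum() or c in "._-")
    stem, ext = os.path.splitext(name)
    if not stem:
        stem = uuid.uuid4().hex  # random fallback when the URL has no file name
    return stem + (ext or default_ext)
```

For example, `filename_for("https://example.com/img/cat%20photo.png?x=1")` yields `catphoto.png`, while a URL with no file name in its path gets a random 32-character hexadecimal stem with `.jpg` appended.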
python web beautifulsoup python-requests bing
1 Answer

Here is a scraper for you:
