Sending a GET request to amazon.in, but the web server responds with status code 503. What should I do?

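The whole script ran fine the first 2-3 times, but now it keeps getting 503 responses.

I have checked my internet connection several times, and there is nothing wrong with it.

Here is my code: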

from bs4 import BeautifulSoup
import requests, sys, os, json

def get_amazon_search_page(search):
    search = search.strip().replace(" ", "+")
    for i in range(3): # try the request up to 3 times
        try:
            print("Searching...")
            # spaces in the search string were replaced with "+" above so it can be used in the URL
            response = requests.get("https://www.amazon.in/s?k={}&ref=nb_sb_noss".format(search))
            print(response.status_code)
            if response.status_code == 200:
                return response.content # return only the page body; main() passes it straight to the parser
        except requests.RequestException as exc:
            print("Request failed: {}".format(exc))
    print("Is the search valid for the site: https://www.amazon.in/s?k={}&ref=nb_sb_noss".format(search))
    sys.exit(1)

def get_items_from_page(page_content):
    print(page_content)
    soup = BeautifulSoup(page_content, "html.parser") # soup for extracting information
    items = soup.find_all("span", class_ = "a-size-medium a-color-base a-text-normal")
    prices = soup.find_all("span", class_ = "a-price-whole")
    item_list = []
    total_price_of_all = 0
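    for item, price in zip(items, prices):
        price_value = int(price.text.replace(",", "")) # strip thousands separators before converting
        entry = {} # avoid shadowing the built-in dict
        entry["Name"] = item.text
        entry["Price"] = price_value
        total_price_of_all += price_value
        item_list.append(entry)
    if not item_list: # nothing matched, so avoid dividing by zero below
        print("No items found on the page")
        return
    average_price = total_price_of_all/len(item_list)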
    with open("items.json", "w") as file: # the file is closed automatically on leaving the block
        json.dump(item_list, file, indent = 4)
    print("Your search results are available in the items.json file")
    print("Average prices for the search: {}".format(average_price))

def main():
    os.system("clear")
    print("Note: Sometimes amazon site misbehaves by sending 503 responses, this can be due to heavy traffic on that site, please cooperate\n\n")
    search = input("Enter product name: ").strip()
    page_content = get_amazon_search_page(search)
    get_items_from_page(page_content)

if __name__ == "__main__":
    while True:
        main()

Please help!

python-3.x beautifulsoup python-requests screen-scraping
1 Answer
The server is blocking you from scraping it. If you check robots.txt, you can see that the link you are trying to request is disallowed: Disallow: */s?k=*&rh=n*p_*p_*p_
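
Checking robots.txt rules can also be done programmatically. The following is a minimal sketch using the standard library's urllib.robotparser; the robots.txt content, the "my-bot" agent name, and the example.com URLs are hypothetical and are only there so the example runs offline. As far as I know, this parser matches rules by path prefix and does not interpret "*" wildcards inside a path, so a wildcard rule like the one quoted above would need a different checker.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not Amazon's real file), so no network is needed.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text instead of downloading
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/private/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/public/page"))
```

To check a live site instead, the same parser offers set_url() followed by read(), which downloads the file over the network.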

However, a simple way around this block is to change your User-Agent. By default, requests sends something like "python-requests/2.22.0". Changing it to something more browser-like will work temporarily.
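
A rough sketch of how the User-Agent could be overridden with requests is shown here. The header value is an assumed, illustrative browser-like string (any real browser's value could be substituted), and build_search_request and fetch_search_page are hypothetical helper names, not part of the question's script. The request is only prepared, not sent, so the headers can be inspected offline.

```python
import requests

# Assumed, illustrative browser-like User-Agent; not a value taken from the original answer.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}


def build_search_request(session: requests.Session, search: str) -> requests.PreparedRequest:
    """Prepare (but do not send) a search request that carries the session's headers."""
    request = requests.Request(
        "GET",
        "https://www.amazon.in/s",
        params={"k": search, "ref": "nb_sb_noss"},  # requests URL-encodes the query string
    )
    return session.prepare_request(request)


def fetch_search_page(search: str) -> requests.Response:
    """Send the prepared request; this performs real network access."""
    with requests.Session() as session:
        session.headers.update(BROWSER_HEADERS)
        return session.send(build_search_request(session, search), timeout=10)


session = requests.Session()
session.headers.update(BROWSER_HEADERS)  # replaces the default python-requests/x.y.z value
prepared = build_search_request(session, "usb cable")
print(prepared.headers["User-Agent"])
print(prepared.url)
```

Passing headers=BROWSER_HEADERS directly to requests.get() has the same effect for a single call; a Session is used here only so the header is set once and reused across retries.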
