Trying to scrape the address of any place or restaurant from the Google results page, but no luck

Question  Votes: 1  Answers: 1

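I am trying to scrape a restaurant's address from the information panel on the Google results page, but I get "urllib.error.HTTPError: HTTP Error 403: Forbidden" and the program does not run. I am new to web scraping with Python; please help.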

    import urllib.request, urllib.parse, urllib.error
    from bs4 import BeautifulSoup
    import ssl
    import json
    import re
    import sys
    import warnings

    if not sys.warnoptions:
        warnings.simplefilter("ignore")

    # Build the Google search URL.
    url = "https://www.google.com/search?q=barbeque%20nation%20-%20noida"
    request = urllib.request.Request(url)
    # This call raises urllib.error.HTTPError: HTTP Error 403: Forbidden.
    response = urllib.request.urlopen(request)

    # Read the raw bytes and parse them with the built-in HTML parser.
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')

    hotel_json = {}

    # Take the name and street address from the first JSON-LD block.
    for line in soup.find_all('script', attrs={"type": "application/ld+json"}):
        details = json.loads(line.text.strip())

        hotel_json["name"] = details["name"]
        hotel_json["address"] = details["address"]["streetAddress"]

        break

    # Save the raw page and the extracted fields.
    with open(hotel_json["name"] + ".html", "wb") as file:
        file.write(html)

    with open(hotel_json["name"] + ".json", 'w') as outfile:
        json.dump(hotel_json, outfile, indent=4)
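
The extraction step in the code above can be tried offline, without contacting Google. The following is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser so it has no third-party dependencies. The class JsonLdCollector, the function extract_address, and the sample page are hypothetical names and data made up for illustration; whether Google's results page actually contains a JSON-LD block with these fields is not confirmed by the question.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_address(html_text):
    """Return (name, street address) from the first usable JSON-LD block, or None."""
    collector = JsonLdCollector()
    collector.feed(html_text)
    for block in collector.blocks:
        try:
            details = json.loads(block)
        except json.JSONDecodeError:
            continue  # Skip blocks that are not valid JSON.
        address = details.get("address") if isinstance(details, dict) else None
        if isinstance(address, dict) and "streetAddress" in address:
            return details.get("name"), address["streetAddress"]
    return None


# Made-up sample page, used only to exercise the parser offline.
SAMPLE = """<html><head><script type="application/ld+json">
{"name": "Example Grill", "address": {"streetAddress": "1 Sample Road"}}
</script></head><body></body></html>"""

print(extract_address(SAMPLE))
```

Returning None when no block matches avoids the KeyError that a direct details["address"]["streetAddress"] lookup would raise on a page without that data.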
python python-3.x web-scraping beautifulsoup google-crawlers
1 Answer

0 votes

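Add a User-Agent header to the request. Without one, urllib identifies itself with its default Python-urllib agent string, which Google appears to reject with a 403.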

    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
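
As a slightly fuller sketch of the same idea, the request can be built by a small helper and the fetch wrapped so that an HTTP error is reported instead of crashing the script. The names build_request and fetch are hypothetical helpers introduced here for illustration, and the 'Mozilla/5.0' value is simply the one from the answer; there is no guarantee that Google will accept it or return the same markup a browser sees.

```python
import urllib.error
import urllib.parse
import urllib.request

# Assumption: the short agent string from the answer is enough to avoid the 403.
USER_AGENT = "Mozilla/5.0"


def build_request(query, user_agent=USER_AGENT):
    """Return a Request for a Google search with a User-Agent header set."""
    url = "https://www.google.com/search?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(request, timeout=10):
    """Return the response body as bytes, or None if the server returns an HTTP error."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # A 403 here would mean the server still refuses the request.
        print(f"HTTP {err.code}: {err.reason}")
        return None


# Building the request does not touch the network; only fetch() does.
request = build_request("barbeque nation - noida")
print(request.full_url)
print(request.get_header("User-agent"))
```

Calling fetch(request) then performs the actual download. Note that urllib stores header names with only the first letter capitalized, which is why the lookup uses "User-agent".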