How can I fix my code to scrape the Zomato website?

Question    Votes: -1    Answers: 1

I wrote the code below, but after running the last line I get the error "IndexError: list index out of range". How can I fix this?

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)

    content = response.content
    soup = BeautifulSoup(content,"html.parser")

    top_rest = soup.find_all("div",attrs={"class": "sc-bblaLu dOXFUL"})
    list_tr = top_rest[0].find_all("div",attrs={"class": "sc-gTAwTn cKXlHE"})

    list_rest = []
    for tr in list_tr:
        dataframe = {}
        dataframe["rest_name"] = (tr.find("div",attrs={"class": "res_title zblack bold nowrap"})).text.replace('\n', ' ')
        dataframe["rest_address"] = (tr.find("div",attrs={"class": "nowrap grey-text fontsize5 ttupper"})).text.replace('\n', ' ')
        dataframe["cuisine_type"] = (tr.find("div",attrs={"class":"nowrap grey-text"})).text.replace('\n', ' ')
        list_rest.append(dataframe)
    list_rest
python python-3.x web-scraping data-science web-scraping-language
1 Answer
0 votes

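You get this error because top_rest is empty when you try to take its first element, top_rest[0]. The reason is that the first class you reference is dynamically named: if you refresh the page, the div in the same position no longer has the same class name. So when you scrape the page, find_all matches nothing and returns an empty result.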
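To make that failure easier to spot, it may help to check the result before indexing into it. The snippet below is only a sketch: first_or_none is a hypothetical helper (it is not part of BeautifulSoup), and the empty list stands in for what soup.find_all returned in the question.

```python
def first_or_none(matches):
    """Return the first item of a find_all() result, or None if it is empty."""
    return matches[0] if matches else None


# Stand-in for soup.find_all("div", attrs={"class": "sc-bblaLu dOXFUL"})
# when the class name no longer exists on the page: an empty result.
top_rest = []

container = first_or_none(top_rest)
if container is None:
    # Indexing top_rest[0] here would raise IndexError instead.
    print("No matching div found; the class name has probably changed.")
```

With a check like this, the script reports the missing element instead of stopping with an IndexError.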

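One alternative is to grab all the divs and then narrow them down to the elements you need. Pay attention to the naming pattern of the dynamic divs, because the class names, and therefore your results, can differ from one request to the next: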

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)

    content = response.content
    soup = BeautifulSoup(content,"html.parser")

    # Grab every div on the page instead of relying on one dynamic class name
    top_rest = soup.find_all("div")
    # Narrow down inside the first div; this class name may differ on your request
    list_tr = top_rest[0].find_all("div",attrs={"class": "bke1zw-1 eMsYsc"})
    list_tr
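As a rough illustration of narrowing by naming pattern, the sketch below matches on a single class token instead of the full class string. Everything in it is an assumption made for the example: the HTML is made up, and whether a token such as bke1zw-1 actually stays stable on the real page is something you would have to verify yourself.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page. The second class token
# on each div imitates a name that changes between requests.
html = """
<div class="bke1zw-0 kQzXpA">
  <div class="bke1zw-1 eMsYsc">Restaurant A</div>
  <div class="bke1zw-1 hTrWqZ">Restaurant B</div>
</div>
<div class="footer">About</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ with a single token matches any div that has that token among
# its classes, regardless of the other (changing) token.
cards = soup.find_all("div", class_="bke1zw-1")
names = [card.get_text(strip=True) for card in cards]
print(names)
```

Passing the full string "bke1zw-1 eMsYsc" through attrs only matches divs whose class attribute is exactly that string, so it would miss the second card here.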