Can't get my script to fetch links from the next pages of a stubborn website

Question

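I've written a Python script that scrapes only the links to different restaurants from a website whose results span multiple pages. The text in the top-right corner shows how many results there are, for example:

Showing 1-30 of 18891

But I can't get past this link, either manually or with the script. The site advances its results by 30 on each page.

What I've tried so far: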

import requests
from bs4 import BeautifulSoup

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start={}'

for page in range(960,1920,30): # modified the range to reproduce the issue

    resp = requests.get(link.format(page),headers={"User-Agent":"Mozilla/5.0"})

    print(resp.status_code,resp.url)

    soup = BeautifulSoup(resp.text, "lxml")
    for items in soup.select("li[class^='lemon--li__']"):

        if not items.select_one("h3 > a[href^='/biz/']"):continue
        lead_link = items.select_one("h3 > a[href^='/biz/']").get("href")
        print(lead_link)

The script above only fetches links from the landing page.

How can I get the links from the other pages?

Answer 1

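There is no data beyond that page: the range in your script starts at offset 960, which is already past the last page of results.

Your code should be changed to the following: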

import requests
from bs4 import BeautifulSoup

link = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start={}"

for page in range(0, 960, 30):  # start at the first page; offsets advance by 30

    resp = requests.get(link.format(page), headers={"User-Agent": "Mozilla/5.0"})

    print(resp.status_code, resp.url)

    soup = BeautifulSoup(resp.text, "lxml")
    for items in soup.select("li[class^='lemon--li__']"):

        # Skip list items that do not contain a business link
        anchor = items.select_one("h3 > a[href^='/biz/']")
        if not anchor:
            continue
        lead_link = anchor.get("href")
        print(lead_link)

Answer 2

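Yelp is deliberately stopping you here. It is trying to prevent exactly what you are doing, since I expect a lot of people try to write crawlers for their site.

https://www.yelp.com/robots.txt even has a whimsical introduction, and it specifically says that if you want to crawl the site you should contact them.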
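As a rough illustration of how a crawler can consult robots.txt before requesting a page, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` are made up for the example and are not a copy of Yelp's actual file; `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only (NOT Yelp's real robots.txt).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


search_url = "https://www.yelp.com/search?find_desc=Restaurants&start=0"
biz_url = "https://www.yelp.com/biz/some-restaurant"

print(is_allowed(SAMPLE_ROBOTS_TXT, search_url))
print(is_allowed(SAMPLE_ROBOTS_TXT, biz_url))
```

To check a live site instead of a sample string, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before `can_fetch()` is called.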

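So if you really need the data, contact them, or try something else that might go unnoticed, such as filtering by suburb as suggested in the comments.

Either way, the short answer is that Yelp does not allow what you are trying to do, so it is not possible this way.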

Answer 3

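For these specific search parameters the site only has 24 pages, so I suggest checking whether a next page exists before requesting it.

This stops you from sending requests for results that do not exist, as the last page of results for the link you sent shows.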
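One way to avoid requesting pages that do not exist can be sketched as follows. Rather than looking for a next-page link (its markup is not shown here), this variant simply stops at the first page that yields no links, on the assumption that a page past the end returns none. `iter_result_links`, `fetch_links`, and `fake_fetch` are hypothetical names; `fake_fetch` only simulates a site with 24 pages so the sketch runs offline.

```python
from typing import Callable, Iterator, List


def iter_result_links(
    fetch_links: Callable[[int], List[str]],
    step: int = 30,
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield links page by page, stopping at the first page with no links.

    fetch_links is a hypothetical callable: given a `start` offset, it
    returns the links found on that page (an empty list past the end).
    """
    for page_index in range(max_pages):
        links = fetch_links(page_index * step)
        if not links:
            # Assumption: an empty page means we ran past the last page.
            break
        yield from links


def fake_fetch(start: int) -> List[str]:
    """Stand-in fetcher: pretend the site has 24 pages with one link each."""
    return [f"/biz/example-{start}"] if start < 24 * 30 else []


links = list(iter_result_links(fake_fetch))
print(len(links), links[-1])
```

In real use, `fetch_links` could wrap the `requests.get` and `soup.select` logic from the question and return the collected `href` values for one offset.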

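If you get a 403 or a similar error, set the User-Agent header along with other headers. You can also use the requestez library, which sets these headers automatically.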
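A minimal sketch of attaching a browser-like User-Agent and a few other headers to a request. It uses the standard library's `urllib.request` in place of `requests` so that it runs without third-party packages, and it only builds the request without sending it. The header values and the `build_request` helper are illustrative assumptions; nothing here guarantees that the site will accept the request.

```python
from typing import Dict, Optional
from urllib.request import Request

# Illustrative header values (assumptions, not values known to satisfy Yelp).
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str, extra_headers: Optional[Dict[str, str]] = None) -> Request:
    """Build (but do not send) a request carrying browser-like headers."""
    headers = dict(DEFAULT_HEADERS)
    headers.update(extra_headers or {})
    return Request(url, headers=headers)


req = build_request("https://www.yelp.com/search?find_desc=Restaurants&start=0")
# urllib stores header names capitalized, e.g. "User-agent".
print(req.header_items())
```

With `requests`, the same dictionary can be passed directly as `headers=DEFAULT_HEADERS` to `requests.get`, just as the question does with its single User-Agent entry.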
