I'm trying to scrape the Daily News website. I can't get all the news headlines and links for a given keyword, e.g. "barbie" — the results that appear after pressing "Load more" never show up in my scrape. What can I do differently to get all of the headlines and links?
import requests
from bs4 import BeautifulSoup
from typing import List, Tuple

def get_news_of(keyword: str) -> List[Tuple[str, str]]:
    base_url = "https://www.dailynews.com"
    search_url = f"{base_url}/?s={keyword}"
    response = requests.get(search_url)
    headlines = []
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("article")
        for article in articles:
            headline_elem = article.find("h2", class_="entry-title")
            if headline_elem:
                headline_text = headline_elem.get_text().strip()
                article_link = article.find("a")["href"].strip()
                if keyword.lower() in headline_text.lower():
                    headlines.append((article_link, headline_text))
    return headlines
When you press the "Load more" button, the page sends a request to https://www.dailynews.com/page/2/?s=keyword. So you can add a page-number variable to your code and loop over the pages, incrementing it until there are no more pages left.

Your updated code should look like this:
import requests
from bs4 import BeautifulSoup
from typing import List, Tuple

def get_news_of(keyword: str) -> List[Tuple[str, str]]:
    base_url = "https://www.dailynews.com"
    page_number = 1
    headlines = []
    while True:
        search_url = f"{base_url}/page/{page_number}/?s={keyword}"
        response = requests.get(search_url)
        if response.status_code == 404:  # the site returns 404 once page_number exceeds the last page
            break
        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("article")
        for article in articles:
            headline_elem = article.find("h2", class_="entry-title")
            if headline_elem:
                headline_text = headline_elem.get_text().strip()
                article_link = article.find("a")["href"].strip()
                if keyword.lower() in headline_text.lower():
                    headlines.append((article_link, headline_text))
        page_number += 1
    return headlines
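The stop condition is the part worth getting right. The same "increment the page until the server signals there is nothing more" loop can be sketched in isolation, with a hypothetical fetch callable standing in for requests.get — here, returning None plays the role of a 404 response, so you can check the pagination logic without touching the network:

```python
from typing import Callable, List, Optional

def collect_pages(fetch: Callable[[int], Optional[List[str]]]) -> List[str]:
    """Accumulate items page by page until fetch signals no more pages (None)."""
    items: List[str] = []
    page = 1
    while True:
        batch = fetch(page)
        if batch is None:  # analogous to getting a 404 back
            break
        items.extend(batch)
        page += 1
    return items

# Hypothetical fetcher: three pages of results, then "404" (None) for page 4+
pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
result = collect_pages(lambda p: pages.get(p))
# result == ["a", "b", "c", "d", "e"]
```

One practical caveat: some sites return 200 with an empty result list on out-of-range pages instead of 404, so if the loop never terminates, also break when find_all("article") comes back empty.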