Scraping a web page with "Load more"

Problem description

I'm trying to scrape the Daily News website (dailynews.com). I can't collect all of the news headlines and links for a given keyword, e.g. "barbie": anything that only appears after clicking "Load more" is missed. What can I do differently to get all the headlines and links?

from typing import List, Tuple

import requests
from bs4 import BeautifulSoup

def get_news_of(keyword: str) -> List[Tuple[str, str]]:
    base_url = "https://www.dailynews.com"
    search_url = f"{base_url}/?s={keyword}"
    response = requests.get(search_url)
    
    headlines = []
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("article")

        for article in articles:
            headline_elem = article.find("h2", class_="entry-title")
            if headline_elem:
                headline_text = headline_elem.get_text().strip()
                article_link = article.find("a")["href"].strip()
                if keyword.lower() in headline_text.lower():
                    headlines.append((article_link, headline_text))

    return headlines

python web-scraping beautifulsoup
1 Answer

When you press the load more button, the page fetches https://www.dailynews.com/page/2/?s=keyword (and /page/3/, and so on). So you can add a page-number variable to your code and loop over the pages until you have fetched them all.
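
You can sanity-check that URL pattern before rewriting the function. This is just a throwaway probe, assuming (as observed on this site) that it answers 404 once the page number runs past the last results page:

import requests

# Probe a few page numbers directly and print the HTTP status.
# Assumption: dailynews.com returns 404 for pages past the last results page.
for page in (1, 2, 9999):
    url = f"https://www.dailynews.com/page/{page}/?s=barbie"
    resp = requests.get(url, timeout=10)
    print(page, resp.status_code)  # expect 200 for real pages, 404 past the end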

Your updated code should look like this:

from typing import List, Tuple

import requests
from bs4 import BeautifulSoup

def get_news_of(keyword: str) -> List[Tuple[str, str]]:
    base_url = "https://www.dailynews.com"
    page_number = 1
    headlines = []

    while True:
        search_url = f"{base_url}/page/{page_number}/?s={keyword}"
        response = requests.get(search_url)
        
        if response.status_code == 404:  # This happens if the page number is > the number of pages
            break

        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("article")

        for article in articles:
            headline_elem = article.find("h2", class_="entry-title")
            if headline_elem:
                headline_text = headline_elem.get_text().strip()
                article_link = article.find("a")["href"].strip()
                if keyword.lower() in headline_text.lower():
                    headlines.append((article_link, headline_text))

        page_number += 1

    return headlines
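
A minimal usage sketch ("barbie" is just the keyword from the question; actual output depends on the live site):

if __name__ == "__main__":
    # Print every (link, headline) pair found across all result pages.
    for link, headline in get_news_of("barbie"):
        print(headline, "->", link)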