I'm trying to scrape the Daily News website. I can't get all the news headlines and links for a given keyword, e.g. "barbie" — the results that appear after pressing "Load more" never show up in my scrape. What can I do differently to get all of the headlines and links?
import requests
from bs4 import BeautifulSoup
from typing import List, Tuple

def get_news_of(keyword: str) -> List[Tuple[str, str]]:
    base_url = "https://www.dailynews.com"
    search_url = f"{base_url}/?s={keyword}"
    response = requests.get(search_url)
    headlines = []
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("article")
        for article in articles:
            headline_elem = article.find("h2", class_="entry-title")
            if headline_elem:
                headline_text = headline_elem.get_text().strip()
                article_link = article.find("a")["href"].strip()
                if keyword.lower() in headline_text.lower():
                    headlines.append((article_link, headline_text))
    return headlines
When you press the "Load more" button, the page sends a request to https://www.dailynews.com/page/2/?s=keyword. So you can add a page-number variable to your code and loop over the pages, incrementing it until there are no more pages left.

Your updated code should look like this:
import requests
from bs4 import BeautifulSoup
from typing import List, Tuple

def get_news_of(keyword: str) -> List[Tuple[str, str]]:
    base_url = "https://www.dailynews.com"
    page_number = 1
    headlines = []
    while True:
        search_url = f"{base_url}/page/{page_number}/?s={keyword}"
        response = requests.get(search_url)
        if response.status_code == 404:  # the site returns 404 once page_number exceeds the last page
            break
        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("article")
        for article in articles:
            headline_elem = article.find("h2", class_="entry-title")
            if headline_elem:
                headline_text = headline_elem.get_text().strip()
                article_link = article.find("a")["href"].strip()
                if keyword.lower() in headline_text.lower():
                    headlines.append((article_link, headline_text))
        page_number += 1
    return headlines
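The stop condition is the part worth getting right. The same "increment the page until the server signals there is nothing more" loop can be sketched in isolation, with a hypothetical fetch callable standing in for requests.get — here, returning None plays the role of a 404 response, so you can check the pagination logic without touching the network:

```python
from typing import Callable, List, Optional

def collect_pages(fetch: Callable[[int], Optional[List[str]]]) -> List[str]:
    """Accumulate items page by page until fetch signals no more pages (None)."""
    items: List[str] = []
    page = 1
    while True:
        batch = fetch(page)
        if batch is None:  # analogous to getting a 404 back
            break
        items.extend(batch)
        page += 1
    return items

# Hypothetical fetcher: three pages of results, then "404" (None) for page 4+
pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
result = collect_pages(lambda p: pages.get(p))
# result == ["a", "b", "c", "d", "e"]
```

One practical caveat: some sites return 200 with an empty result list on out-of-range pages instead of 404, so if the loop never terminates, also break when find_all("article") comes back empty.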