Python web scraper is not picking up the latest information

Problem description

My job requires us to stay up to date on the latest customs regulations. Rather than visiting the websites by hand, I tried to build a simple web scraper that visits a defined set of sites, grabs the latest items there, and writes them to an Excel file.

import requests
import pandas as pd
import re  # the stdlib re module is sufficient here
import openpyxl
from bs4 import BeautifulSoup

urls = ['https://www.evofenedex.nl/actualiteiten/', 'https://douaneinfo.nl/index.php/nieuws']

myworkbook = openpyxl.load_workbook('output.xlsx')
worksheet = myworkbook['Output']  # get_sheet_by_name() is deprecated

for index, url in enumerate(urls):
    response = requests.get(url)
    if response.status_code == 200:
        # empty list to store the links in
        links = []

        #evofenedex    
        if index == 0:
            # Parse the HTML content
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Find the elements containing the news items
            news_items = soup.find_all('div', class_='block__content')
            title_element = soup.find_all('p', class_="list__title")
            date_element = soup.find_all('p', class_="list__subtitle")

            # Match each title to the <a> tag carrying it as a title attribute
            link_elements = []
            for title in title_element:
                link_elements.append(soup.find_all('a', title=title.text))

            for link_element in link_elements:
                # Pull the href out of the stringified tag; index 1 assumes
                # it is the second quoted attribute, which is fragile
                reg_str = re.findall(r'"(.*?)"', str(link_element))
                links.append(f"www.evofenedex.nl{reg_str[1]}")

        #douaneinfo
        if index == 1:
            # Parse the HTML content
            soup = BeautifulSoup(response.text, 'html.parser')

            # The title and date cells live in the category table, so one
            # pass over the document is enough (the original looped once
            # per news item and appended every link multiple times)
            title_element = soup.find_all('th', class_="list-title")
            date_element = soup.find_all('td', class_="list-date small")
            for element in title_element:
                # index 2 assumes the href is the third quoted attribute
                href = re.findall(r'"(.*?)"', str(element))[2]
                links.append(f"www.douaneinfo.nl{href}")

        if title_element and date_element:
            # Start below any existing rows so the second site does not
            # overwrite the first (the original restarted at row 1)
            row = worksheet.max_row + 1
            # Loop through the elements and add them to the Excel file
            for y, element in enumerate(title_element):
                worksheet.cell(row=row, column=1).value = element.text.strip()
                worksheet.cell(row=row, column=2).value = date_element[y].text.strip()
                worksheet.cell(row=row, column=3).value = links[y]
                row += 1

myworkbook.save('output.xlsx')
print('The scraping is complete')

The problem I'm running into is that the first website doesn't give me the latest information; instead it starts with items from a few months ago.

If you visit the first site, the first row of data I'm scraping is (currently) on the second page of the news URL.

Tags: python web-scraping beautifulsoup python-requests

1 Answer

The data comes from an API and is rendered dynamically, so you have at least two options:

  1. Fetch your data directly from the API with requests:

    import requests
    
    json_data = requests.get('https://www.evofenedex.nl/api/v1/pages/news?page=1').json()
    
    for item in json_data.get('Resources'):
        print(
            item.get('Resource').get('Title'),
            item.get('Resource').get('Created'),
            item.get('Resource').get('AbsolutePath')
        )

  2. Use selenium to mimic a browser and wait for the content to render before processing it.
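Building on option 1, the API response could be flattened into (title, date, link) rows before writing them to the workbook, mirroring the three columns the original script fills. This is a minimal sketch: the helper name `extract_news_items` and the sample payload are made up for illustration, and only the field names (`Resources`, `Resource`, `Title`, `Created`, `AbsolutePath`) come from the API example above.

```python
def extract_news_items(json_data, base="https://www.evofenedex.nl"):
    """Flatten the assumed API payload into (title, created, url) tuples."""
    items = []
    for entry in json_data.get("Resources", []):
        res = entry.get("Resource", {})
        items.append((
            res.get("Title"),
            res.get("Created"),
            # AbsolutePath is assumed to be site-relative, e.g. "/actualiteiten/..."
            f"{base}{res.get('AbsolutePath')}",
        ))
    return items

# Sample payload, invented for illustration; the real API returns more fields
sample = {
    "Resources": [
        {"Resource": {"Title": "Example news item",
                      "Created": "2024-01-15",
                      "AbsolutePath": "/actualiteiten/example"}},
    ]
}

print(extract_news_items(sample))
# → [('Example news item', '2024-01-15', 'https://www.evofenedex.nl/actualiteiten/example')]
```

Each tuple can then be written as one worksheet row, which keeps the network/parsing code separate from the Excel-writing loop.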
