Our work requires us to stay up to date with the latest customs regulations. Instead of visiting the websites manually, I tried to build a simple web scraper that visits a set of defined websites, grabs the latest items there, and writes them to an Excel file.
import requests
import pandas as pd
import regex as re
import openpyxl
from bs4 import BeautifulSoup

urls = ['https://www.evofenedex.nl/actualiteiten/', 'https://douaneinfo.nl/index.php/nieuws']

myworkbook = openpyxl.load_workbook('output.xlsx')
worksheet = myworkbook.get_sheet_by_name('Output')

for index, url in enumerate(urls):
    response = requests.get(url)
    if response.status_code == 200:
        # empty array to store the links in
        links = []
        # evofenedex
        if index == 0:
            # Parse the HTML content
            soup = BeautifulSoup(response.text, 'html.parser')
            # Find the elements containing the news items
            news_items = soup.find_all('div', class_='block__content')
            title_element = soup.find_all('p', class_="list__title")
            date_element = soup.find_all('p', class_="list__subtitle")
            x = 0
            link_elements = []
            for titles in title_element:
                link_elements.append(soup.find_all('a', title=title_element[x].text))
                x = x + 1
            for link_element in link_elements:
                reg_str = re.findall(r'"(.*?)"', str(link_element))
                links.append(f"www.evofenedex.nl{reg_str[1]}")
        # douaneinfo
        if index == 1:
            # Parse the HTML content
            soup = BeautifulSoup(response.text, 'html.parser')
            news_items = soup.find_all('div', class_='content-category')
            for item in news_items:
                title_element = soup.find_all('th', class_="list-title")
                date_element = soup.find_all('td', class_="list-date small")
                for element in title_element:
                    element_string = str(element)
                    reg_str = re.findall(r'"(.*?)"', element_string)[2]
                    links.append(f"www.douaneinfo.nl{reg_str}")
        if title_element and date_element:
            y = 0
            x = 1
            z = 1
            # Loops through elements to add them to the excel file
            for element in title_element:
                titleX = element.text.strip()
                date = date_element[y].text.strip()
                link = links[y]
                cellref = worksheet.cell(row = x, column = z)
                cellref.value = titleX
                z = z + 1
                cellref = worksheet.cell(row = x, column = z)
                cellref.value = date
                z = z + 1
                cellref = worksheet.cell(row = x, column = z)
                cellref.value = link
                z = 1
                y = y + 1
                x = x + 1

myworkbook.save('output.xlsx')
print('The scraping is complete')
The problem I am running into is that for the first website I don't get the latest news; instead it starts with items from a few months ago.
If you visit the first website, the first row of data I am scraping is (currently) on the second page of the news URL.
The data comes from an API and is rendered dynamically, so there are at least two options:

Fetch your data through the API with requests:
import requests

json_data = requests.get('https://www.evofenedex.nl/api/v1/pages/news?page=1').json()

for item in json_data.get('Resources'):
    print(
        item.get('Resource').get('Title'),
        item.get('Resource').get('Created'),
        item.get('Resource').get('AbsolutePath')
    )
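Since the end goal was an Excel file, the API response can be written out directly instead of scraping the HTML. Below is a minimal sketch, assuming the JSON keeps the Resources/Resource structure shown above and that the page query parameter paginates the listing the same way; the number of pages fetched and the output filename latest_news.xlsx are placeholders:

import requests
import pandas as pd

rows = []
# Walk the first two result pages of the news API (assumed pagination scheme)
for page in (1, 2):
    json_data = requests.get(f'https://www.evofenedex.nl/api/v1/pages/news?page={page}').json()
    for item in json_data.get('Resources', []):
        resource = item.get('Resource', {})
        rows.append({
            'title': resource.get('Title'),
            'date': resource.get('Created'),
            'link': f"https://www.evofenedex.nl{resource.get('AbsolutePath')}",
        })

# pandas uses openpyxl under the hood to write the sheet
pd.DataFrame(rows).to_excel('latest_news.xlsx', sheet_name='Output', index=False)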
Or use selenium, which mimics a browser, and wait for the content to render before processing it.
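A rough sketch of that second option, assuming the rendered page still contains the p.list__title and p.list__subtitle elements targeted by the original scraper (the ten-second timeout and the use of Chrome are arbitrary choices):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get('https://www.evofenedex.nl/actualiteiten/')
    # Wait until the JavaScript-rendered news titles appear in the DOM
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.list__title'))
    )
    titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'p.list__title')]
    dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'p.list__subtitle')]
    for title, date in zip(titles, dates):
        print(title, date)
finally:
    driver.quit()

From there the rows can be written to the workbook the same way as in the original script.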