Advice on scraping a website with Python


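I am trying to scrape the following website, and I want to extract three things from it: 1. the href (hyperlink), 2. the publication date, 3. the article description.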

I managed to scrape the href, but I am struggling to scrape the publication date and the article description. Here is the code I am using:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://orangecyberdefense.com/global/blog/')
soup = BeautifulSoup(page.content, 'html.parser')

main_table = soup.find('section', attrs={'class':'section articles'})
links = main_table.find_all('a')

Hyperlinks = []
Date = []
Description = []

for link in links:
    Hyperlinks.append(link.attrs['href'])
    Date.append(link.attrs['time'])
    Description.append(link.attrs['description'])

How should I extract the date and the description?

python python-3.x beautifulsoup
2 Answers
0
votes

In this case you can use zip() to iterate over the titles, dates, and descriptions in parallel.

For example:

import requests
from bs4 import BeautifulSoup

url = 'https://orangecyberdefense.com/global/blog/'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for title, tm, desc in zip(soup.select('p.card-title'), soup.select('time'), soup.select('.description')):
    print(title.get_text(strip=True), tm.get_text(strip=True))
    print('-' * 80)
    print(desc.get_text(strip=True))
    print()

Prints:

Let's examine Cisco Webex - A visionary player 21 May. 2020
--------------------------------------------------------------------------------
CISCO WebEx is a common solution for webinars and videoconferencing. Does it live up to its reputation regarding security?

In-depth product analysis - Zoom & Microsoft Teams 07 May. 2020
--------------------------------------------------------------------------------
While these concerns are warranted, we feel that there has also been a fair amount of hyperbole involved, which was part of our motivation for writing this report.

Lessons learned: How COVID-19 has had a knock-on effect on our businesses 07 May. 2020
--------------------------------------------------------------------------------
In this final piece, we’ll look at how the impact of this pandemic and our collective response hold valuable lessons for security practitioners.

Video killed the conferencing star 06 May. 2020
--------------------------------------------------------------------------------
Videoconferencing is an essential tool, especially with the COVID-19-lockdown. Zoom, Teams, Webex, Skype: we have checked 10 business solutions for security.

COVID-19: when it’s all over 04 May. 2020
--------------------------------------------------------------------------------
Back to normality: these are the three main things we expect businesses will see when employees make the exodus back to their respective workplaces.

Star Wars Day: Orange Cyberdefense hacks the Death Star 04 May. 2020
--------------------------------------------------------------------------------
Discover our experts’ ploys to hack the galaxy’s most secure datacenter.

COVID-19: responding to the cyber part of the crisis 30 Apr. 2020
--------------------------------------------------------------------------------
We can’t control the threat, but we can control the vulnerability, so we should focus on that. Our guidelines for responding to the cyber crisis.
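One caveat with zip(): it pairs items purely by position and stops at the shortest list, so a single card without a description would shift every later pairing. Note also that `link.attrs` only holds the HTML attributes of the `<a>` tag itself (such as `href`), which is why `link.attrs['time']` in the question cannot work. A minimal sketch of a per-card alternative follows. It assumes that the `time` and `.description` elements are nested inside each `<a>` card; the sample markup and the helper names `text_or_none` and `extract_articles` are hypothetical, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selectors used in the answer above.
# The structure of the real page is an assumption and may differ.
SAMPLE_HTML = """
<section class="section articles">
  <a href="/global/blog/post-1/">
    <p class="card-title">First post</p>
    <time>21 May. 2020</time>
    <p class="description">Summary of the first post.</p>
  </a>
  <a href="/global/blog/post-2/">
    <p class="card-title">Second post</p>
    <time>07 May. 2020</time>
  </a>
</section>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match inside parent, or None."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


def extract_articles(html):
    """Build one record per card so the fields of a card stay together."""
    soup = BeautifulSoup(html, 'html.parser')
    articles = []
    for link in soup.select('section.articles a'):
        articles.append({
            'href': link.get('href'),  # an attribute of the <a> tag
            'title': text_or_none(link, 'p.card-title'),  # child elements
            'date': text_or_none(link, 'time'),
            'description': text_or_none(link, '.description'),
        })
    return articles


articles = extract_articles(SAMPLE_HTML)
for article in articles:
    print(article)
```

With this layout a missing field shows up as None on its own card instead of misaligning the cards after it. To try it on the live page, pass `requests.get(url).content` to `extract_articles` instead of the sample string.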

0
votes
import requests
from bs4 import BeautifulSoup
page = requests.get('https://orangecyberdefense.com/global/blog/')
soup = BeautifulSoup(page.content, 'html.parser')

Hyperlinks = []
Dates = []
Description = []

main_table = soup.find('section', attrs={'class':'section articles'})
links = main_table.find_all(['a'])

for link in links:
    Hyperlinks.append(link.attrs['href'])

We simply use find_all(['time']) to find all the time tags:

# Find all <time> tags and add their text to the Dates list
date_list = main_table.find_all(['time'])
for date in date_list:
    Dates.append(date.get_text())

Output of Dates:

['07 May. 2020',
 '07 May. 2020',
 '06 May. 2020',
 '04 May. 2020',
 '04 May. 2020',
 '30 Apr. 2020']
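The descriptions can probably be collected the same way, and the date strings can be turned into real date objects if you need to sort or filter them. The sketch below is only an illustration: the sample markup is hypothetical, the `description` class name is taken from the selector used in the other answer, and the `'%d %b. %Y'` format is inferred from the output shown above.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is an assumption.
SAMPLE_HTML = """
<section class="section articles">
  <a href="/global/blog/post-1/"><time>07 May. 2020</time>
    <p class="description">Summary of the first post.</p></a>
  <a href="/global/blog/post-2/"><time>30 Apr. 2020</time>
    <p class="description">Summary of the second post.</p></a>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
main_table = soup.find('section', attrs={'class': 'section articles'})

# Collect the text of every <time> tag and of every element with
# class "description" inside the articles section.
Dates = [tag.get_text(strip=True) for tag in main_table.find_all('time')]
Description = [tag.get_text(strip=True)
               for tag in main_table.find_all(class_='description')]

# Parse strings such as '07 May. 2020'. %b matches abbreviated month
# names, which depend on the active locale (English names assumed here).
parsed_dates = [datetime.strptime(text, '%d %b. %Y').date() for text in Dates]

print(Description)
print(parsed_dates)
```

If the site changes its date format, `strptime` raises a ValueError, so wrapping the parse in a try/except may be worthwhile for real runs.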