How do I read links from a list with BeautifulSoup?

Question

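I have a list containing a large number of links, and I want to scrape them with BeautifulSoup in Python 3.

links is my list, and it contains hundreds of URLs. I tried the code below to scrape all of them, but for some reason it doesn't work.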

links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html',...]

raw = urlopen(i in links).read()
ufos_doc = BeautifulSoup(raw, "html.parser")

Answer 1

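raw should be a list containing the data of each web page. For each entry in raw, parse it and create a soup object. You can store each soup object in a list (I called it soups):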

from urllib.request import urlopen
from bs4 import BeautifulSoup

links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html']

raw = [urlopen(i).read() for i in links]
soups = []
for page in raw:
    soups.append(BeautifulSoup(page,'html.parser'))

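You can then access, for example, the soup object for the first link with soups[0].

Also, to fetch the response for each URL, consider using the requests module instead of urllib. See this post.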
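As a side note on why the original line fails, here is a minimal sketch. It uses a shortened, two-entry version of links purely for illustration. In the question, i is never defined, so `urlopen(i in links)` raises a NameError. Even if i were defined, `i in links` is a membership test that evaluates to a bool, which is not a URL.

```python
links = [
    'http://www.nuforc.org/webreports/ndxe201904.html',
    'http://www.nuforc.org/webreports/ndxe201903.html',
]

i = links[0]

# `i in links` asks "is i a member of links?" and yields True or False.
argument = i in links
print(type(argument).__name__, argument)

# A comprehension (or a for loop) visits each URL in turn instead,
# which is what the corrected code does with urlopen(i).
visited = [url for url in links]
print(visited == links)
```

That is why the loop has to sit around the call to urlopen, not inside its argument.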

Answer 2

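You need to loop over the list links. If you have a lot of these to fetch, then, as the other answer mentions, consider requests. With requests you can create a Session object, which lets you reuse the connection and so scrape more efficiently.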

import requests
from bs4 import BeautifulSoup as bs

links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html']

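# Session must be instantiated: requests.Session(), not requests.Session
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        # the 'lxml' parser requires the lxml package to be installed
        soup = bs(r.content, 'lxml')
        # do something with soup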
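
One possible way to fill in the "do something" step is to keep each soup keyed by its URL and skip pages that fail to load. This is only a sketch under some assumptions: the helper name fetch_soups, the timeout of 10 seconds, and the choice of the built-in 'html.parser' (so lxml is not needed) are mine and not part of the answer above.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soups(links, session, timeout=10):
    """Return a dict mapping each URL to its parsed BeautifulSoup object.

    fetch_soups is a hypothetical helper name; `session` is expected to be
    a requests.Session (or anything with a compatible .get method).
    """
    soups = {}
    for link in links:
        try:
            r = session.get(link, timeout=timeout)
            r.raise_for_status()  # raise on HTTP 4xx/5xx responses
        except requests.RequestException as exc:
            # Skip pages that could not be fetched instead of aborting.
            print(f"skipping {link}: {exc}")
            continue
        soups[link] = BeautifulSoup(r.content, 'html.parser')
    return soups
```

You would call it as `soups = fetch_soups(links, s)` inside the `with requests.Session() as s:` block, and then look up a page with, for example, `soups[links[0]]`.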
