I have an html_scraper that takes three parameters: a URL, a rate limit, and a target element selector. A for loop calls the scraper function for every URL in url_list. I expect the function to return the scraped content of each page, but it only returns the content of the last URL in the loop. The list that collects each iteration's page_contents lives outside the for loop, so what am I doing wrong here?
from bs4 import BeautifulSoup
import lxml
import requests
import time  # needed for time.sleep below

scraped_html = []

def html_scraper(url, ratelimit=1.0, target_element_selector='example-id-0'):
    page_contents = {
        'pageTitle': None,
        'content': None
    }
    result = requests.get(url).text
    soup = BeautifulSoup(result, 'lxml')
    page_contents['pageTitle'] = soup.find('h1')
    page_contents['content'] = soup.find(id=target_element_selector)
    time.sleep(ratelimit)
    return page_contents
if __name__ == '__main__':
    url_list = [
        'https://example.com/page-1',
        'https://example.com/page-2',
        'https://example.com/page-3',
    ]
    for url in url_list:
        try:
            scraped = html_scraper(url, 0.5, 'example-id-1')
            scraped_html.append(scraped)
        except Exception as e:
            print(e)
    print(scraped_html)
# [{'pageTitle': None, 'content': None}, {'pageTitle': None, 'content': None}, {'pageTitle': Example Page 3 Title, 'content': <div id="example-id-1">Blah-blah-blah-blah-blah</div>}]
Credit for this answer goes to @JohnGordon, who suggested the problem might be caused by the requests module. I replaced requests.get(url).text with urllib.request.urlopen(url), and that solved it. The requests module works when html_scraper is used on a single page, i.e. outside the for loop, but inside the loop it failed to pick up anything except the last iteration. Thanks again, @JohnGordon!
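For reference, the swap described above can be sketched like this. One detail worth noting: unlike requests.get(url).text, urllib.request.urlopen returns a response object rather than a string, so it has to be read and decoded before it is handed to BeautifulSoup. The helper name fetch_html is introduced here for illustration and is not part of the original post:

```python
import urllib.request

def fetch_html(url):
    # urllib.request.urlopen returns an HTTPResponse, not text:
    # read() gives bytes, which we decode into a string that can be
    # passed to BeautifulSoup exactly like requests.get(url).text.
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')
```

With this helper, the line `result = requests.get(url).text` in html_scraper becomes `result = fetch_html(url)`; the rest of the function stays the same.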