I have an html_scraper that takes three parameters: a URL, a rate limit, and a target element selector. A for loop calls the scraper function for every URL in url_list. I expect the function to return the scraped content of each page, but it only returns the content of the last URL in the loop. The list that collects each iteration's page_contents lives outside the for loop, so what am I doing wrong here?
from bs4 import BeautifulSoup
import lxml
import requests
import time  # needed for time.sleep below

scraped_html = []

def html_scraper(url, ratelimit=1.0, target_element_selector='example-id-0'):
    page_contents = {
        'pageTitle': None,
        'content': None
    }
    result = requests.get(url).text
    soup = BeautifulSoup(result, 'lxml')
    page_contents['pageTitle'] = soup.find('h1')
    page_contents['content'] = soup.find(id=target_element_selector)
    time.sleep(ratelimit)
    return page_contents
if __name__ == '__main__':
    url_list = [
        'https://example.com/page-1',
        'https://example.com/page-2',
        'https://example.com/page-3',
    ]
    for url in url_list:
        try:
            scraped = html_scraper(url, 0.5, 'example-id-1')
            scraped_html.append(scraped)
        except Exception as e:
            print(e)
    print(scraped_html)
# [{'pageTitle': None, 'content': None}, {'pageTitle': None, 'content': None}, {'pageTitle': Example Page 3 Title, 'content': <div id="example-id-1">Blah-blah-blah-blah-blah</div>}]
Credit for this answer goes to @JohnGordon, who suggested the problem might be caused by the requests module. I replaced requests.get(url).text with urllib.request.urlopen(url), and that solved it. The requests module works when html_scraper is used on a single page, i.e. outside the for loop, but inside the loop it failed to pick up anything except the last iteration. Thanks again, @JohnGordon!
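For reference, the swap described above can be sketched like this. One detail worth noting: unlike requests.get(url).text, urllib.request.urlopen returns a response object rather than a string, so it has to be read and decoded before it is handed to BeautifulSoup. The helper name fetch_html is introduced here for illustration and is not part of the original post:

```python
import urllib.request

def fetch_html(url):
    # urllib.request.urlopen returns an HTTPResponse, not text:
    # read() gives bytes, which we decode into a string that can be
    # passed to BeautifulSoup exactly like requests.get(url).text.
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')
```

With this helper, the line `result = requests.get(url).text` in html_scraper becomes `result = fetch_html(url)`; the rest of the function stays the same.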