Why do I get an empty list when scraping a website?

import requests
from bs4 import BeautifulSoup

url = 'https://inshorts.com/en/read/technology'
news_data = []
news_category = url.split('/')[-1]

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
data = requests.get(url, headers=headers)

if data.status_code == 200:
    soup = BeautifulSoup(data.content, 'html.parser')

    headlines = soup.find('div', class_=['news-card-title', 'news-right-box'])
    articles = soup.find('div', class_=['news-card-content', 'news-right-box'])

    if headlines and articles and len(headlines) == len(articles):
        news_articles = [
            {
                'news_headline': headline.find_all('span', attrs={'itemprop': 'headline'}).string,
                'news_article': article.find_all('div', attrs={'itemprop': 'articleBody'}).string,
                'news_category': news_category
            }
            for headline, article in zip(headlines, articles)
        ]
        news_data.extend(news_articles)

print(news_data)

The code above tries to scrape data from the inshorts website and organize it into three fields, namely

news_headline
news_article
news_category

python web-scraping beautifulsoup python-requests nlp
1 Answer

You get an empty list because your if condition fails: headlines / articles are None, which means your selectors did not find anything.
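To make the difference concrete, here is a minimal sketch of how find() and find_all() behave when a selector does or does not match. The HTML string and the "card" class are made up for illustration and are not the real inshorts markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not the real inshorts page.
html = """
<div class="card"><span itemprop="headline">First</span></div>
<div class="card"><span itemprop="headline">Second</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag, or None when nothing matches.
missing = soup.find("div", class_="news-card-title")
first_card = soup.find("div", class_="card")

# find_all() always returns a list-like ResultSet, which may be empty.
all_cards = soup.find_all("div", class_="card")

print(missing)                 # nothing matched this class
print(first_card.span.string)  # text of the first matching card
print(len(all_cards))          # number of matching cards
```

Because find() gives back at most one element, a check such as `if headlines and articles` is falsy as soon as either lookup returns None, so the list comprehension never runs.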

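Try to select your elements more specifically, and avoid zipping multiple lists; get all the information in one pass instead. I used CSS selectors here, but you could also use find() / find_all().

Select all article elements, iterate over them, and pick out the information for each one: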

...
if data.status_code == 200:
    soup = BeautifulSoup(data.content, 'html.parser')

    for article in soup.select('[itemtype="http://schema.org/NewsArticle"]'):
        news_data.append(
            {
                # search inside the current article, not the whole page
                'news_headline': article.select_one('[itemprop="headline"]').get_text(),
                'news_article': article.select_one('[itemprop="articleBody"]').get_text(),
                'news_category': news_category
            }
        )

print(news_data)
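
If you want to try the per-article approach without hitting the network, the following is a rough sketch that runs against an inline HTML string. The sample markup, the extract_articles helper, and the None fallback for missing elements are my own assumptions for illustration; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics schema.org NewsArticle microdata;
# it is not copied from the real inshorts page.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/NewsArticle">
  <span itemprop="headline">Headline A</span>
  <div itemprop="articleBody">Body A</div>
</div>
<div itemscope itemtype="http://schema.org/NewsArticle">
  <span itemprop="headline">Headline B</span>
</div>
"""


def extract_articles(html, category):
    """Collect one dict per article element (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.select('[itemtype="http://schema.org/NewsArticle"]'):
        # Scope each lookup to the current article element.
        headline = article.select_one('[itemprop="headline"]')
        body = article.select_one('[itemprop="articleBody"]')
        rows.append(
            {
                # select_one() returns None when nothing matches, so guard it.
                "news_headline": headline.get_text(strip=True) if headline else None,
                "news_article": body.get_text(strip=True) if body else None,
                "news_category": category,
            }
        )
    return rows


print(extract_articles(SAMPLE_HTML, "technology"))
```

The guard matters because calling get_text() on None raises an AttributeError, which would stop the loop at the first article that lacks one of the fields.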