How can I extract only certain parts of an article's body text?


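In my text_scraper(page_soup) function, I noticed that the output ends with extra text that has nothing to do with the article. What is a general way to get rid of this irrelevant text?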

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import re


# Initializing our dictionary
dictionary = {}

# Initializing our url key
url_key = 'url'
dictionary.setdefault(url_key, [])

# Initializing our text key
text_key = 'text'
dictionary.setdefault(text_key, [])

def text_scraper(page_soup):
    text_body = ''
    # Concatenate the text of every <p> tag on the page; this also picks up text that is unrelated to the article
    for p in page_soup.find_all('p'):
        text_body += p.text
    return text_body

def article_scraper(url):
    # Opening up the connection, grabbing the page
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()

    # HTML parsing
    page_soup = soup(page_html, "html.parser")

    dictionary['url'].append(url)
    # Append rather than assign, so 'text' stays a list parallel to 'url'
    dictionary['text'].append(text_scraper(page_soup))
    return dictionary

articles_zero = 'https://www.sfchronicle.com/news/bayarea/heatherknight/article/Special-education-teacher-a-prime-example-of-13560483.php'
article = article_scraper(articles_zero)
article
python-3.x beautifulsoup html-parsing
1 Answer

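If you only want the text related to the article, you can narrow the search in your text_scraper function and collect only the <p> tags inside the article's <section> tag: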

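def text_scraper(page_soup):
    text_body = ''
    # Find only the section that holds the article text
    article_section = page_soup.find('section', {'class': 'body'})
    if article_section is None:
        # No matching section on this page, so there is nothing to collect
        return text_body
    # Collect the <p> tags in that section, skipping any whose previous sibling is an <em> tag
    for p in article_section.find_all('p'):
        # Compare with != rather than "is not": identity checks against string literals are unreliable
        if p.previous_sibling and p.previous_sibling.name != 'em':
            text_body += p.text
    return text_body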

The article text is then returned without the footer text (Heather Knight is a columnist [...] and their struggles.)
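To see the effect of scoping the search without fetching the live page, here is a minimal sketch that runs against a small inline HTML string. The markup and the helper name scoped_paragraphs are made up for illustration. The only thing taken from the answer is the assumption that the article body sits in a <section class="body"> element, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the <section class="body"> layout comes from the answer above.
SAMPLE_HTML = """
<html><body>
  <p>Subscribe to our newsletter</p>
  <section class="body">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </section>
  <footer><p>Terms of use</p></footer>
</body></html>
"""

def scoped_paragraphs(html):
    """Return the text of each <p> tag found inside <section class="body">."""
    page_soup = BeautifulSoup(html, "html.parser")
    article_section = page_soup.find("section", {"class": "body"})
    if article_section is None:
        # The page does not have the expected container
        return []
    return [p.get_text(strip=True) for p in article_section.find_all("p")]

paragraphs = scoped_paragraphs(SAMPLE_HTML)
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
```

The newsletter and footer paragraphs are skipped because they sit outside the section that was searched.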

Edit: added a check on the element preceding each paragraph, to skip the final part (San Francisco Chronicle [...] Twitter: @hknightsf)
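For pages where no single container holds the article, one possible fallback is to delete elements that rarely contain article text before collecting the paragraphs. This is only a sketch under assumptions: the tag list NOISE_TAGS, the helper name strip_noise, and the sample markup are all hypothetical and were not derived from the sfchronicle.com page.

```python
from bs4 import BeautifulSoup

# Assumed list of tags that usually hold navigation or boilerplate; adjust it per site.
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]

def strip_noise(html, noise_tags=NOISE_TAGS):
    """Remove the noise tags from the page, then join the text of the remaining <p> tags."""
    page_soup = BeautifulSoup(html, "html.parser")
    # Search again after each removal so nested noise tags are never touched twice
    tag = page_soup.find(noise_tags)
    while tag is not None:
        tag.decompose()
        tag = page_soup.find(noise_tags)
    return "\n".join(p.get_text(strip=True) for p in page_soup.find_all("p"))

# Hypothetical page used only to exercise the function
SAMPLE_PAGE = """
<html><body>
  <nav><p>Home | News</p></nav>
  <p>Article paragraph one.</p>
  <p>Article paragraph two.</p>
  <footer><p>All rights reserved.</p></footer>
</body></html>
"""

cleaned = strip_noise(SAMPLE_PAGE)
print(cleaned)
```

This approach is coarser than targeting the article's own container, so it works best as a second step when a site-specific selector is not available.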
