How do I display the title, header and paragraph in the same order as the original web page? - Python

Question    Votes: 0    Answers: 1

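I am parsing a Wikipedia page. I want to search for a key phrase, for example "The first abstraction", and display the matching title, header and paragraph. How can I do that?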

Web: https://en.wikipedia.org/wiki/Mathematics
Search = "The first abstraction"
Output:
       title: Mathematics
       header: History
       paragraph: The history of mathematics can be seen as an ever-increasing series of
                  abstractions. **The first abstraction**, which is shared by many animals,[14] was
                  probably that of numbers: the realization that a collection of two apples and a
                  collection of two oranges (for example) have something in common, namely quantity
                  of their members.

import bs4
import requests

response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

if response is not None:
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    # page title
    title = html.select("#firstHeading")[0].text
    print(title)

    # every paragraph on the page, in document order
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)

This code displays the title fine, but the headers and paragraphs are not kept together in order, so I can't match them up. Thx

python beautifulsoup wikipedia
1 Answer
0 votes

First, while you iterate over the <p> tags, you need to search for "The first abstraction", because you only want the paragraph that contains "The first abstraction".

So call the string find() method on the text of 'para' to check whether the desired text is present:

paragraphs = html.select("p")

Search = "The first abstraction" # expected text

for para in paragraphs:
    px = para.text
    if px.find(Search)>-1:
        print (para.text)

This gives you the expected paragraph:

The history of mathematics can be seen as an ever-increasing series of abstractions. The first abstraction, which is shared by many animals,[14] was probably that of numbers: the realization that a collection of two apples and a collection of two oranges (for example) have something in common, namely quantity of their members.
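
The filtering step above can also be sketched as a small reusable function. This is only a minimal sketch: SAMPLE_HTML is a hypothetical, shortened stand-in for the downloaded page so that it runs without network access, and find_paragraphs is an illustrative name, not something from the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded Wikipedia page (assumption),
# so the sketch runs offline.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<h2>History</h2>
<p>The history of mathematics can be seen as an ever-increasing series of abstractions.
The first abstraction was probably that of numbers.</p>
<h2>Notation</h2>
<p>Mathematical notation is widely used in science.</p>
"""


def find_paragraphs(soup, phrase):
    """Return the <p> tags whose text contains `phrase`, in document order."""
    return [p for p in soup.select("p") if phrase in p.get_text()]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = find_paragraphs(soup, "The first abstraction")
for p in matches:
    print(p.get_text())
```

The `in` operator does the same job as `find(...) > -1` here. Returning the tags rather than their text keeps each match available for further navigation, such as looking up its header.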

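So now the paragraph and the title are done. You still need to extract the header. Look closely at the HTML structure of the page you are parsing (this always helps).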

In the image below, the h2 tag is a sibling of the p tag (the one where your text was found). Read more about siblings here.

[image: HTML of the page, showing the h2 tag as a sibling of the p tag]

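To step back to the previous sibling tag, you should call "previous_sibling" twice on the p tag, because the whitespace between two tags also counts as a sibling node.

Since h2 is two sibling tags before p, you can reach h2 (which holds the "History" header) with four calls in total:
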
paragraphs = html.select("p")
for para in paragraphs:
    px = para.text
    if px.find(Search)>-1:
        print (para.text)
        print(para.previous_sibling.previous_sibling.previous_sibling.previous_sibling.text)

This prints:

History
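
Chaining a fixed number of previous_sibling calls only works while the page layout stays exactly the same. As a hedged alternative, the sketch below uses BeautifulSoup's find_previous("h2"), which searches backwards through the document from the matched paragraph. SAMPLE_HTML is again a hypothetical stand-in for the real page, and search_page is an illustrative name; the markup of the live page may differ, so treat this as a starting point and not as the answer's original method.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (assumption). The extra <div>
# shows that the number of siblings between <h2> and <p> does not matter here.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<h2>History</h2>
<div class="note">Main article: History of mathematics</div>
<p>The first abstraction was probably that of numbers.</p>
<h2>Notation</h2>
<p>Mathematical notation is widely used in science.</p>
"""


def search_page(soup, phrase):
    """Yield (title, header, paragraph) for each <p> containing `phrase`."""
    title = soup.select_one("#firstHeading").get_text(strip=True)
    for p in soup.select("p"):
        text = p.get_text()
        if phrase in text:
            # find_previous walks backwards through the document, so whitespace
            # nodes and unrelated tags in between are skipped automatically.
            h2 = p.find_previous("h2")
            header = h2.get_text(strip=True) if h2 else None
            yield title, header, text.strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = list(search_page(soup, "The first abstraction"))
for title, header, paragraph in results:
    print("title:", title)
    print("header:", header)
    print("paragraph:", paragraph)
```

A paragraph that appears before any h2 gets None as its header, so the caller can decide how to label it.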