How to display the title, header, and paragraph in the same order as the original web page? - Python

Question

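I am parsing a Wikipedia page. I want to search for a key phrase, for example "The first abstraction", and display the matching title, header, and paragraph. How can I do that?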

Web :https://en.wikipedia.org/wiki/Mathematics
Search = "The first abstraction"
Output :
       title: Mathematics
       header: History
       paragraph : The history of mathematics can be seen as an ever-increasing series of   
                   abstractions. **The first abstraction**, which is shared by many animals,[14] was 
                   probably that of numbers: the realization that a collection of two apples and a            
                   collection of two oranges (for example) have something in common, namely quantity 
                   of their members. 
import bs4
import requests

response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

if response is not None:
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    # page title, e.g. "Mathematics"
    title = html.select("#firstHeading")[0].text
    print(title)

    # every <p> on the page, in document order
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)

This code displays the title fine, but the headers and paragraphs are not kept in order, so I can't match them up. Thanks.

Answer 1

First, while you iterate over the <p> tags, you need to search for "The first abstraction", because you only want the paragraph that contains "The first abstraction".

So call the find() method on the text of 'para' to check whether the desired text is present:

paragraphs = html.select("p")

Search = "The first abstraction" # expected text

for para in paragraphs:
    px = para.text
    if px.find(Search)>-1:
        print (para.text)

This gives you the expected paragraph:

The history of mathematics can be seen as an ever-increasing series of abstractions. The first abstraction, which is shared by many animals,[14] was probably that of numbers: the realization that a collection of two apples and a collection of two oranges (for example) have something in common, namely quantity of their members.
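
The same check can also be written with Python's `in` operator and wrapped in a small helper. The sketch below is only an illustration: `find_paragraphs` and the inline `SAMPLE_HTML` are made-up names and data, not part of the original answer, and it runs on a tiny HTML string rather than the live Wikipedia page.

```python
import bs4

# Hypothetical sample standing in for the real page.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<h2>History</h2>
<p>The history of mathematics can be seen as a series of abstractions.
The first abstraction was probably that of numbers.</p>
<h2>Etymology</h2>
<p>The word mathematics comes from Ancient Greek.</p>
"""


def find_paragraphs(html_text, phrase):
    """Return the text of every <p> that contains `phrase`, in page order."""
    soup = bs4.BeautifulSoup(html_text, "html.parser")
    return [p.get_text() for p in soup.select("p") if phrase in p.get_text()]


matches = find_paragraphs(SAMPLE_HTML, "The first abstraction")
for text in matches:
    print(text)
```

Because `select("p")` returns tags in document order, the matches come back in the same order as they appear on the page.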

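So now the paragraph and the title are done. You still need to extract the header. Look closely at the HTML structure of the page you are parsing (this always helps).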

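In the page's HTML, the h2 is a sibling of the p tag (where your text is found). Sibling navigation is covered in the Beautiful Soup documentation.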

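So to walk back to an earlier sibling, you call "previous_sibling" on the p tag.

The h2 (with the "History" header) is two sibling tags before the p. Whitespace text between tags normally counts as a sibling node too, so reaching the h2 takes four "previous_sibling" calls rather than two: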

paragraphs = html.select("p")
for para in paragraphs:
    px = para.text
    if px.find(Search)>-1:
        print (para.text)
        print(para.previous_sibling.previous_sibling.previous_sibling.previous_sibling.text)

This will print:

History
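
Counting `previous_sibling` hops is fragile, because the whitespace between tags is itself a sibling node and the number of hops changes with the page markup. As a hedged alternative, Beautiful Soup's `find_previous` searches backwards in document order for the nearest matching tag. The sketch below assumes a simplified inline page (`SAMPLE_HTML`) and a hypothetical helper `find_with_header`; the real Wikipedia markup may nest headings differently, so treat it as a starting point rather than a drop-in solution.

```python
import bs4

# Hypothetical, simplified stand-in for the real page: the <h2> sits two
# tags before the matching <p>, as described in the answer.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<h2>History</h2>
<div class="note">Main article: History of mathematics</div>
<p>The first abstraction was probably that of numbers.</p>
<h2>Etymology</h2>
<p>The word mathematics comes from Ancient Greek.</p>
"""


def find_with_header(html_text, phrase):
    """Return title, nearest preceding <h2>, and text for each matching <p>."""
    soup = bs4.BeautifulSoup(html_text, "html.parser")
    title = soup.select("#firstHeading")[0].get_text()
    results = []
    for para in soup.select("p"):
        if phrase in para.get_text():
            # Nearest <h2> that appears before this paragraph in the document.
            header = para.find_previous("h2")
            results.append({
                "title": title,
                "header": header.get_text() if header else None,
                "paragraph": para.get_text(),
            })
    return results


for hit in find_with_header(SAMPLE_HTML, "The first abstraction"):
    print("title:", hit["title"])
    print("header:", hit["header"])
    print("paragraph:", hit["paragraph"])

# Why the chained version needs four hops here: the immediate previous
# sibling of the <p> is a whitespace text node, not a tag.
soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
first_p = soup.select("p")[0]
print(repr(first_p.previous_sibling))
```

With `find_previous`, the lookup keeps working if extra elements are inserted between the header and the paragraph, which is the main reason to prefer it over a fixed chain of sibling hops.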
