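I am parsing a Wikipedia page. I want to search for a keyword, for example "The first abstraction", and display the matching title, header and paragraph. How can I do that?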
Web: https://en.wikipedia.org/wiki/Mathematics
Search = "The first abstraction"
Output:
title: Mathematics
header: History
paragraph: The history of mathematics can be seen as an ever-increasing series of
abstractions. **The first abstraction**, which is shared by many animals,[14] was
probably that of numbers: the realization that a collection of two apples and a
collection of two oranges (for example) have something in common, namely quantity
of their members.
    import bs4
    import requests

    response = requests.get("https://en.wikipedia.org/wiki/Mathematics")
    if response is not None:
        html = bs4.BeautifulSoup(response.text, 'html.parser')
        # Page title
        title = html.select("#firstHeading")[0].text
        print(title)
        # Every paragraph on the page, in document order
        paragraphs = html.select("p")
        for para in paragraphs:
            print(para.text)
        # just grab the text up to contents as stated in question
        intro = '\n'.join([para.text for para in paragraphs[0:5]])
        print(intro)
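This code prints the title fine, but the headers and paragraphs are not kept together in order, so I cannot match a paragraph to its header. Thanks.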
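First, while you iterate over the <p> tags, you need to search for "The first abstraction", because you only want the paragraph that contains that text.
So call the find() method on the text of 'para' to check whether the required text is present: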
    paragraphs = html.select("p")
    Search = "The first abstraction"  # expected text
    for para in paragraphs:
        px = para.text
        # str.find returns -1 when the text is absent
        if px.find(Search) > -1:
            print(para.text)
This gives you the expected paragraph:
The history of mathematics can be seen as an ever-increasing series of abstractions. The first abstraction, which is shared by many animals,[14] was probably that of numbers: the realization that a collection of two apples and a collection of two oranges (for example) have something in common, namely quantity of their members.
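The same paragraph match can be tried offline. The following is only a minimal sketch: `SAMPLE_HTML` and `find_paragraphs` are hypothetical names made up for illustration, and the markup is a simplified stand-in for the real page. It uses the `in` operator instead of `find()` and collapses whitespace first, on the assumption that a phrase may be split across line breaks in the HTML.

```python
import bs4

# Hypothetical stand-in for the downloaded page, so the sketch runs without network access.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<h2>History</h2>
<p>The history of mathematics can be seen as an ever-increasing series of
abstractions. <b>The first abstraction</b>, which is shared by many animals, was
probably that of numbers.</p>
<p>Another paragraph that does not mention the phrase.</p>
"""


def find_paragraphs(html_text, search):
    """Return the text of every <p> whose text contains `search`."""
    soup = bs4.BeautifulSoup(html_text, "html.parser")
    matches = []
    for para in soup.select("p"):
        # Collapse line breaks and repeated spaces so a phrase
        # that wraps across lines in the HTML still matches.
        text = " ".join(para.get_text().split())
        if search in text:
            matches.append(text)
    return matches


matches = find_paragraphs(SAMPLE_HTML, "The first abstraction")
for text in matches:
    print(text)
```

Returning a list rather than printing inside the loop makes it easier to pair each paragraph with its header later.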
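So now the paragraph and title are done. You still need to extract the header. Look closely at the HTML structure of the page you are parsing (this always helps).
In the page's HTML, the h2 is a sibling of the p tag in which your text was found. The Beautiful Soup documentation explains sibling navigation.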
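As a small illustration of what "sibling" means here, the sketch below parses a hypothetical fragment (`SAMPLE_HTML` is invented for this example and is much simpler than the real page). It assumes the `html.parser` backend, which keeps the newline between two tags as a text node of its own.

```python
import bs4

# Hypothetical fragment; the newlines between the tags matter for this demo.
SAMPLE_HTML = "<div><h2>History</h2>\n<div>note</div>\n<p>The first abstraction</p></div>"

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
para = soup.find("p")

# The node immediately before <p> is the newline text node, not a tag.
nearest = para.previous_sibling
print(repr(nearest))

# Stepping back tag by tag therefore takes two hops per tag.
two_tags_back = para.previous_sibling.previous_sibling.previous_sibling.previous_sibling
print(two_tags_back.name)

# find_previous_sibling skips text nodes and non-matching tags.
heading = para.find_previous_sibling("h2")
print(heading.get_text())
```

`find_previous_sibling` avoids counting hops by hand, at the cost of only looking among nodes that share the same parent.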
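To step back one tag, call previous_sibling twice on the p tag, because the whitespace between two tags is itself a sibling text node.
Since the h2 is two tags before the p, that makes four calls in total to reach the h2 (which holds the "History" header):
    paragraphs = html.select("p")
    for para in paragraphs:
        px = para.text
        if px.find(Search) > -1:
            print(para.text)
            # p -> whitespace -> tag -> whitespace -> h2
            print(para.previous_sibling.previous_sibling.previous_sibling.previous_sibling.text)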
This prints:
History
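Counting previous_sibling hops depends on the exact markup around the paragraph, so it may break if the heading sits inside a wrapper element or if more elements appear in between. A possible alternative, shown here only as a sketch, is Beautiful Soup's `find_previous`, which walks backwards through the document regardless of nesting. `SAMPLE_HTML`, the class names in it, and `search_page` are hypothetical and chosen for illustration; treating h2 and h3 as the only section headers is also an assumption.

```python
import bs4

# Hypothetical markup: headings wrapped in a <div>, plus an extra element
# between the heading and the paragraph.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<div class="mw-heading"><h2>Etymology</h2></div>
<p>The word mathematics comes from Greek.</p>
<div class="mw-heading"><h2>History</h2></div>
<div class="hatnote">Main article: History of mathematics</div>
<p>The history of mathematics can be seen as an ever-increasing series of
abstractions. The first abstraction, which is shared by many animals, was
probably that of numbers.</p>
"""


def search_page(html_text, search):
    """Return title, nearest preceding header and text for each matching <p>."""
    soup = bs4.BeautifulSoup(html_text, "html.parser")
    title_tag = soup.select_one("#firstHeading")
    title = title_tag.get_text(strip=True) if title_tag else None

    results = []
    for para in soup.find_all("p"):
        text = " ".join(para.get_text().split())
        if search not in text:
            continue
        # Closest h2 or h3 that appears earlier in the document, if any.
        heading = para.find_previous(["h2", "h3"])
        header = heading.get_text(strip=True) if heading else None
        results.append({"title": title, "header": header, "paragraph": text})
    return results


results = search_page(SAMPLE_HTML, "The first abstraction")
for hit in results:
    print("title:", hit["title"])
    print("header:", hit["header"])
    print("paragraph:", hit["paragraph"])
```

A paragraph that comes before the first section heading gets `None` as its header in this sketch, since no h2 or h3 precedes it.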