BeautifulSoup: find all content in the tags between two tags

I have a web page to parse, with the following source:

    <div class=WordSection1>
    <p class=MsoTitle><a name="_nvfbzzqeywr7"></a><span lang=EN>Geometry</span></p>
    <h2><a name="_99n9742wmg4y"></a><span lang=EN>Algebraic Geometry </span></h2>
    <p class=MsoNormal><span lang=EN>It is a type of geometry which deals with zeros of multivariate polynomial. It consists of linear and polynomial algebraic equations used to solve the sets of zeros. The uses of this type <span class=GramE>consists</span> of Cryptography, String theory, etc.</span></p>
    <h2><a name="_64xtqrllvykm"></a><span lang=EN>Discrete Geometry</span></h2>
    <p class=MsoNormal><span lang=EN>It is a type of Geometry, mainly concerned with the relative position of simple geometric objects, such as points, lines, Triangles, Circles, etc.</span></p>
    <h2><a name="_mdul98ybu9wv"></a><span lang=EN>Differential Geometry</span></h2>
    <p class=MsoNormal><span lang=EN>It uses techniques of algebra and calculus for problem solving. The different problem involves general relativity in physics <span class=SpellE><span class=GramE>etc</span></span><span class=GramE>,.</span></span></p>
    </div>

I want to parse the inner text into one list and the corresponding headings into another list, so that I can later map them to each other as a Python dictionary. I also have to ignore other tags, such as h1 and h3.

Expected result:

    headers = ['Algebraic Geometry', 'Discrete Geometry', 'Differential Geometry']
    content = ['It is a type of geometry which deals with zeros of multivariate polynomial. It consists of linear and polynomial algebraic equations used to solve the sets of zeros. The uses of this type consists of Cryptography, String theory, etc.',
               'It is a type of Geometry, mainly concerned with the relative position of simple geometric objects, such as points, lines, Triangles, Circles, etc.',
               'It uses techniques of algebra and calculus for problem solving. The different problem involves general relativity in physics etc,.']

I am able to get all the headers, but I cannot get the tags between the headers. Here is what I tried:

    content = []
    # find the node with id of "WordSection1"
    mark = soup.find_all(class_="WordSection1")
    print(mark)
    # walk through the siblings of the parent (H2) node
    # until we reach the next H2 node
    for elt in mark.parent.nextSiblingGenerator():
        if elt.name == "h2":
            break
        if hasattr(elt, "text"):
            content.append(elt.text)

Please help me with this.

python python-3.x web-scraping beautifulsoup
2 answers
0 votes
This should do what you need.
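The code that went with this answer is not preserved on this page. What follows is a minimal sketch of one way to get the expected lists, not the answerer's original code. It assumes every section is an `h2` followed by sibling `p` tags inside the same `div`, as in the question's markup. The helper name `parse_sections` and the variable `HTML` are made up for illustration, and the paragraph text is abbreviated from the question.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's markup (first paragraph shortened).
HTML = """
<div class=WordSection1>
<p class=MsoTitle><a name="_nvfbzzqeywr7"></a><span lang=EN>Geometry</span></p>
<h2><a name="_99n9742wmg4y"></a><span lang=EN>Algebraic Geometry </span></h2>
<p class=MsoNormal><span lang=EN>The uses of this type <span class=GramE>consists</span> of Cryptography, String theory, etc.</span></p>
<h2><a name="_64xtqrllvykm"></a><span lang=EN>Discrete Geometry</span></h2>
<p class=MsoNormal><span lang=EN>It is a type of Geometry, mainly concerned with the relative position of simple geometric objects, such as points, lines, Triangles, Circles, etc.</span></p>
<h2><a name="_mdul98ybu9wv"></a><span lang=EN>Differential Geometry</span></h2>
<p class=MsoNormal><span lang=EN>It uses techniques of algebra and calculus for problem solving.</span></p>
</div>
"""


def clean(tag):
    """Return the tag's text with runs of whitespace collapsed to one space."""
    return " ".join(tag.get_text().split())


def parse_sections(html):
    """Hypothetical helper: pair each <h2> with the <p> text that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    headers, content = [], []
    for h2 in soup.find_all("h2"):
        parts = []
        # Walk the tags after this heading until the next <h2>.
        for sibling in h2.find_next_siblings():
            if sibling.name == "h2":
                break
            # Only <p> is collected, so h1, h3 and other tags are ignored.
            if sibling.name == "p":
                parts.append(clean(sibling))
        headers.append(clean(h2))
        content.append(" ".join(parts))
    return headers, content


headers, content = parse_sections(HTML)
sections = dict(zip(headers, content))  # heading -> paragraph text
print(headers)
print(sections["Discrete Geometry"])
```

The attempt in the question fails mainly because `find_all` returns a list of results, which has no `.parent`, and because the walk starts from the enclosing `div` rather than from each `h2`. Starting the sibling walk at every heading avoids both problems.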

0 votes
I tried to express this in Web Scraping Language using XPath:

GOTO targeturl.com EXTRACT {'header': '//h2', 'desc': '//h2/following-sibling::p[@class="MsoNormal"][1]'}
