Parsel cannot access nested elements

python beautifulsoup scrapy lxml parsel

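The lxml.html parser that Parsel uses by default "fixes" the HTML and moves the inner <a> elements outside the outer one, because HTML does not allow an <a> nested inside another <a>. Try specifying type="xml" when instantiating the Selector: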

from parsel import Selector

html_text = """
<html>
    <head>
    <base href='http://example.com/' />
    <title>Example website</title>
    </head>
    <body>
    <a href="#">
        <a id="test" href='image1.html'>Name: My image 1 <br /></a>
        <a id="test" href='image2.html'>Name: My image 2 <br /></a>
        <a id="test" href='image3.html'>Name: My image 3 <br /></a>
        <a id="test" href='image4.html'>Name: My image 4 <br /></a>
        <a id="test" href='image5.html'>Name: My image 5 <br /></a>
    </a>
    </body>
    </html>
"""

selector = Selector(text=html_text, type="xml")
print(selector.xpath("//a/a"))

Prints:

[
 <Selector query='//a/a' data='<a id="test" href="image1.html">Name:...'>,
 <Selector query='//a/a' data='<a id="test" href="image2.html">Name:...'>,
 <Selector query='//a/a' data='<a id="test" href="image3.html">Name:...'>,
 <Selector query='//a/a' data='<a id="test" href="image4.html">Name:...'>,
 <Selector query='//a/a' data='<a id="test" href="image5.html">Name:...'>
]