The lxml.html parser that Parsel uses "fixes" the HTML code and moves the inner <a> elements outside the outer one. Try specifying type="xml" when instantiating the Selector:
from parsel import Selector
html_text = """
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<a href="#">
<a id="test" href='image1.html'>Name: My image 1 <br /></a>
<a id="test" href='image2.html'>Name: My image 2 <br /></a>
<a id="test" href='image3.html'>Name: My image 3 <br /></a>
<a id="test" href='image4.html'>Name: My image 4 <br /></a>
<a id="test" href='image5.html'>Name: My image 5 <br /></a>
</a>
</body>
</html>
"""
selector = Selector(text=html_text, type="xml")
print(selector.xpath("//a/a"))
Prints:
[
<Selector query='//a/a' data='<a id="test" href="image1.html">Name:...'>,
<Selector query='//a/a' data='<a id="test" href="image2.html">Name:...'>,
<Selector query='//a/a' data='<a id="test" href="image3.html">Name:...'>,
<Selector query='//a/a' data='<a id="test" href="image4.html">Name:...'>,
<Selector query='//a/a' data='<a id="test" href="image5.html">Name:...'>
]