Extracting the text of each h4, the text associated with it, and the related links (with XPath)


I want to extract the titles, the link texts, and the links from the string below. The Python script looks like this:

from lxml.html import fromstring

url='''
<div class="topLinks">
<div class="hd left">
</div><div class="hd-middle middle">

        <h4>TTTTTTTTTTTTT</h4></div><div class="hd right"></div><div class="boxMiddle"><ul><li><a href="FullStory.aspx?gid=4&id=6516" title="1399/03/18" target="_blank">PPPPPPPPPPPPPPP<img class="new" src="images/new.png"></a></li><li><a href="http://register1.sanjesh.org/fanni99up" title="1399/03/11" target="_blank">CCCCCCCCCCCCC</a></li><li><a href="http://www6.sanjesh.org/download/fani99/FaniNote99.pdf" title="1399/03/11" target="_blank"> ZZZZZZZZ </a></li><li><a href="FullStory.aspx?gid=4&id=6509" title="1399/03/11" target="_blank">FFFFFF</a></li><li><a href="FullStory.aspx?gid=4&id=6498" title="1399/02/21" target="_blank">XXXXXXXXXXXXXX </a></li></ul></div><div class="boxBottom"></div></div>


<div class="topLinks"><div class="hd left_alter"></div><div class="hd-middle middle_alter">

<h4>CCCCCCCCCCCC</h4></div><div class="hd right_alter"></div><div class="boxMiddle_alter"><ul><li><a href="http://register1.sanjesh.org/rgempiactax99/" title="1399/03/18" target="_blank">GGGGGGGGGGGGGGGG <img class="new" src="images/new.png"></a></li><li><a href="FullStory.aspx?gid=11&id=6515" title="1399/03/18" target="_blank">FFFFFFFFF<img class="new" src="images/new.png"></a></li><li><a href="http://register2.sanjesh.org/RGKhanevadehConsult/" title="1399/03/12" target="_blank">HHHHHHHHH</a></li><li><a href="FullStory.aspx?gid=11&id=6512" title="1399/03/12" target="_blank">FFFFFFFF</a></li><li><a href="FullStory.aspx?gid=11&id=6505" title="1399/02/24" target="_blank">NNNNNNNNNNNNNNNNNNNNNNNNNN</a></li><li><a href="http://dl.sanjesh.org/NOETDownload/DownloadHandler.ashx?id=1271" title="1398/12/12" target="_blank">OOOOOOOOOOOO</a></li><li><a href="FullStory.aspx?gid=11&id=6480" title="1399/01/26" target="_blank">JJJJJJJ</a></li></ul></div><div class="boxBottom_alter"></div></div>

'''  

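tree = fromstring(url)

# Heading text of every block
titrs = tree.xpath("//div[@class='topLinks']//h4/text()")
for titr in titrs:
    print(titr)

# Link text of every block (flat list, not grouped by heading)
texts = tree.xpath("//div[@class='topLinks']//a/text()")
for text in texts:
    print(text)

# Link targets of every block (flat list, not grouped by heading)
links = tree.xpath("//div[@class='topLinks']//a/@href")
for link in links:
    print(link)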

The sample output is:

python xpath href
1 Answer

Strictly speaking, you need the following XPath expressions. Example for the h4 whose text is "TTTTTTTTTTTTT":

To retrieve the text:

//h4[.="TTTTTTTTTTTTT"]/following::div[@class="boxMiddle"]//text()

To retrieve the links:

//h4[.="TTTTTTTTTTTTT"]/following::div[@class="boxMiddle"]//@href
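
As a minimal sketch, the two expressions above can be run through lxml as follows. The `HTML` string is a trimmed copy of the first block from the question (only two links kept), and the names `HTML`, `base`, `texts`, and `links` are illustrative choices rather than part of the original answer. Because `//text()` may also return whitespace-only nodes between tags, the sketch filters those out.

```python
from lxml.html import fromstring

# Trimmed copy of the first block from the question (assumption: two links are enough to illustrate).
HTML = '''
<div class="topLinks"><div class="hd-middle middle"><h4>TTTTTTTTTTTTT</h4></div>
<div class="boxMiddle"><ul>
<li><a href="FullStory.aspx?gid=4&id=6516">PPPPPPPPPPPPPPP<img class="new" src="images/new.png"></a></li>
<li><a href="http://register1.sanjesh.org/fanni99up">CCCCCCCCCCCCC</a></li>
</ul></div></div>
'''

tree = fromstring(HTML)

# Shared prefix of both expressions: the "boxMiddle" div that follows the chosen heading.
base = '//h4[.="TTTTTTTTTTTTT"]/following::div[@class="boxMiddle"]'

# Keep only text nodes that contain something other than whitespace.
texts = [t.strip() for t in tree.xpath(base + "//text()") if t.strip()]
# Attribute results come back as plain strings.
links = tree.xpath(base + "//@href")

print(texts)
print(links)
```

Both lists come back in document order, so `zip(texts, links)` pairs each link text with its target as long as every `<a>` holds exactly one non-empty text node.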

As a one-liner:

(//text()[normalize-space()]|//@href)[preceding::h4[1][.="TTTTTTTTTTTTT"]]
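
The expressions above hard-code a single heading, and the second block in the question uses the class `boxMiddle_alter` rather than `boxMiddle`. If the goal is to collect every heading together with its own links, one possible approach (a sketch, not the answerer's method) is to loop over the `topLinks` blocks and query each one with relative XPaths. The function name `group_links` and the shortened `HTML` sample are assumptions made for illustration.

```python
from lxml.html import fromstring

# Shortened version of the question's markup: two blocks, a few links each.
HTML = '''
<div class="topLinks"><div class="hd-middle middle"><h4>TTTTTTTTTTTTT</h4></div>
<div class="boxMiddle"><ul>
<li><a href="http://register1.sanjesh.org/fanni99up">CCCCCCCCCCCCC</a></li>
<li><a href="http://www6.sanjesh.org/download/fani99/FaniNote99.pdf"> ZZZZZZZZ </a></li>
</ul></div></div>
<div class="topLinks"><div class="hd-middle middle_alter"><h4>CCCCCCCCCCCC</h4></div>
<div class="boxMiddle_alter"><ul>
<li><a href="http://register2.sanjesh.org/RGKhanevadehConsult/">HHHHHHHHH</a></li>
</ul></div></div>
'''


def group_links(html):
    """Map each h4 heading to a list of (link text, href) pairs (hypothetical helper)."""
    tree = fromstring(html)
    grouped = {}
    for block in tree.xpath("//div[@class='topLinks']"):
        # The leading "." keeps the query inside the current block.
        heading = str(block.xpath("normalize-space(.//h4)"))
        grouped[heading] = [
            (a.text_content().strip(), a.get("href"))
            for a in block.xpath(".//a[@href]")
        ]
    return grouped


result = group_links(HTML)
for heading, items in result.items():
    print(heading)
    for text, href in items:
        print("   ", text, "->", href)
```

Querying relative to each block avoids depending on the `boxMiddle` class name, so the same code handles both the regular and the `_alter` variants.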