Consider the following HTML:
<div class="group">
<ul class="smallList">
<li><strong>Date</strong>
some Date
</li>
<li>
<strong>Author</strong>
some Name
</li>
<li>
<strong>Keywords</strong>
<a href="linka"
rel="nofollow">keyworda</a>,
<a href="linkb"
rel="nofollow">Keywordb</a>,
</li>
<li>
<strong>Print</strong>
<a class="icon print" rel="nofollow" href="javascript:window.print()">print page</a>
</li>
</ul>
</div>
<div class="group">
<ul class="smallList">
<li><a href="linkc">Linktext</a></li>
</ul>
</div>
I am looking for keyworda and keywordb, that is, only the words inside the list element that contains Keywords.
I can get the nodes of all links with:
.//div[@class='group']/ul[@class='smallList']/li/a/node()
But how do I select only that specific one?
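I think you want to get the keyword entries with XPath. The contains() function can help you here. I will use the parsel library, simply because it is easy to use. The same can be done in Python with lxml or other libraries.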
from parsel import Selector

data = "[your HTML from above here]"
sel = Selector(data)
# The path looks for hyperlinks that satisfy two conditions:
# 1. href contains "link" AND
# 2. rel contains "nofollow".
# /text() then selects the text of each matching link.
path = ".//a[contains(@href,'link') and contains(@rel,'nofollow')]/text()"
# Extract the text of all matches with getall():
print(sel.xpath(path).getall())
['keyworda', 'Keywordb']
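If the goal is to tie the match to the list item labelled Keywords rather than to the link attributes, a predicate on the li element is another option. The following is only a minimal sketch, not the approach used above: it swaps in the standard library's xml.etree.ElementTree (which supports a limited XPath subset) so it runs without extra packages, and it assumes the markup is well-formed (quotes closed, final div closed) and wrapped in a hypothetical root element. The predicate li[strong='Keywords'] is plain XPath 1.0, so it should carry over to parsel or lxml as well.

```python
import xml.etree.ElementTree as ET

# Assumption: the sample markup with its quoting fixed, wrapped in a
# hypothetical <root> element so that it parses as well-formed XML.
MARKUP = """
<root>
<div class="group">
<ul class="smallList">
<li><strong>Date</strong> some Date</li>
<li><strong>Author</strong> some Name</li>
<li><strong>Keywords</strong>
<a href="linka" rel="nofollow">keyworda</a>,
<a href="linkb" rel="nofollow">Keywordb</a>,
</li>
<li><strong>Print</strong>
<a class="icon print" rel="nofollow" href="javascript:window.print()">print page</a>
</li>
</ul>
</div>
<div class="group">
<ul class="smallList">
<li><a href="linkc">Linktext</a></li>
</ul>
</div>
</root>
"""

root = ET.fromstring(MARKUP)

# Select only the <a> children of the <li> whose <strong> child
# has the text "Keywords".
path = ".//div[@class='group']/ul[@class='smallList']/li[strong='Keywords']/a"
keywords = [a.text for a in root.findall(path)]
print(keywords)
```

Because the filter depends on the Keywords label, it keeps working if the href values change, but it breaks if the label text is different (for example, translated or padded with whitespace).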