Can we use XPath with BeautifulSoup?

Question

I am using BeautifulSoup to scrape a URL, and I have the following code:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})

In the code above, findAll returns the matching tags and the information in them, but I would like to use XPath. Is it possible to use XPath with BeautifulSoup? If so, could anyone provide example code?

Answer 1

No, BeautifulSoup by itself does not support XPath expressions.

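An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode in which it tries to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.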

Once you have parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

import urllib2
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
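
For readers on Python 3, here is a minimal self-contained sketch of the same idea. It parses an inline string instead of fetching a URL; `SAMPLE_HTML` and its contents are made up for illustration, and the XPath expression is only one way to select the cells from the question.

```python
from lxml import etree

# Hypothetical sample markup standing in for the fetched page.
SAMPLE_HTML = """
<html><body>
  <table id="foobar">
    <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
    <tr><td class="empformbody">Bob</td></tr>
  </table>
</body></html>
"""

# etree.HTML() parses a string with lxml's lenient HTML parser
# and returns the root element of the resulting tree.
root = etree.HTML(SAMPLE_HTML)

# Select every <td> whose class attribute is exactly "empformbody".
cells = root.xpath('//td[@class="empformbody"]')
texts = [cell.text for cell in cells]
print(texts)
```

Note that `@class="empformbody"` is an exact string comparison, so a cell with several classes would need a `contains(...)` test instead.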

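Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, which makes your search for td.empformbody that much easier: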

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    pass

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    pass
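
As a rough illustration of the select() route, the sketch below runs the selector above against a small inline document. `SAMPLE_HTML` is invented for the example, and it assumes the bs4 package with the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; not taken from the page in the question.
SAMPLE_HTML = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS selector: <td class="empformbody"> cells inside <table id="foobar">.
names = [cell.get_text(strip=True)
         for cell in soup.select("table#foobar td.empformbody")]
print(names)
```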

Answer 2

I can confirm that there is no XPath support within Beautiful Soup.

Answer 3

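Martijn's code no longer works properly (it is 4+ years old by now...): the etree.parse() line prints to the console and does not assign the value to the tree variable. Referencing this, I was able to figure out that this works using requests and lxml: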

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

Answer 4

BeautifulSoup has a function named findNext that searches forward from the current element, so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')

The code above roughly imitates the following XPath:

div[@class="class_value"]/div[@id="id_value"]
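
A small sketch of that chain on an invented document, assuming bs4 (where the methods are spelled find_next and find_all). The markup, the "root" id, and the variable names are hypothetical. One caveat: find_next looks at everything that follows the element in the document, not only at its children, so the match is looser than the XPath child step. The select() call at the end is one possible stricter alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
SAMPLE_HTML = (
    '<div id="root">'
    '<div class="class_value">'
    '<div id="id_value"><a href="/one">1</a><a href="/two">2</a></div>'
    '</div>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
father = soup.find("div", {"id": "root"})

# Chain of find_next calls, as in the answer above.
links = (father.find_next("div", {"class": "class_value"})
               .find_next("div", {"id": "id_value"})
               .find_all("a"))
hrefs = [a["href"] for a in links]

# CSS selectors can express the parent/child step explicitly.
css_hrefs = [a["href"]
             for a in father.select("div.class_value > div#id_value a")]
print(hrefs, css_hrefs)
```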

Answer 5

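I have searched through their docs and it seems there is no XPath option. Also, as you can see here in a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.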

Answer 6

When you use lxml, it is all simple:

import lxml.html

tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

But when you use BeautifulSoup (BS4), it is all simple too:

First, remove "//" and "@".
Second, add a star before "=".

Try this magic:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')

As you see, this does not support sub-tags, so I removed the "/@href" part.
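
The attribute values that "/@href" would return can still be collected afterwards by reading them off the matched tags. A minimal sketch, with invented markup and the built-in "html.parser" backend assumed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
SAMPLE_HTML = (
    '<p>'
    '<a class="shared-components" href="/a">A</a>'
    '<a class="other" href="/b">B</a>'
    '</p>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns Tag objects; index each tag to read its href attribute.
hrefs = [a["href"] for a in soup.select('a[class*="shared-components"]')]
print(hrefs)
```

Keep in mind that `*=` is a substring match, so it is broader than the exact comparison `@class="shared-components"` in the XPath version.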

Answer 7

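This is a pretty old thread, but there is a workaround now, which may not have been available in BeautifulSoup at the time.

Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It is not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.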

from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
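
A self-contained variant of the same idea, using an inline feed instead of a download. `RSS_TEXT` and its contents are made up, and the "xml" parser in BeautifulSoup needs lxml installed. For comparison, the last lines run the real XPath through lxml directly.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical feed text standing in for the downloaded RSS.
RSS_TEXT = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title></item>
  </channel>
</rss>"""

# Attribute-style navigation: each step picks the first matching descendant.
rss_obj = BeautifulSoup(RSS_TEXT, "xml")
feed_title = rss_obj.rss.channel.title.get_text()

# The same lookup as an actual XPath expression, via lxml.
root = etree.fromstring(RSS_TEXT)
xpath_titles = root.xpath("/rss/channel/title/text()")
print(feed_title, xpath_titles)
```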
