Converting a BeautifulSoup soup to an lxml element

Question

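I want to parse some web pages with BeautifulSoup or lxml. Because the raw data is not clean XML, lxml.etree.fromstring cannot parse it directly. However, BeautifulSoup(page_source, 'lxml') works, and I can get the soup for the page. I also need some lxml features, such as XPath queries. Is there a function or attribute I can call to convert the soup object for the whole page into an etree object? (I assume BeautifulSoup must already turn the raw page into an lxml object before it builds the soup through the lxml parser, but I cannot find where it stores that object.)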

P.S. I tried the answers to "Is it possible to use bs4 soup object with lxml?" to parse the page, but some pages still fail to parse. Here is an example:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen('https://www.nature.com/articles/s41558-019-0619-1').read()

>>> soup = BeautifulSoup(html, 'lxml')  # returns a soup object

>>> from lxml.etree import fromstring
>>> fromstring(soup.prettify())  # raises an error

>>> from lxml.html.soupparser import fromstring
>>> fromstring(soup.prettify())  # raises an error

Answer 1

I still don't know which object you want. But using the requests library, I can print the soup. Is this what you are after?

import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.nature.com/articles/s41558-019-0619-1')
soup = BeautifulSoup(html.content,'lxml')
print(soup.encode("utf-8"))
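
For the XPath part of the question, here is a minimal sketch. It assumes that lxml.html.fromstring (lxml's lenient HTML parser) is an acceptable substitute for lxml.etree.fromstring, which expects well-formed XML. The page_source string is a made-up stand-in for a downloaded page, and none of this has been checked against the nature.com page from the question.

```python
from bs4 import BeautifulSoup
import lxml.html

# Hypothetical malformed HTML standing in for a downloaded page:
# the <p> tags are never closed and </html> is missing.
page_source = "<html><body><div><p>First<p>Second <a href='/x'>link</a></div></body>"

# Option 1: skip the soup and let lxml's HTML parser repair the markup itself.
root = lxml.html.fromstring(page_source)
links = root.xpath("//a/@href")

# Option 2: if a soup object already exists, serialize it and reparse with lxml.
soup = BeautifulSoup(page_source, "lxml")
root_from_soup = lxml.html.fromstring(str(soup))
paragraphs = [p.text_content().strip() for p in root_from_soup.xpath("//p")]

print(links)
print(paragraphs)
```

Option 2 parses the page twice, so Option 1 is probably the cheaper route when the soup is not needed for anything else.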
