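I am trying to scrape the table content from this webpage using the requests module and the BeautifulSoup library.
I have managed to fetch the HTML containing the table, along with some garbled content, from a URL I found in dev tools. However, I still can't parse the table's content because it is full of \t, \n, and so on.
I'd like to get the bold text in each row, such as FEIJAO CARIOCA URBAN.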
from bs4 import BeautifulSoup
import requests
import re

start_url = 'http://nfe.sefaz.go.gov.br/nfeweb/sites/nfce/d/danfeNFCe?p=52240345543915002478650110004179799060499506|2|1|19|118.42|2b4937755a5a76797a522b6435534159784f5859427646356c7a4d3d|1|A763ED574AF1AECE3380D1E3A1EE188E3E95B414'
url = "http://nfe.sefaz.go.gov.br/nfeweb/sites/nfce/render/danfeNFCe?chNFe=52240345543915002478650110004179799060499506"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

with requests.Session() as s:
    s.headers.update(headers)
    s.get(start_url)  # visit the landing page first (presumably to obtain session cookies)
    resp = s.get(url)

soup = BeautifulSoup(resp.text, "lxml")
# The table markup is embedded as a string inside an inline <script> tag
tabular_content = soup.select_one("script[type='text/javascript']:-soup-contains('new DanfeNFCe')").contents[0]
# Grab the single-quoted string that holds the escaped <div>...</div> markup
items = re.findall(r"\'(<div.*div>)\'", tabular_content)[0]
print(items)
A snippet of the table markup looks like this:
\t\t\t\t\t\t\t,\n \t\t\t\t\t\t\tGO\n <\/div>\n <\/div>\n <table border=\"0\" align=\"center\" cellpadding=\"0\" cellspacing=\"0\" id=\"tabResult\" data-filter=\"true\">\n <tr id=\"Item + 1\">\n <td valign=\"top\"><span class=\"txtTit\">FEIJAO CARIOCA URBAN<\/span><span class=\"RCod\">\n \t\t\t\t\t\t\t\t\t\t(C\u00F3digo:\n \t\t\t\t\t\t\t\t\t\t3352048\n \t\t\t\t\t\t\t\t\t\t)\n \t\t\t\t\t\t\t\t\t<\/span><br><span class=\"Rqtd\"><strong>Qtde.:<\/strong>2<\/span><span class=\"RUN\"><strong>UN: <\/strong>un<\/span><span class=\"RvlUnit\"><strong>Vl. Unit.:<\/strong>\n \t\t\t\t\t\t\t\t\t\t \n \t\t\t\t\t\t\t\t\t\t7,99<\/span><\/td>\n <td align=\"right\" valign=\"top\" class=\"txtTit noWrap\">\n \t\t\t\t\t\t\t\t\tVl. Total\n \t\t\t\t\t\t\t\t\t<br><span class=\"valor\">15,98<\/span><\/td>\n <\/tr>\n <tr id=\"Item + 2\">\n <td valign=\"top\"><span class=\"txtTit\">AZEITONA VDE S CAR<\/span><span class=\"RCod\">\n \t\t\t\t\t\t\t\t\t\t(C\u00F3digo:\n \t\t\t\t\t\t\t\t\t\t5881927\n \t\t\t\t\t\t\t\t\t\t)\n \t\t\t\t\t\t\t\t\t<\/span><br><span class=\"Rqtd\"><strong>Qtde.:<\/strong>1<\/span><span class=\"RUN\"><strong>UN: <\/strong>un<\/span><span class=\"RvlUnit\"><strong>Vl. Unit.:<\/strong>\n \t\t\t\t\t\t\t\t\t\t \n \t\t\t\t\t\t\t\t\t\t5,59<\/span><\/td>
Well, I got it working by sticking with the approach I started with. That said, what @Mark suggested in the comments is the better way to do it.
from bs4 import BeautifulSoup
import requests
import re

start_url = 'http://nfe.sefaz.go.gov.br/nfeweb/sites/nfce/d/danfeNFCe?p=52240345543915002478650110004179799060499506|2|1|19|118.42|2b4937755a5a76797a522b6435534159784f5859427646356c7a4d3d|1|A763ED574AF1AECE3380D1E3A1EE188E3E95B414'
url = "http://nfe.sefaz.go.gov.br/nfeweb/sites/nfce/render/danfeNFCe?chNFe=52240345543915002478650110004179799060499506"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

with requests.Session() as s:
    s.headers.update(headers)
    s.get(start_url)  # visit the landing page first (presumably to obtain session cookies)
    resp = s.get(url)

soup = BeautifulSoup(resp.text, "lxml")
# The table markup is embedded as a string inside an inline <script> tag
tabular_content = soup.select_one("script[type='text/javascript']:-soup-contains('new DanfeNFCe')").contents[0]
# Keep only the escaped <table>...</table> part of that string
items = re.findall(r"(<table.*table>)", tabular_content)[0]
# Drop the literal \r and \n sequences, then any remaining backslashes (<\/span> becomes </span>)
cleaned_text = re.sub(r'\\r|\\n|\\', '', items)

# Re-parse the cleaned markup and print the bold product name of each row
soup = BeautifulSoup(cleaned_text, "html.parser")
for item in soup.select("table#tabResult > tr > td > span.txtTit"):
    print(item.get_text(strip=True))
Output:
FEIJAO CARIOCA URBAN
AZEITONA VDE S CAR
REFRES EM PO MID MAR
MARGARINA CLAYBOM C
ARROZ TP1 GAROTINHO
REFRES PO FRISCO ABA
REFRES PO FRISCO MOR
REFRES PO FRISCO LIM
REFRES PO FRISCO LAR
REFRES PO FRISCO LAR
PAPEL HIG FD PERSO V
PAO FRANCES CARREF K
MEXERICA PONKAN CRFO
CER DEVASSA LT 269ML
LIMAO TAHITI CRFO KG
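
Removing backslashes with a regex is enough for the product names, but it leaves escapes such as \u00F3 undecoded. As a possible alternative for the cleanup step, here is a minimal offline sketch that decodes the escaped markup by parsing it as a JSON string. The `raw` sample and the `decode_js_string` helper are hypothetical and only mimic the shape of the snippet shown in the question. The sketch assumes the string uses only escapes that JSON also accepts, which I have not checked against the full page.

```python
import json
import re

# Hypothetical sample shaped like the escaped markup found inside the <script>
# tag; in this raw string the backslashes are literal characters.
raw = (
    r'<table id=\"tabResult\">\n\t<tr><td>'
    r'<span class=\"txtTit\">FEIJAO CARIOCA URBAN<\/span>'
    r'<span class=\"RCod\">(C\u00F3digo: 3352048)<\/span>'
    r'<\/td><\/tr>\n<\/table>'
)


def decode_js_string(escaped: str) -> str:
    """Decode JavaScript-style escapes by parsing the text as a JSON string."""
    # Assumption: only escapes that JSON understands are present
    # (\n, \t, \r, \", \/, \\, \uXXXX). Others, such as \' or \x41,
    # raise json.JSONDecodeError.
    # strict=False also tolerates real (unescaped) control characters.
    return json.loads('"' + escaped + '"', strict=False)


decoded = decode_js_string(raw)

# A plain regex keeps this sketch free of third-party dependencies.
names = re.findall(r'<span class="txtTit">(.*?)</span>', decoded)
print(names)
```

The decoded string is ordinary HTML, so it can be passed to BeautifulSoup(decoded, "html.parser") and queried with the same span.txtTit selector used in the code above.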