How can I scrape the title from each row of a table that contains garbled elements?


I'm trying to scrape the table content from this webpage using the requests module and the BeautifulSoup library.

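I've managed to fetch the HTML that contains the table, along with some garbled content, from a URL I found in the dev tools. However, I still can't parse the table's contents because the markup is full of literal \t, \n, and similar escape sequences.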

I'd like to get the bold text from each row, such as FEIJAO CARIOCA URBAN.

from bs4 import BeautifulSoup
import requests
import re

start_url = 'http://nfe.sefaz.go.gov.br/nfeweb/sites/nfce/d/danfeNFCe?p=52240345543915002478650110004179799060499506|2|1|19|118.42|2b4937755a5a76797a522b6435534159784f5859427646356c7a4d3d|1|A763ED574AF1AECE3380D1E3A1EE188E3E95B414'
url = "http://nfe.sefaz.go.gov.br/nfeweb/sites/nfce/render/danfeNFCe?chNFe=52240345543915002478650110004179799060499506"

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}
with requests.Session() as s:
    s.headers.update(headers)
    s.get(start_url)
    resp = s.get(url)
    soup = BeautifulSoup(resp.text,"lxml")
    tabular_content = soup.select_one("script[type='text/javascript']:-soup-contains('new DanfeNFCe')").contents[0]
    items = re.findall(r"\'(<div.*div>)\'",tabular_content)[0]
    print(items)

A glimpse of the table markup I get back looks like this:

\t\t\t\t\t\t\t,\n            \t\t\t\t\t\t\tGO\n         <\/div>\n      <\/div>\n      <table border=\"0\" align=\"center\" cellpadding=\"0\" cellspacing=\"0\" id=\"tabResult\" data-filter=\"true\">\n         <tr id=\"Item + 1\">\n            <td valign=\"top\"><span class=\"txtTit\">FEIJAO CARIOCA URBAN<\/span><span class=\"RCod\">\n                  \t\t\t\t\t\t\t\t\t\t(C\u00F3digo:\n                  \t\t\t\t\t\t\t\t\t\t3352048\n                  \t\t\t\t\t\t\t\t\t\t)\n                  \t\t\t\t\t\t\t\t\t<\/span><br><span class=\"Rqtd\"><strong>Qtde.:<\/strong>2<\/span><span class=\"RUN\"><strong>UN: <\/strong>un<\/span><span class=\"RvlUnit\"><strong>Vl. Unit.:<\/strong>\n                  \t\t\t\t\t\t\t\t\t\t&nbsp;\n                  \t\t\t\t\t\t\t\t\t\t7,99<\/span><\/td>\n            <td align=\"right\" valign=\"top\" class=\"txtTit noWrap\">\n               \t\t\t\t\t\t\t\t\tVl. Total\n               \t\t\t\t\t\t\t\t\t<br><span class=\"valor\">15,98<\/span><\/td>\n         <\/tr>\n         <tr id=\"Item + 2\">\n            <td valign=\"top\"><span class=\"txtTit\">AZEITONA VDE S CAR<\/span><span class=\"RCod\">\n                  \t\t\t\t\t\t\t\t\t\t(C\u00F3digo:\n                  \t\t\t\t\t\t\t\t\t\t5881927\n                  \t\t\t\t\t\t\t\t\t\t)\n                  \t\t\t\t\t\t\t\t\t<\/span><br><span class=\"Rqtd\"><strong>Qtde.:<\/strong>1<\/span><span class=\"RUN\"><strong>UN: <\/strong>un<\/span><span class=\"RvlUnit\"><strong>Vl. Unit.:<\/strong>\n                  \t\t\t\t\t\t\t\t\t\t&nbsp;\n                  \t\t\t\t\t\t\t\t\t\t5,59<\/span><\/td>
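
To see why ordinary selectors fail on a fragment like the one above, it may help to parse a trimmed copy of it directly. This is a minimal sketch: `raw` is a hypothetical, shortened stand-in for the escaped markup, not the full response. It suggests that the backslashes are literal characters, so they end up inside the attribute values and the class no longer matches.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the escaped fragment shown above.
# The raw-string prefix keeps every backslash as a literal character.
raw = r'<td valign=\"top\"><span class=\"txtTit\">FEIJAO CARIOCA URBAN<\/span><\/td>'

soup = BeautifulSoup(raw, "html.parser")

# The class attribute keeps the stray backslash and quote characters.
span = soup.find("span")
print(span["class"])

# Because of that, a normal class selector does not match the span.
matches = soup.select("span.txtTit")
print(len(matches))
```

The escapes therefore have to be removed or decoded before the fragment is handed to the parser.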
1 Answer

Well, I got it working by sticking with the approach I started with. That said, what @Mark suggested in the comments is the better approach.

from bs4 import BeautifulSoup
import requests
import re

start_url = 'http://nfe.sefaz.go.gov.br/nfeweb/sites/nfce/d/danfeNFCe?p=52240345543915002478650110004179799060499506|2|1|19|118.42|2b4937755a5a76797a522b6435534159784f5859427646356c7a4d3d|1|A763ED574AF1AECE3380D1E3A1EE188E3E95B414'
url = "http://nfe.sefaz.go.gov.br/nfeweb/sites/nfce/render/danfeNFCe?chNFe=52240345543915002478650110004179799060499506"

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}
with requests.Session() as s:
    s.headers.update(headers)
    # Request the landing page first, then fetch the render URL in the same session.
    s.get(start_url)
    resp = s.get(url)
    soup = BeautifulSoup(resp.text,"lxml")
    # The table markup sits inside a JavaScript string in this <script> tag.
    tabular_content = soup.select_one("script[type='text/javascript']:-soup-contains('new DanfeNFCe')").contents[0]
    items = re.findall(r"(<table.*table>)",tabular_content)[0]
    # Drop the literal \r, \n and \t escape sequences, then any remaining backslashes.
    cleaned_text = re.sub(r'\\[rnt]|\\', '', items)
    soup = BeautifulSoup(cleaned_text,"html.parser")
    for item in soup.select("table#tabResult > tr > td > span.txtTit"):
        print(item.get_text(strip=True))

Output:

FEIJAO CARIOCA URBAN
AZEITONA VDE S CAR
REFRES EM PO MID MAR
MARGARINA CLAYBOM C
ARROZ TP1 GAROTINHO
REFRES PO FRISCO ABA
REFRES PO FRISCO MOR
REFRES PO FRISCO LIM
REFRES PO FRISCO LAR
REFRES PO FRISCO LAR
PAPEL HIG FD PERSO V
PAO FRANCES CARREF K
MEXERICA PONKAN CRFO
CER DEVASSA LT 269ML
LIMAO TAHITI CRFO KG
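
The regex in the answer removes every backslash, which is enough for the titles but also turns an escape such as `\u00F3` into the plain text `u00F3`. An alternative sketch is to decode the fragment as a string literal with `json.loads`. It assumes the fragment only uses JSON-compatible escapes (`\n`, `\t`, `\"`, `\/`, `\uXXXX`), which holds for the sample in the question but has not been checked against the full page. The `raw` string and the `decode_js_string` helper are hypothetical names introduced for illustration.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for the escaped fragment from the question.
raw = (
    r'<table id=\"tabResult\">\n\t<tr>\n\t\t<td><span class=\"txtTit\">'
    r'FEIJAO CARIOCA URBAN<\/span><span class=\"RCod\">(C\u00F3digo: 3352048)'
    r'<\/span><\/td>\n\t<\/tr>\n\t<tr>\n\t\t<td><span class=\"txtTit\">'
    r'AZEITONA VDE S CAR<\/span><\/td>\n\t<\/tr>\n<\/table>'
)


def decode_js_string(fragment: str) -> str:
    """Decode backslash escapes by treating the fragment as a JSON string body.

    Assumes the fragment contains no unescaped double quotes and no
    escapes that JSON rejects (for example \\' or \\x41).
    """
    return json.loads('"' + fragment + '"')


html = decode_js_string(raw)
soup = BeautifulSoup(html, "html.parser")

# Collect the bold item title from each row of the table.
titles = [span.get_text(strip=True) for span in soup.select("table#tabResult span.txtTit")]
print(titles)
```

Decoding keeps accented characters such as the one in "Código" intact, at the cost of failing loudly with a `json.JSONDecodeError` if the page ever uses an escape that JSON does not accept.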