BeautifulSoup HTML scraping: how to get a row in tbody, after thead

Question    Votes: 1    Answers: 2

I am interested in learning how to scrape websites. Right now I am learning how to scrape a table from a website, using BeautifulSoup.

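I have a simple HTML table to parse, but when I try to get a row from tbody with BeautifulSoup, I always get the text from thead instead. I wonder if someone could take a look at what is going on. This is the HTML table from which I create the rows object: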

<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
<thead>
    <tr role="row">
        <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
        <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
        <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
        <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
    </tr>
</thead>
<tbody>
    <tr role="row" class="odd">
        <td class="text-center">1</td>
        <td class="text-center">AALI</td>
        <td><a href="/perusahaan-tercatat/profil-perusahaan-tercatat/detail-profile-perusahaan-tercatat/?kodeEmiten=AALI">Astra Agro Lestari Tbk</a></td>
        <td>09 Des 1997</td>
    </tr>
    <tr role="row" class="even">
        <td class="text-center">2</td>
        <td class="text-center">ABBA</td>
        <td><a href="/perusahaan-tercatat/profil-perusahaan-tercatat/detail-profile-perusahaan-tercatat/?kodeEmiten=ABBA">Mahaka Media Tbk</a></td>
        <td>03 Apr 2002</td>
    </tr>

Sorry, I have already read and tried "Beautifulsoup HTML table parsing--only able to get the last row?", but I still cannot get the row; the output is just "[]".

This is the page I want to scrape: https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/

I want to get this row:

<tr role="row" class="odd">
        <td class="text-center">1</td>
        <td class="text-center">AALI</td>
        <td><a href="/perusahaan-tercatat/profil-perusahaan-tercatat/detail-profile-perusahaan-tercatat/?kodeEmiten=AALI">Astra Agro Lestari Tbk</a></td>
        <td>09 Des 1997</td>
    </tr>

I tried to get it, but I always get the text from "thead" instead.

This is my code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = 'https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/'
uClient = uReq(url)
pageHtml = uClient.read()
uClient.close()
pageSoup = soup(pageHtml, "html.parser")
table = pageSoup.findAll('table', id = "companyTable")
table = table[0]
for row in table.findAll('tr'):
    for cell in row.findAll('th'):
        print(cell.text)
html python-3.x beautifulsoup
2 Answers
0
votes

You just need the first tr tag inside tbody, so I would use this:

first_row = s.find('tbody').find('tr')

Here s is the BeautifulSoup object in my case. This is an example:

>>> html = """<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
... <thead>
...     <tr role="row">
...         <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
...         <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
...         <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
...         <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
...     </tr>
... </thead>
... <tbody>
...     <tr role="row" class="odd">
...         <td class="text-center">1</td>
...         <td class="text-center">AALI</td>
...         <td><a href="/perusahaan-tercatat/profil-perusahaan-tercatat/detail-profile-perusahaan-tercatat/?kodeEmiten=AALI">Astra Agro Lestari Tbk</a></td>
...         <td>09 Des 1997</td>
...     </tr>
...     <tr role="row" class="even">
...         <td class="text-center">2</td>
...         <td class="text-center">ABBA</td>
...         <td><a href="/perusahaan-tercatat/profil-perusahaan-tercatat/detail-profile-perusahaan-tercatat/?kodeEmiten=ABBA">Mahaka Media Tbk</a></td>
...         <td>03 Apr 2002</td>
...     </tr>
... """
>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup(html, "html.parser")
>>> first_row = s.find('tbody').find('tr')
>>> first_row
<tr class="odd" role="row">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td><a href="/perusahaan-tercatat/profil-perusahaan-tercatat/detail-profile-perusahaan-tercatat/?kodeEmiten=AALI">Astra Agro Lestari Tbk</a></td>
<td>09 Des 1997</td>
</tr>

This works because find returns only the first element that matches.
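To get every body row rather than only the first, the same idea can be extended with find_all. The following is a minimal sketch, not a definitive solution: the helper name body_rows is made up for illustration, and the markup is a trimmed copy of the table from the question. It looks for td cells inside tbody (the question's loop looks for th, which only exist in thead) and returns an empty list when the table or its tbody is missing. If the site fills the table with JavaScript after the page loads, which is only an assumption here, the downloaded HTML would contain no body rows and the result would be empty.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (most attributes dropped).
html = """
<table id="companyTable">
<thead>
    <tr role="row">
        <th>No</th>
        <th>Kode/Nama Perusahaan</th>
        <th>Nama</th>
        <th>Tanggal Pencatatan</th>
    </tr>
</thead>
<tbody>
    <tr role="row" class="odd">
        <td class="text-center">1</td>
        <td class="text-center">AALI</td>
        <td><a href="/perusahaan-tercatat/profil-perusahaan-tercatat/detail-profile-perusahaan-tercatat/?kodeEmiten=AALI">Astra Agro Lestari Tbk</a></td>
        <td>09 Des 1997</td>
    </tr>
    <tr role="row" class="even">
        <td class="text-center">2</td>
        <td class="text-center">ABBA</td>
        <td><a href="/perusahaan-tercatat/profil-perusahaan-tercatat/detail-profile-perusahaan-tercatat/?kodeEmiten=ABBA">Mahaka Media Tbk</a></td>
        <td>03 Apr 2002</td>
    </tr>
</tbody>
</table>
"""


def body_rows(markup):
    """Return the text of every td cell, one list per tbody row.

    Hypothetical helper for illustration; returns [] when the table
    or its tbody cannot be found in the markup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", id="companyTable")
    if table is None:
        return []
    tbody = table.find("tbody")
    if tbody is None:
        return []
    rows = []
    # find_all returns every match, unlike find, which stops at the first.
    for tr in tbody.find_all("tr"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


rows = body_rows(html)
print(rows[0])  # ['1', 'AALI', 'Astra Agro Lestari Tbk', '09 Des 1997']
```

Checking for None before calling find on the result avoids an AttributeError when the expected element is absent from the HTML that was actually downloaded.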


0
votes

Problem solved.
