How to parse an HTML table with a fixed shape?


I receive an HTML table that always has the same shape. Only the values change each time.

html = '''
<table align="center">
    <tr>
        <th>Name</th>
        <td>NAME A</td>
        <th>Status</th>
        <td class="IN PROGRESS">IN PROGRESS</td>
    </tr>
    <tr>
        <th>Category</th>
        <td COLSPAN="3">CATEGORY A</td>
    </tr>
    <tr>
        <th>Creation date</th>
        <td>13/01/23 23:00</td>
        <th>End date</th>
        <td></td>
    </tr>
</table>
'''

I need to convert it to a DataFrame, but pandas gives me an odd layout:

print(pd.read_html(html)[0])

               0               1           2            3
0           Name          NAME A      Status  IN PROGRESS
1       Category      CATEGORY A  CATEGORY A   CATEGORY A
2  Creation date  13/01/23 23:00    End date          NaN

I think BeautifulSoup is needed here, but I don't know how to use it for this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

Can you help me?

My expected output is this DataFrame:

     Name    Category   Status     Creation date  End date
0  NAME A  CATEGORY A  RUNNING  27/07/2023 11:43       NaN

1 Answer

You can iterate over the <td> elements and store each one in a dict, keyed by the text of its preceding <th> sibling:

{e.find_previous_sibling('th').text:e.text for e in soup.select('table td')}

Example
from bs4 import BeautifulSoup
import pandas as pd

html = '''
<table align="center">
    <tr>
        <th>Name</th>
        <td>NAME A</td>
        <th>Status</th>
        <td class="IN PROGRESS">IN PROGRESS</td>
    </tr>
    <tr>
        <th>Category</th>
        <td COLSPAN="3">CATEGORY A</td>
    </tr>
    <tr>
        <th>Creation date</th>
        <td>13/01/23 23:00</td>
        <th>End date</th>
        <td></td>
    </tr>
</table>
'''

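# Name the parser explicitly; omitting it makes bs4 emit a warning.
soup = BeautifulSoup(html, 'html.parser')

# Pair each <td> with the text of the <th> that precedes it in the same row,
# then wrap the resulting dict in a list to build a one-row DataFrame.
df = pd.DataFrame(
    [
        {e.find_previous_sibling('th').text: e.text for e in soup.select('table td')}
    ]
)
print(df)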
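
The dict approach above leaves the empty End date cell as an empty string and keeps the columns in document order, whereas the expected output in the question shows NaN and a different column order. A minimal sketch of one way to close that gap, assuming an empty cell should count as missing and that the column order from the question is the one wanted; the names row and columns are illustrative only, and the table is a condensed copy of the one in the question:

```python
from bs4 import BeautifulSoup
import pandas as pd

html = '''
<table>
    <tr><th>Name</th><td>NAME A</td><th>Status</th><td>IN PROGRESS</td></tr>
    <tr><th>Category</th><td colspan="3">CATEGORY A</td></tr>
    <tr><th>Creation date</th><td>13/01/23 23:00</td><th>End date</th><td></td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# Map each <th> label to the text of the <td> that follows it.
# Assumption: an empty cell means "missing", so '' becomes NaN.
row = {
    td.find_previous_sibling('th').get_text(strip=True):
        td.get_text(strip=True) or float('nan')
    for td in soup.select('table td')
}

# Assumption: column order copied from the expected output in the question.
columns = ['Name', 'Category', 'Status', 'Creation date', 'End date']
df = pd.DataFrame([row])[columns]
print(df)
```

Because the table shape is fixed, hard-coding the column list is reasonable here; if the labels ever change, the column selection raises a KeyError instead of silently producing a misaligned frame.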