I receive an HTML table that always has the same shape. Only the values differ each time.
html = '''
<table align="center">
<tr>
<th>Name</th>
<td>NAME A</td>
<th>Status</th>
<td class="IN PROGRESS">IN PROGRESS</td>
</tr>
<tr>
<th>Category</th>
<td COLSPAN="3">CATEGORY A</td>
</tr>
<tr>
<th>Creation date</th>
<td>13/01/23 23:00</td>
<th>End date</th>
<td></td>
</tr>
</table>
'''
I need to convert it to a DataFrame, but pandas gives me an odd layout:
print(pd.read_html(html)[0])
0 1 2 3
0 Name NAME A Status IN PROGRESS
1 Category CATEGORY A CATEGORY A CATEGORY A
2 Creation date 13/01/23 23:00 End date NaN
I think we need to use BeautifulSoup, but I don't know how:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
Can you help me?
My expected output is this DataFrame:
Name Category Status Creation date End date
0 NAME A CATEGORY A RUNNING 27/07/2023 11:43 NaN
You can iterate over the <td> elements and store each one in a dict, keyed by the text of the <th> that precedes it:
{e.find_previous_sibling('th').text:e.text for e in soup.select('table td')}
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<table align="center">
<tr>
<th>Name</th>
<td>NAME A</td>
<th>Status</th>
<td class="IN PROGRESS">IN PROGRESS</td>
</tr>
<tr>
<th>Category</th>
<td COLSPAN="3">CATEGORY A</td>
</tr>
<tr>
<th>Creation date</th>
<td>13/01/23 23:00</td>
<th>End date</th>
<td></td>
</tr>
</table>
'''
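# Name the parser explicitly; omitting it makes bs4 guess and emit a warning
soup = BeautifulSoup(html, 'lxml')
# Key each <td> by the text of its preceding <th>; the one-element list yields a single row
pd.DataFrame(
    [
        {e.find_previous_sibling('th').text: e.text for e in soup.select('table td')}
    ]
)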
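Note that the comprehension above stores an empty cell as an empty string, whereas the expected output shows NaN for End date. Below is a minimal sketch of one way to handle that. It assumes every <td> is preceded by its <th> within the same row. The function name `table_to_frame` and the shortened `sample` table are illustrative, not part of the original post. It uses the standard library's `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup
import pandas as pd


def table_to_frame(html: str) -> pd.DataFrame:
    """Build a one-row DataFrame from a table of alternating <th>/<td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    row = {}
    for td in soup.select("table td"):
        # The nearest preceding <th> in the same <tr> is treated as the column name
        th = td.find_previous_sibling("th")
        if th is None:
            continue  # skip cells that have no header to pair with
        value = td.get_text(strip=True)
        # Store empty cells as NaN instead of an empty string
        row[th.get_text(strip=True)] = value if value else float("nan")
    return pd.DataFrame([row])


# Shortened version of the table from the question, for illustration only
sample = """
<table>
<tr><th>Name</th><td>NAME A</td><th>Status</th><td>IN PROGRESS</td></tr>
<tr><th>Category</th><td COLSPAN="3">CATEGORY A</td></tr>
<tr><th>Creation date</th><td>13/01/23 23:00</td><th>End date</th><td></td></tr>
</table>
"""

df = table_to_frame(sample)
print(df)
```

The columns come out in document order (Name, Status, Category, Creation date, End date). If the order from the expected output matters, selecting them explicitly should work, for example `df[["Name", "Category", "Status", "Creation date", "End date"]]`.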