Flatten data from Pandas read_html, but extract links

Question

I have an HTML table like this:

html_text = """<table>
  <tr>
    <th>Home</th>
    <th>Score</th>
    <th>Away</th>
    <th>Report</th>
  </tr>
  <tr>
    <td>Arsenal</td>
    <td></td>
    <td>Manchester Utd</td>
    <td></td>
  </tr>
  <tr>
    <td>Everton</td>
    <td>2-0</td>
    <td>Liverpool</td>
    <td><a href="/asdasdasd/">Match Report</a></td>
  </tr>
</table>"""

I load it via requests and convert it to a DataFrame with pandas:

matches = pd.read_html(StringIO(str(html_text)), extract_links="all")[0]

This now gives me the following:

(Home, None),(Score, None),(Away, None),(Report, None)
(Arsenal, None),(NaN, None),(Manchester Utd, None),(NaN, None)
(Everton, None),(2-0, None),(Liverpool, None),(Match Report, /asdasdasd/)

What is the simplest way to flatten the DataFrame, keeping only the link in the Report column and the text everywhere else?

Home, Score, Away, Report
Arsenal, NaN, Manchester Utd, NaN
Everton, 2-0, Liverpool, /asdasdasd/

Answer 1

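With extract_links, every cell is a (text, link) tuple. For all columns except Report, the data of interest is the first element, so extract the first element from those columns. For the Report column, extract the second element. Also, use extract_links='body' to prevent tuples from being created for the header.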

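from io import StringIO

import pandas as pd

# 'body' extracts links from body cells only, so the header stays plain text
matches = pd.read_html(StringIO(html_text), extract_links='body')[0]

# Every column except Report keeps the cell text (first tuple element)
for col in matches.columns.difference(['Report']):
    matches[col] = matches[col].apply(lambda x: x[0])

# Report keeps the link target (second tuple element)
matches['Report'] = matches['Report'].apply(lambda x: x[1])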

      Home Score            Away       Report
0  Arsenal        Manchester Utd         None
1  Everton   2-0       Liverpool  /asdasdasd/
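
As a variation on the loop above, the per-column logic can be wrapped in a small helper. This is only a minimal sketch and not part of the original answer: flatten_links is a hypothetical name, and the hand-built frame below is a stand-in for what read_html returns with extract_links='body', assuming every cell is a (text, href) tuple.

```python
import pandas as pd


def flatten_links(df, link_cols):
    """Return a copy where link_cols keep the href and all other columns keep the text.

    Hypothetical helper; assumes every cell is a (text, href) tuple.
    """
    out = df.copy()
    for col in out.columns:
        # Index 1 is the href, index 0 is the visible cell text.
        idx = 1 if col in link_cols else 0
        out[col] = out[col].map(lambda cell: cell[idx])
    return out


# Hand-built stand-in for a frame produced with extract_links='body'.
cells = pd.DataFrame(
    {
        "Home": [("Arsenal", None), ("Everton", None)],
        "Score": [(None, None), ("2-0", None)],
        "Away": [("Manchester Utd", None), ("Liverpool", None)],
        "Report": [(None, None), ("Match Report", "/asdasdasd/")],
    }
)

flat = flatten_links(cells, link_cols=["Report"])
print(flat)
```

With real data, the same helper could be applied to the frame returned by read_html, for example flatten_links(matches, link_cols=['Report']) before any columns have been modified. Passing several names in link_cols would keep the href for each of those columns.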
