Web scraping a table with rowspan greater than 1

Question

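I want to scrape the first Wikipedia table from https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_host_cities. The difficulty is that the table has merged cells (some entries have a rowspan greater than 1).

For example, the first entry in the Contests column is 9 and applies to the first 9 rows of the table (it has a rowspan of 9). So when I scrape the data and add it to a pandas DataFrame, I want the first 9 rows of the Contests column to contain the entry "9".

I have tried the following: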

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_host_cities'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Create an empty DataFrame with desired column headers
df = pd.DataFrame(columns=['Contests', 'Country', 'City', 'Venue', 'Year', 'Ref'])

for index, row in enumerate(soup.find_all('tr')):
    if index == 0:  # Skip the first header row
        continue

    cells = row.find_all(['td', 'th'])
    
    country_value = None
    if cells[0].has_attr('rowspan'):
        contests_value = cells[0].get_text(strip=True)
        contests_rowspan = int(cells[0]['rowspan'])
        contests_values = [contests_value] * contests_rowspan # Replicate the value the required number of time
        df = df.append(pd.DataFrame({'Contests': contests_values}), ignore_index=True)

    if cells[1].has_attr('rowspan'):
        country_value = cells[1].get_text(strip=True)
        country_rowspan = int(cells[1]['rowspan'])
        country_values = [country_value] * country_rowspan
        df = df.append(pd.DataFrame({'Country': country_values}), ignore_index=True)

    if cells[2].has_attr('rowspan'):
        print(cells[2])
        city_value = cells[2].get_text(strip=True)
        city_rowspan = int(cells[2]['rowspan'])
        city_values = [city_value] * city_rowspan
        df = df.append(pd.DataFrame({'City': city_values}), ignore_index=True)
    
    venue_value = cells[3].get_text(strip=True)
    year_value = cells[4].get_text(strip=True)
    ref_value = cells[5].get_text(strip=True)
    
    for _ in range(max(contests_rowspan, country_rowspan, city_rowspan)):
            df = df.append({'Venue': venue_value, 'Year': year_value, 'Ref': ref_value}, ignore_index=True)

df.head()

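The problem is that in the first row, cells[0] corresponds to Contests, cells[1] to Country, and cells[2] to City. However, since these 3 entries all have a rowspan greater than 1, they are not included in the HTML of the second row, so in the second row cells[0] corresponds to Venue, cells[1] to Year, and cells[2] to Ref. Note that the rowspans for Contests, Country, and City are not always the same.

I don't know how to fix this.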

Answer 1

In this case, it looks like you can let `pd.read_html` do the heavy lifting for you:

Option 1: `pd.read_html`

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_host_cities'
df = pd.read_html(url)[0] # selecting first table

df.head(2)

   Contests         Country    City                  Venue  Year Ref.
0         9  United Kingdom  London    Royal Festival Hall  1960  [1]
1         9  United Kingdom  London  BBC Television Centre  1963  [2]

Option 2: `for` loop

Using a `for` loop, this could be one approach:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

url = 'https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_host_cities'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

table = soup.find('table')

cols = ['Contests', 'Country', 'City', 'Venue', 'Year', 'Ref.']

rows = []

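for index, row in enumerate(table.find_all('tr')):
    if index == 0:  # skip the header row
        continue

    values = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
    if len(values) != 6:
        # cells covered by a rowspan from an earlier row are absent: pad at the front
        values[:0] = [np.nan]*(6-len(values))

    rows.append(values)

# build the frame once, then forward-fill the padded cells from the row above
df = pd.DataFrame(rows, columns=cols).ffill()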

Output

df.head(2)

  Contests         Country    City                  Venue  Year Ref.
0        9  United Kingdom  London    Royal Festival Hall  1960  [1]
1        9  United Kingdom  London  BBC Television Centre  1963  [2]

# N.B. `pd.read_html` returns `Contests` with dtype `int64`, here you will get `object`.
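
To see what the padding and forward fill in the loop above are doing, here is a toy sketch in plain Python, without pandas. The sample rows are simplified, made-up data (five columns, no Ref. column), and `None` stands in for `NaN`; it only illustrates the idea and is not a replacement for the pandas code.

```python
# Rows as the loop would collect them: a cell with a rowspan only appears
# in the first row it covers, so the following rows come out shorter.
# (Simplified sample data, assumed for illustration.)
raw_rows = [
    ["9", "United Kingdom", "London", "Royal Festival Hall", "1960"],
    ["BBC Television Centre", "1963"],
    ["Brighton", "Brighton Dome", "1974"],
]
n_cols = 5

padded = []
for values in raw_rows:
    values = list(values)  # copy, so raw_rows is left untouched
    # Slice assignment at position 0 inserts the placeholders at the front.
    values[:0] = [None] * (n_cols - len(values))
    padded.append(values)

# Forward fill: replace each placeholder with the value directly above it.
for i in range(1, len(padded)):
    for j in range(n_cols):
        if padded[i][j] is None:
            padded[i][j] = padded[i - 1][j]

for row in padded:
    print(row)
```

After the fill, every row has five entries, and the second and third rows have inherited "9" and "United Kingdom" from the first.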

Explanation

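Create a list `rows` to collect all the rows for use in `pd.DataFrame` after the loop. (Initializing an empty `df` and then repeatedly adding rows to it is very expensive; generally avoid `df.append`, which has been deprecated since pandas 1.4.0 and was removed in pandas 2.0.)

Inside the loop, use a list comprehension to call `get_text` on each element in `row.find_all(['td', 'th'])`, and store the result in a variable `values`.

If `len(values) == 6`, the row is complete. If `len(values) < 6`, values are missing at the start of the list (the logic is hierarchical). So we prepend `(6-len(values))` `NaN` values, so that we can forward fill later. The assignment to `values[:0]` is a slice assignment that inserts the new items at the front of the list in place.

Use `list.append` to add `values` to `rows`.

After the loop, create `df`, and chain `df.ffill` to fill all `NaN` values with the last valid value from the previous row.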
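
The padding-and-forward-fill approach works here because the merged cells are always the leading columns. For tables where a rowspan can sit in any column, one possible alternative is to track, per column, which spans are still "open" and fill them in before consuming the next real cell. The sketch below is only an illustration under assumptions: `expand_rowspans` is a hypothetical helper (not part of pandas or BeautifulSoup), it ignores `colspan`, and it expects each row as a list of `(text, rowspan)` tuples. With BeautifulSoup, such tuples could be built with something like `(cell.get_text(strip=True), int(cell.get('rowspan', 1)))` for each cell.

```python
def expand_rowspans(rows):
    """Expand rows of (text, rowspan) cells into a full rectangular grid.

    Hypothetical helper for illustration; assumes a well-formed table
    without colspan attributes.
    """
    pending = {}  # column index -> (text, number of further rows to fill)
    grid = []
    for cells in rows:
        cells = list(cells)  # copy, so the caller's data is not modified
        out = []
        col = 0
        # Continue while this row has cells left or a span covers this column.
        while cells or col in pending:
            if col in pending:
                # This column is covered by a span opened in an earlier row.
                text, remaining = pending[col]
                out.append(text)
                if remaining == 1:
                    del pending[col]
                else:
                    pending[col] = (text, remaining - 1)
            else:
                # Consume the next real cell and remember its span, if any.
                text, rowspan = cells.pop(0)
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
            col += 1
        grid.append(out)
    return grid


# Made-up sample: the span is in the middle column, so front padding
# would put "d" in the wrong place, while span tracking does not.
sample = [
    [("a", 1), ("b", 2), ("c", 1)],
    [("d", 1), ("e", 1)],
]
print(expand_rowspans(sample))
```

The resulting list of lists can be passed to `pd.DataFrame(grid, columns=cols)` directly, with no forward fill needed.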
