How do I convert a scraped HTML document into a DataFrame?


I am trying to scrape football player data from the FBRef website. What I get back from the site is a bs4.element.ResultSet object.

Code:

import re

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")

# Remove the HTML comment markers so tables placed inside comments are parsed too
comp = re.compile("<!--|-->")
soup = BeautifulSoup(comp.sub("", res.text), "lxml")
all_data = soup.find_all("tbody")  # a bs4.element.ResultSet

player_data = all_data[2]

The data looks like this:

<tr><th class="right" **...** href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td><td **...** data-stat="position">DF</td><td class="left" data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td><td class="center" data-stat="age">24-084</td><td class="center" data-stat="birth_year">2000</td><td**...** </a></td></tr>

<tr><th class="right" **...** href="/en/players/77816c91/Benie-Adama-Traore">Bénie Adama Traore</a></td><td **...** data-stat="position">FW,MF</td><td class="left" data-stat="team"><a href="/en/squads/1df6b87e/Sheffield-United-Stats">Sheffield Utd</a></td><td class="center" data-stat="age">21-119</td><td class="center" data-stat="birth_year">2002 **...** </a></td></tr>
**...**

I want to build a pandas DataFrame from it, for example:

**Name                Position    Team              Age      Birth Year** **...**

Max Aarons            DF          Bournemouth       24       2000

Benie Adama Traore    FW          Sheffield Utd     21       2002
**...**

I looked at similar questions here and tried to apply their solutions, but could not get them to work.

python pandas dataframe web-scraping beautifulsoup
2 Answers
1
vote

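To create a pandas DataFrame from the scraped data, you can iterate over the <tr> tags, extract the relevant fields from each one, and append them to a list. You can then use that list to build the DataFrame. Here is how: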

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")

# As in the question, strip the HTML comment markers before parsing;
# otherwise the third <tbody> may not be the player table.
comp = re.compile("<!--|-->")
soup = BeautifulSoup(comp.sub("", res.text), 'lxml')

player_data = soup.find_all("tbody")[2]

data = []

for row in player_data.find_all("tr"):
    link = row.find("a")
    position = row.find("td", {"data-stat": "position"})
    if link is None or position is None:
        # Skip rows that carry no player data (for example, repeated header rows)
        continue

    name = link.text
    team = row.find("td", {"data-stat": "team"}).text
    age = row.find("td", {"data-stat": "age"}).text
    birth_year = row.find("td", {"data-stat": "birth_year"}).text

    data.append([name, position.text, team, age, birth_year])

df = pd.DataFrame(data, columns=['Name', 'Position', 'Team', 'Age', 'Birth Year'])
print(df)

This code builds a DataFrame with the columns "Name", "Position", "Team", "Age" and "Birth Year" from the scraped data.
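To try the row-by-row approach without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML string is a hypothetical, trimmed-down version of the rows shown in the question (the header-style row and the helper name rows_to_frame are made up for illustration). It also assumes that an age such as "24-084" means years and days, so only the part before the hyphen is kept to match the desired output.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the rows quoted in the question.
SAMPLE_HTML = """
<table><tbody>
<tr><th class="right">1</th>
    <td><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td>
    <td data-stat="position">DF</td>
    <td class="left" data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>
    <td class="center" data-stat="age">24-084</td>
    <td class="center" data-stat="birth_year">2000</td></tr>
<tr><th>Rk</th><th>Player</th><th>Pos</th></tr>
<tr><th class="right">2</th>
    <td><a href="/en/players/77816c91/Benie-Adama-Traore">Bénie Adama Traore</a></td>
    <td data-stat="position">FW,MF</td>
    <td class="left" data-stat="team"><a href="/en/squads/1df6b87e/Sheffield-United-Stats">Sheffield Utd</a></td>
    <td class="center" data-stat="age">21-119</td>
    <td class="center" data-stat="birth_year">2002</td></tr>
</tbody></table>
"""

FIELDS = ("position", "team", "age", "birth_year")
COLUMNS = ["Name", "Position", "Team", "Age", "Birth Year"]


def rows_to_frame(tbody):
    """Turn the <tr> rows of a <tbody> tag into a DataFrame, skipping non-player rows."""
    records = []
    for row in tbody.find_all("tr"):
        link = row.find("a")  # first link in the row is the player link
        cells = {name: row.find("td", {"data-stat": name}) for name in FIELDS}
        if link is None or any(cell is None for cell in cells.values()):
            continue  # e.g. a header-style row with no player data
        records.append({
            "Name": link.get_text(strip=True),
            "Position": cells["position"].get_text(strip=True),
            "Team": cells["team"].get_text(strip=True),
            # Assumption: "24-084" is years-days; keep only the years.
            "Age": int(cells["age"].get_text(strip=True).split("-")[0]),
            "Birth Year": int(cells["birth_year"].get_text(strip=True)),
        })
    return pd.DataFrame(records, columns=COLUMNS)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
df = rows_to_frame(soup.find("tbody"))
print(df)
```

With the live page, the same helper could be called on the tbody tag selected from the parsed response instead of the sample string.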


1
vote

I suggest using pd.read_html to read the HTML straight into a DataFrame:

import re
from io import StringIO

import pandas as pd
import requests

res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")

comp = re.compile("<!--|-->")
df = pd.read_html(StringIO(comp.sub("", res.text)))[2]  # <-- locate the right table

print(df)

Prints:

    Unnamed: 0_level_0       Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Playing Time                     Performance                                        Expected                      Progression             Per 90 Minutes                                                               Unnamed: 36_level_0
                    Rk                   Player             Nation                Pos              Squad                Age               Born           MP  Starts   Min   90s         Gls  Ast  G+A  G-PK  PK  PKatt  CrdY  CrdR       xG  npxG  xAG  npxG+xAG        PrgC  PrgP  PrgR            Gls   Ast   G+A  G-PK  G+A-PK    xG   xAG  xG+xAG  npxG  npxG+xAG             Matches
0                    1               Max Aarons            eng ENG                 DF        Bournemouth             24-085               2000           14      12  1085  12.1           0    1    1     0   0      0     1     0      0.0   0.0  0.8       0.8          19    40    22           0.00  0.08  0.08  0.00    0.08  0.00  0.07    0.07  0.00      0.07             Matches
1                    2       Bénie Adama Traore             ci CIV              FW,MF      Sheffield Utd             21-120               2002            8       3   387   4.3           0    0    0     0   0      0     0     0      0.3   0.3  0.5       0.8           7     9    14           0.00  0.00  0.00  0.00    0.00  0.06  0.13    0.19  0.06      0.19             Matches
2                    3              Tyler Adams             us USA                 MF        Bournemouth             25-044               1999            1       0    20   0.2           0    0    0     0   0      0     0     0      0.0   0.0  0.0       0.0           0     1     0           0.00  0.00  0.00  0.00    0.00  0.00  0.00    0.00  0.00      0.00             Matches
3                    4         Tosin Adarabioyo            eng ENG                 DF             Fulham             26-187               1997           15      13  1173  13.0           1    0    1     1   0      0     1     0      0.6   0.6  0.1       0.6           5    39     3           0.08  0.00  0.08  0.08    0.08  0.04  0.01    0.05  0.04      0.05             Matches
4                    5           Elijah Adebayo            eng ENG                 FW         Luton Town             26-082               1998           23      13  1162  12.9           9    0    9     9   0      0     1     0      5.6   5.6  0.7       6.3          14    19    85           0.70  0.00  0.70  0.70    0.70  0.43  0.05    0.49  0.43      0.49             Matches
5                    6            Simon Adingra             ci CIV                 FW           Brighton             22-088               2002           21      16  1446  16.1           6    1    7     6   0      0     2     0      3.1   3.1  2.3       5.4          72    32   199           0.37  0.06  0.44  0.37    0.44  0.19  0.14    0.34  0.19      0.34             Matches

...
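The frame returned by pd.read_html has two header levels and many more columns than the question asks for. The sketch below shows one possible way to reduce it to the five requested columns. To keep it runnable offline, it uses a small hand-built stand-in (raw) with the same column layout as the printed output; the helper name tidy is made up, and the repeated header row inside the data is an assumption about what the real table may contain.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_html(...)[2]: two header levels, string cells.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Player"),
    ("Unnamed: 3_level_0", "Pos"),
    ("Unnamed: 4_level_0", "Squad"),
    ("Unnamed: 5_level_0", "Age"),
    ("Unnamed: 6_level_0", "Born"),
    ("Performance", "Gls"),
    ("Per 90 Minutes", "Gls"),
])
raw = pd.DataFrame(
    [
        ["1", "Max Aarons", "DF", "Bournemouth", "24-085", "2000", "0", "0.00"],
        ["2", "Bénie Adama Traore", "FW,MF", "Sheffield Utd", "21-120", "2002", "0", "0.00"],
        # Assumed: a header row repeated inside the table body.
        ["Rk", "Player", "Pos", "Squad", "Age", "Born", "Gls", "Gls"],
        ["3", "Tyler Adams", "MF", "Bournemouth", "25-044", "1999", "0", "0.00"],
    ],
    columns=columns,
)


def tidy(frame):
    """Reduce the two-level frame to Name / Position / Team / Age / Birth Year."""
    df = frame.copy()
    # Keep only the bottom header level ("Player", "Pos", ...).
    df.columns = df.columns.get_level_values(-1)
    wanted = ["Player", "Pos", "Squad", "Age", "Born"]
    # Drop repeated header rows and keep the wanted columns.
    df = df.loc[df["Player"] != "Player", wanted]
    df = df.rename(columns={
        "Player": "Name",
        "Pos": "Position",
        "Squad": "Team",
        "Born": "Birth Year",
    }).reset_index(drop=True)
    # Assumption: "24-085" is years-days; keep only the years.
    df["Age"] = df["Age"].astype(str).str.split("-").str[0].astype(int)
    df["Birth Year"] = df["Birth Year"].astype(int)
    return df


tidy_df = tidy(raw)
print(tidy_df)
```

On real data, rows with missing ages or birth years would need handling before the integer conversion, for example with pd.to_numeric(..., errors="coerce").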