Python beginner: converting an HTML document scraped from a website into a DataFrame


I am trying to scrape football player data from the FBRef website. The data I get back from the site is a bs4.element.ResultSet object.

Code:

import re  # needed for re.compile below

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")

comp = re.compile("<!--|-->")
soup = BeautifulSoup(comp.sub("",res.text),'lxml')
all_data = soup.findAll("tbody")
    
player_data = all_data[2]

The data looks like this:

<tr><th class="right" **...** href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td><td **...** data-stat="position">DF</td><td class="left" data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td><td class="center" data-stat="age">24-084</td><td class="center" data-stat="birth_year">2000</td><td**...** </a></td></tr>

<tr><th class="right" **...** href="/en/players/77816c91/Benie-Adama-Traore">Bénie Adama Traore</a></td><td **...** data-stat="position">FW,MF</td><td class="left" data-stat="team"><a href="/en/squads/1df6b87e/Sheffield-United-Stats">Sheffield Utd</a></td><td class="center" data-stat="age">21-119</td><td class="center" data-stat="birth_year">2002 **...** </a></td></tr>
**...**

I want to create a pandas DataFrame from this, such as:

Name Position Team Age Birth Year ...

Max Aarons DF Bournemouth 24 2000

Bénie Adama Traore FW,MF Sheffield Utd 21 2002 ...


Thanks in advance

I looked at similar questions here and tried to apply the solutions, but couldn't make them work.
python pandas beautifulsoup
1 Answer

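To create a pandas DataFrame from the scraped data, you can iterate over the tr tags, extract the relevant information from each one, and append it to a list. Finally, you can use that list to create the DataFrame. Here's how: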

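import re

import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")

# As in the question, strip the HTML comment markers first so that any
# tables wrapped in comments are visible to the parser.
html = re.sub(r"<!--|-->", "", res.text)
soup = BeautifulSoup(html, 'lxml')

# The third tbody is the player table, following the question's indexing.
player_data = soup.find_all("tbody")[2]

data = []

for row in player_data.find_all("tr"):
    link = row.find("a")
    if link is None:
        # Skip rows without a player link (such as a repeated header row,
        # if present) instead of failing on None.
        continue
    name = link.text
    position = row.find("td", {"data-stat": "position"}).text
    team = row.find("td", {"data-stat": "team"}).text
    age = row.find("td", {"data-stat": "age"}).text
    birth_year = row.find("td", {"data-stat": "birth_year"}).text

    data.append([name, position, team, age, birth_year])

df = pd.DataFrame(data, columns=['Name', 'Position', 'Team', 'Age', 'Birth Year'])
print(df)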

This code creates a DataFrame from the scraped data with the columns 'Name', 'Position', 'Team', 'Age', and 'Birth Year'.
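
The question asks for the age as whole years (24), while the age cell holds text such as 24-084. The sketch below shows one way to handle that, and to tolerate rows or cells that are missing, without hitting the network. It runs on a small hand-written HTML sample modelled on the rows in the question. The sample markup, the header row inside it, and the helper names cell_text and rows_to_frame are illustrative assumptions, not FBRef's real page or any library API. It also assumes the part of the age before the hyphen is the number of years.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the scraped tbody. Attributes elided in the
# question are simply left out, and the header row is hypothetical.
SAMPLE_HTML = """
<table><tbody>
<tr><th class="right">1</th>
    <td><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td>
    <td data-stat="position">DF</td>
    <td class="left" data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>
    <td class="center" data-stat="age">24-084</td>
    <td class="center" data-stat="birth_year">2000</td></tr>
<tr><th>Player</th><th>Pos</th></tr>
<tr><th class="right">2</th>
    <td><a href="/en/players/77816c91/Benie-Adama-Traore">Bénie Adama Traore</a></td>
    <td data-stat="position">FW,MF</td>
    <td class="left" data-stat="team"><a href="/en/squads/1df6b87e/Sheffield-United-Stats">Sheffield Utd</a></td>
    <td class="center" data-stat="age">21-119</td>
    <td class="center" data-stat="birth_year">2002</td></tr>
</tbody></table>
"""


def cell_text(row, stat):
    """Return the stripped text of the td with this data-stat, or None."""
    cell = row.find("td", attrs={"data-stat": stat})
    return cell.get_text(strip=True) if cell else None


def rows_to_frame(tbody):
    """Build a DataFrame from the tr rows of a tbody, skipping rows with no link."""
    records = []
    for row in tbody.find_all("tr"):
        link = row.find("a")
        if link is None:
            # No player link: treat the row as a header or spacer and skip it.
            continue
        age = cell_text(row, "age")
        birth_year = cell_text(row, "birth_year")
        records.append({
            "Name": link.get_text(strip=True),
            "Position": cell_text(row, "position"),
            "Team": cell_text(row, "team"),
            # Assumption: "24-084" means 24 years, so keep the part before "-".
            "Age": int(age.split("-")[0]) if age else None,
            "Birth Year": int(birth_year) if birth_year else None,
        })
    return pd.DataFrame(records)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
df = rows_to_frame(soup.find("tbody"))
print(df)
```

To use this on the live page, pass the tbody selected from the scraped soup (player_data in the code above) to rows_to_frame instead of the sample. Building the frame from a list of dicts keeps each value next to its column name, which makes it harder to misalign columns when more fields are added.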
