Why can't I web-scrape the table I want?

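I'm new to BeautifulSoup and wanted to try some web scraping. For my small project, I want to get the Golden State Warriors' win rate from Wikipedia. I was planning to load the table with that information into pandas so that I could plot it over the years. However, my code selects the Table Key table instead of the Seasons table. I know this is because they are the same type of table (wikitable), but I don't know how to fix it. I'm sure there is a simple explanation that I'm missing. Could someone explain how to fix my code, and how to choose which table to scrape in the future? Thanks!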

import urllib.request
from bs4 import BeautifulSoup

c_data = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons"  # Wikipedia page
c_page = urllib.request.urlopen(c_data)
c_soup = BeautifulSoup(c_page, "lxml")
c_table = c_soup.find('table', class_='wikitable')  # this is the problem: find() returns only the first wikitable
c_year = []
c_rate = []
for row in c_table.find_all('tr'):  # setup for dataframe
    cells = row.find_all('td')
    if len(cells) == 13:
        # list.append() returns None, so append in place instead of reassigning
        c_year.append(cells[0].find(text=True))
        c_rate.append(cells[9].find(text=True))
print(c_year, c_rate)
python python-3.x dataframe beautifulsoup wikipedia
1 Answer

Use pd.read_html to get all the tables

  • This function returns a list of DataFrames
import pandas as pd

# read tables
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons')

print(len(tables))
>>> 18

tables[0]
          0                                             1
0       AHC                  NBA All-Star Game Head Coach
1      AMVP            All-Star Game Most Valuable Player
2       COY                             Coach of the Year
3      DPOY                  Defensive Player of the Year
4    Finish          Final position in division standings
5        GB  Games behind first-place team in division[b]
6   Italics                            Season in progress
7    Losses               Number of regular season losses
8       EOY                         Executive of the Year
9      FMVP                   Finals Most Valuable Player
10      MVP                          Most Valuable Player
11      ROY                            Rookie of the Year
12      SIX                         Sixth Man of the Year
13     SPOR                           Sportsmanship Award
14     Wins                 Number of regular season wins

# display all dataframes in tables
# (display() is provided by IPython/Jupyter; use print(table) in a plain script)
for i, table in enumerate(tables):
    print(f'Table {i}')
    display(table)
    print('\n')
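
With 18 tables on the page, it can help to narrow the list while reading. pd.read_html accepts a match argument (a string or regex) and returns only the tables that contain matching text. Here is a minimal sketch that runs offline: SAMPLE_HTML is a made-up, heavily simplified stand-in for the Wikipedia page, and the "Win%" header text is an assumption, so check the real column header before relying on it.

```python
from io import StringIO

import pandas as pd

# Hypothetical, simplified stand-in for the page: a key table and a
# seasons table that both use the "wikitable" class.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><td>Wins</td><td>Number of regular season wins</td></tr>
  <tr><td>Losses</td><td>Number of regular season losses</td></tr>
</table>
<table class="wikitable">
  <tr><th>Season</th><th>Wins</th><th>Losses</th><th>Win%</th></tr>
  <tr><td>2014-15</td><td>67</td><td>15</td><td>.817</td></tr>
  <tr><td>2015-16</td><td>73</td><td>9</td><td>.890</td></tr>
</table>
"""

# match keeps only tables containing text that matches the pattern.
# "Win%" is an assumed header; the key table has "Wins" but not "Win%".
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Win%")

seasons = tables[0]
print(len(tables))
print(seasons)
```

For the real page you would pass the URL instead of StringIO(SAMPLE_HTML). Matching on text is usually sturdier than a hard-coded index, since the index shifts whenever a table is added or removed above the one you want.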

Select a specific table

df_i_want = tables[x]  # x is the specified table, 0 indexed

# delete the list of tables once it is no longer needed
del tables
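
If you would rather stay with BeautifulSoup as in the question, the underlying issue is that find() returns only the first match, which is the Table Key table. One possible fix is to loop over every wikitable with find_all() and keep the one whose header cells contain a label you expect. The sketch below is not tested against the live page: SAMPLE_HTML is a made-up stand-in, find_table_with_header is a hypothetical helper name, and "Win%" is an assumed header text. It uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page: a key table and a
# seasons table that both use the "wikitable" class.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><td>Wins</td><td>Number of regular season wins</td></tr>
  <tr><td>Losses</td><td>Number of regular season losses</td></tr>
</table>
<table class="wikitable">
  <tr><th>Season</th><th>Wins</th><th>Losses</th><th>Win%</th></tr>
  <tr><td>2014-15</td><td>67</td><td>15</td><td>.817</td></tr>
  <tr><td>2015-16</td><td>73</td><td>9</td><td>.890</td></tr>
</table>
"""


def find_table_with_header(soup, header_text):
    """Return the first wikitable with a <th> equal to header_text, else None."""
    for table in soup.find_all("table", class_="wikitable"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_table_with_header(soup, "Win%")  # assumed header text

years, rates = [], []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # The sample rows have 4 cells; the question's check for the real table was 13.
    if len(cells) == 4:
        years.append(cells[0].get_text(strip=True))
        rates.append(cells[3].get_text(strip=True))

print(years, rates)
```

The cell count and column positions are specific to the sample, so adjust them to the real table after inspecting its rows.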