How do I extract only the n-th HTML title attribute in a sequence using BeautifulSoup in Python?


I am trying to extract data about the MVP winners in NBA history from a Wikipedia table (https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award).

Here is my code:

import requests
from bs4 import BeautifulSoup

wik_req = requests.get("https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award")
wik_webpage = wik_req.content
soup = BeautifulSoup(wik_webpage, "html.parser")

# all <a> tags inside the first "wikitable plainrowheaders sortable" table
my_table = soup.find_all("table", {"class": "wikitable plainrowheaders sortable"})[0].find_all("a")
print(my_table)

for x in my_table:
    test = x.get("title")
    print(test)

However, this code prints the title attribute of every link in the table, like this (shortened version):

'1955–56 NBA season
Bob Pettit
Power Forward (basketball)
United States
St. Louis Hawks
1956–57 NBA season
Bob Cousy
Point guard
Boston Celtics'

Ultimately, I want to create a pandas DataFrame in which all the season years are stored in one column, all the players in another column, and so on. What is the trick to print only one kind of title attribute (e.g. only the NBA seasons)? I could then store those in a column to set up my DataFrame, and do the same for player, position, nationality, and team.
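I also thought about looping over the table rows and only taking the first link of each row, roughly like the sketch below (this assumes the season link really is in the first cell of every data row, which I have not verified), but I am not sure this is the right approach:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.find("table", {"class": "wikitable plainrowheaders sortable"})

for row in table.find_all("tr")[1:]:         # skip the header row
    first_cell = row.find(["th", "td"])      # the season seems to be the first cell
    link = first_cell.find("a") if first_cell else None
    if link is not None:
        print(link.get("title"))             # e.g. "1955–56 NBA season"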

python web-scraping beautifulsoup
1 Answer

All you need for that DataFrame is:

import pandas as pd

url = "https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award"
df = pd.read_html(url)[5]   # the MVP winners table is the sixth table on the page

Output:

print(df)
     Season                  Player  ...    Nationality                       Team
0   1955–56             Bob Pettit*  ...  United States            St. Louis Hawks
1   1956–57              Bob Cousy*  ...  United States             Boston Celtics
2   1957–58           Bill Russell*  ...  United States         Boston Celtics (2)
3   1958–59         Bob Pettit* (2)  ...  United States        St. Louis Hawks (2)
4   1959–60       Wilt Chamberlain*  ...  United States      Philadelphia Warriors
..      ...                     ...  ...            ...                        ...
59  2014–15          Stephen Curry^  ...  United States  Golden State Warriors (2)
60  2015–16      Stephen Curry^ (2)  ...  United States  Golden State Warriors (3)
61  2016–17      Russell Westbrook^  ...  United States  Oklahoma City Thunder (2)
62  2017–18           James Harden^  ...  United States        Houston Rockets (4)
63  2018–19  Giannis Antetokounmpo^  ...         Greece        Milwaukee Bucks (4)
[64 rows x 5 columns]
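From there, pulling out a single column (e.g. only the seasons) or stripping the footnote markers from the player names is a small follow-up step. A minimal sketch, assuming the column labels and the "*", "^", "(2)" marker format shown in the output above:

import pandas as pd

url = "https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award"
df = pd.read_html(url)[5]

# strip footnote markers such as "*", "^" and "(2)" (format assumed from the output above)
df["Player"] = df["Player"].str.replace(r"[*^]|\s*\(\d+\)", "", regex=True).str.strip()

seasons = df["Season"]   # one column with just the seasons
print(seasons.head())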