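I have been searching for a while but can't find an answer.
I am trying to scrape the "Results by state" table on Wikipedia at the following link: https://en.wikipedia.org/wiki/2020_United_States_presidential_election#Results_by_state
So far, when I run my code, I can only get it to pull data from the "Joe Biden vs. Donald Trump" table further up the page.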
import requests
from bs4 import BeautifulSoup

website = 'https://en.wikipedia.org/wiki/2020_United_States_presidential_election'
result = requests.get(website)
content = result.text
soup = BeautifulSoup(content, "html.parser")
tables = soup.find("table", class_="wikitable sortable")
for table in tables:
    if 'Results by state' in table.text:
        # Collect the header cells, then the data cells of every row
        headers = [header.text.strip() for header in table.find_all('th')]
        rows = []
        table_rows = table.find_all('tr')
        for row in table_rows:
            td = row.find_all('td')
            cells = [cell.text for cell in td]
            rows.append(cells)
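I'm not sure what is going wrong, since your code seems to work. However, the simplest way to scrape a table is to use pandas.read_html() and match on a pattern in the table: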
import pandas as pd
pd.read_html('https://en.wikipedia.org/wiki/2020_United_States_presidential_election#Results_by_state', match='Results by state')[0]
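To see what match does without hitting Wikipedia, here is a minimal sketch on a small inline page. The HTML, the table contents and the names SAMPLE_HTML, matching and df are made up for illustration; it also assumes an HTML parser that pandas can use (such as lxml) is installed:

```python
from io import StringIO

import pandas as pd

# Made-up sample page: two tables, only the second has the caption we want.
SAMPLE_HTML = """
<table class="wikitable">
  <caption>Some other table</caption>
  <tr><th>Candidate</th><th>Votes</th></tr>
  <tr><td>Candidate A</td><td>1</td></tr>
</table>
<table class="wikitable sortable">
  <caption>Results by state</caption>
  <tr><th>State</th><th>Votes</th></tr>
  <tr><td>Alabama</td><td>10</td></tr>
  <tr><td>Alaska</td><td>20</td></tr>
</table>
"""

# match keeps only tables whose text matches the given string or regex,
# so the first table is skipped even though it appears earlier on the page.
matching = pd.read_html(StringIO(SAMPLE_HTML), match="Results by state")
df = matching[0]
print(df)
```

read_html always returns a list of DataFrames, which is why the [0] is needed even when only one table matches. On the real page several tables may match, so it is worth checking len(matching) before picking one.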
If you want to use BeautifulSoup directly, try to select the table more specifically, for example with CSS selectors:
tables = soup.select('table:has(caption:-soup-contains("Results by state"))')
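As a rough sketch of how that selector could feed the rest of your loop, here it is on a small inline page. The HTML and its values are invented for illustration, SAMPLE_HTML is a hypothetical name, and the :-soup-contains pseudo-class assumes a reasonably recent soupsieve (the selector library that ships with bs4):

```python
from bs4 import BeautifulSoup

# Made-up sample page: two tables, only the second has the caption we want.
SAMPLE_HTML = """
<table class="wikitable">
  <caption>Some other table</caption>
  <tr><th>Candidate</th><th>Votes</th></tr>
  <tr><td>Candidate A</td><td>1</td></tr>
</table>
<table class="wikitable sortable">
  <caption>Results by state</caption>
  <tr><th>State</th><th>Votes</th></tr>
  <tr><td>Alabama</td><td>10</td></tr>
  <tr><td>Alaska</td><td>20</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() always returns a list, so looping over it visits whole tables.
tables = soup.select('table:has(caption:-soup-contains("Results by state"))')

for table in tables:
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip header rows, which contain only <th> cells
            rows.append(cells)

print(headers)
print(rows)
```

Note that soup.find(...) returns a single Tag (the first match), and iterating over a Tag walks its children rather than a list of tables. soup.find_all(...) and soup.select(...) are the calls that return a list you can loop over.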