Why doesn't scraping the second table work in Python?


I want to scrape two tables, but I only get results for the first one. Why? I'm using the same logic for both tables.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL to scrape
url = "https://fbref.com/en/comps/9/keepers/Premier-League-Stats"

# Send a GET request
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Scrape the first table
table1 = soup.find_all('table', attrs={'id': 'stats_squads_keeper_for'})[0]
df1 = pd.read_html(str(table1))[0]

# Scrape the second table
table2 = soup.find_all('table', attrs={'id': 'stats_keeper'})[0]
df2 = pd.read_html(str(table2))[0]

# Print data frames
print(df1) # works fine
print(df2) # comes back empty
Tags: python, pandas, web-scraping, beautifulsoup
1 Answer

If you don't strictly have to do this with bs4, you can also use Selenium, which is well suited to scraping dynamic content. And if you don't want to give up bs4, you can combine Selenium with it.

Here is a working version:

import pandas as pd
from io import StringIO
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://fbref.com/en/comps/9/keepers/Premier-League-Stats"
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get(url)

pd.set_option('display.width', None)

# Selenium renders the page, so both tables are present in the live DOM.
table1 = driver.find_element(By.ID, "stats_squads_keeper_for").get_attribute("outerHTML")
df1 = pd.read_html(StringIO(table1))[0]  # wrap in StringIO: passing literal HTML to read_html is deprecated
print(df1)

table2 = driver.find_element(By.ID, "stats_keeper").get_attribute("outerHTML")
df2 = pd.read_html(StringIO(table2))[0]
print(df2)

df1.to_csv("tab1.csv", encoding="utf-8")
df2.to_csv("tab2.csv", encoding="utf-8")

driver.quit()
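Selenium is one fix, but it may also help to know why the static HTML seems to lack the second table. On fbref-style pages, a plausible explanation is that the per-player table is shipped inside an HTML comment and only uncommented by JavaScript, so requests + BeautifulSoup never sees it as an element. A minimal sketch of recovering a commented-out table with bs4 alone (toy HTML standing in for the real page, not the actual fbref markup):

```python
import pandas as pd
from bs4 import BeautifulSoup, Comment

# Toy HTML standing in for the page: the player table sits inside an
# HTML comment, so a normal find()/find_all() cannot see it.
html = """
<div id="all_stats_keeper">
<!--
<table id="stats_keeper">
  <thead><tr><th>Player</th><th>Saves</th></tr></thead>
  <tbody><tr><td>Alisson</td><td>3</td></tr></tbody>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
assert soup.find("table", id="stats_keeper") is None  # hidden inside the comment

# Re-parse each comment's text; the commented-out markup becomes real elements.
df = None
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    table = BeautifulSoup(comment, "html.parser").find("table", id="stats_keeper")
    if table is not None:
        rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
                for tr in table.find_all("tr")]
        df = pd.DataFrame(rows[1:], columns=rows[0])

print(df)
```

If this is the cause, you could apply the same comment-reparsing loop to the real `response.content` and keep the original requests-based approach, with no browser needed.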