I have been trying to scrape the leaderboard of a coding platform called GeeksForGeeks. The code below should work, but it returns nothing at all.
import requests
from bs4 import BeautifulSoup

try:
    for page in range(1, 3):
        url = 'https://www.geeksforgeeks.org/colleges/lnct-university/students/?page=' + str(page)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Find all user profile divs
        user_profile_divs = soup.find_all('div', class_='UserCodingProfileCard_userCodingProfileCard__0GQCR')
        for user_profile in user_profile_divs:
            # Extract the handle
            user_name = user_profile.find('p', class_='UserCodingProfileCard_userCodingProfileCard_dataDiv_data--linkhandle__lZchE').text
            # The three stats share one class, so take them in document order;
            # calling find() three times would return the same first element
            stats = user_profile.find_all('p', class_='UserCodingProfileCard_userCodingProfileCard_dataDiv_data--value__3A8Kx')
            practice_problem, coding_score, potd_streak = (s.text for s in stats[:3])
            # Print the extracted information
            print(f"User Name: {user_name}")
            print(f"Practice Problem: {practice_problem}")
            print(f"Coding Score: {coding_score}")
            print(f"POTD Streak: {potd_streak}")
            print("\n")
except Exception as e:
    print(e)
The problem is that the data you see on the page is loaded as JSON from an external URL, so BeautifulSoup never sees it in the HTML.
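You can confirm this with a quick sketch (reusing the URL and class name from the question): the card markup never appears in the raw HTML the server returns.

import requests

url = "https://www.geeksforgeeks.org/colleges/lnct-university/students/?page=1"
html = requests.get(url).text

# The profile-card class from the question is absent from the served HTML,
# because the cards are rendered client-side from a JSON API
print("UserCodingProfileCard_userCodingProfileCard__0GQCR" in html)  # expected: False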
To get the data from all pages, you can call that API directly:
import pandas as pd
import requests

api_url = "https://practiceapi.geeksforgeeks.org/api/v1/institute/9162/students/stats"

page, all_data = 1, []
while True:
    print(f"Page {page}...")
    # Request the next page; the API paginates via the `page` query parameter
    data = requests.get(api_url, params={"page_size": 10, "page": page}).json()
    all_data.extend(data["results"])
    # `count` is the total number of students, so stop once we have them all
    if len(all_data) >= data["count"]:
        break
    page += 1

df = pd.DataFrame(all_data)
print(df.head())
Prints:
    user_id                handle  coding_score  total_problems_solved  potd_longest_streak
0   5127866        rishav098kumar          2818                    616                  120
1    945492  Abhishek_Kumar_Verma          2540                   1262                    1
2   4592217     anushkasharma2317          2469                    755                  138
3  10388945             sj502pi26          2268                    614                  176
4  10874753           iascode9w7k          2142                    580                  183
...
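The output above already appears sorted by coding_score; since the original goal was a leaderboard, here is a small follow-up sketch that makes the ranking explicit and saves it (column names are taken from the output above; the CSV file name is just an example):

# Rank students by coding score, highest first
leaderboard = df.sort_values("coding_score", ascending=False).reset_index(drop=True)

# Keep the columns of interest and write them out (file name is illustrative)
cols = ["handle", "coding_score", "total_problems_solved", "potd_longest_streak"]
leaderboard[cols].to_csv("gfg_leaderboard.csv", index=False)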