So I got interested and wrote some code that scrapes the Clash Royale website, extracts information about the different clans in the game, and compiles it into an Excel spreadsheet. I'd like to up the ante by grabbing extra information from every player inside those clans and displaying their war stats.
What I have right now is more or less a script that scrapes each clan and outputs an Excel spreadsheet listing every member of every clan, sorted accordingly.
My problem is collecting the war stats, because that requires Selenium to load the site's JavaScript (I'm very new to Python, by the way, so even getting things to work has been a headache, haha). Fortunately I don't seem far from the main goal, but I'm genuinely stuck on how to get it all working and displaying the way I want.
So I currently have 2 scripts:
In the first script, let's call it "clan_data_scraper.py", I scrape the clan data and list all the members of each clan. The second script, let's call it "player_war_scraper.py", prompts for a player_tag on the command line and then displays that player's war stats. A player_tag is the unique identifier Clash Royale assigns to every individual account.
What I want to do: how do I link these two scripts together? When I run clan_data_scraper.py, I want it to generate all the clan info, players, etc. that it already produces, but also output 7 additional Excel files named "Aftermath", "Aftershock", "Afterlife", "Afterglow", "Afterparty", "Afterburn" and "Aftertaste" (basically the clan names). For example, in the Aftermath Excel file, I'd like player_war_scraper.py to run as part of clan_data_scraper.py, so that the file contains one worksheet per player in the aforementioned clan Aftermath, with each sheet labelled with that player's name and player tag and holding their war stats. I know the "Aftermath" file could have at most 50 sheets, since that's the maximum number of players a clan can have, but I'd like it to generate exactly as many sheets as there are players in the clan.
Hopefully that makes sense. Below are the 2 scripts I created; let me know if you'd like me to elaborate on anything. By the way, 90% of this was generated with ChatGPT, sorry 🙃
First script: clan_data_scraper.py
import re
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment, PatternFill, Border, Side
from openpyxl.drawing.image import Image as xlImage  # Rename Image to xlImage
from datetime import datetime

# List of clan URLs with corresponding names
clan_urls = [
    ('EXAMPLE URL', 'Aftermath'),
    ('EXAMPLE URL', 'Aftershock'),
    ('EXAMPLE URL', 'Afterlife'),
    ('EXAMPLE URL', 'Afterglow'),
    ('EXAMPLE URL', 'Afterparty'),
    ('EXAMPLE URL', 'Aftertaste'),
    ('EXAMPLE URL', 'Afterburn'),
    # Add more clan URLs as needed
]

# Define column widths
column_widths = {'A': 13, 'B': 25, 'C': 5, 'D': 8, 'E': 20, 'G': 13, 'H': 20}

# Create a new workbook
wb = Workbook()

# Load the image
img_path = "C:\\Users\\mmoor\\Desktop\\Python Script\\button.png"

async def fetch_clan_data(session, url, clan_name):
    async with session.get(url) as response:
        content = await response.text()
        return clan_name, content

async def fetch_all_clan_data():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_clan_data(session, url, clan_name) for url, clan_name in clan_urls]
        return await asyncio.gather(*tasks)

async def main():
    clan_data = await fetch_all_clan_data()

    # Iterate over each clan's data
    for clan_name, content in clan_data:
        soup = BeautifulSoup(content, 'lxml')

        # Find the divs containing headers and player rows
        headers_div = soup.find('div', class_='clan__headers')
        rows_divs = soup.find_all('div', class_='clan__rowContainer')

        # Extract header names
        headers = ['Player Tag', 'Name', 'Level', 'Trophies', 'Player Page']

        # Extract player data for Name, Level, Trophies, and Player Page
        players_data = []
        for row_div in rows_divs:
            player_data = [data.get_text(strip=True) for index, data in enumerate(row_div.find_all('div')) if index in [1, 2]]
            trophies = row_div.find('div', class_='clan__cup')
            player_data.append(trophies.get_text(strip=True) if trophies else '')  # If trophies exist, append them, otherwise append an empty string

            # Extract player tag from the profile link
            profile_link = row_div.find('a', class_='ui__blueLink')['href']
            player_tag = re.findall(r'[\w]+$', profile_link)[0]  # Extract player tag from the href link
            player_data.insert(0, player_tag)  # Insert player tag

            # Construct the royaleapi.com player page link
            player_page_link = f'https://royaleapi.com/player/{player_tag}'
            player_data.append(player_page_link)  # Append player page link

            # Replace emojis with spaces in the name using regular expression
            player_data[1] = re.sub(r'[^\w\s]', ' ', player_data[1])

            players_data.append(player_data)

        # Create a new worksheet for the current clan with the clan name
        ws = wb.create_sheet(title=clan_name)

        # Write data to the worksheet
        ws.append(headers)
        for idx, player in enumerate(players_data, start=2):  # Start from the second row to account for headers
            # Apply formatting to Player Page column
            player[-1] = f'=HYPERLINK("{player[-1]}", "View on RoyaleAPI")'
            # Write row to worksheet
            ws.append(player)

        # Apply formatting to headers
        for row in ws.iter_rows(min_row=1, max_row=1):
            for cell in row:
                cell.font = Font(size=9)
                cell.alignment = Alignment(horizontal='center', vertical='center')

        # Apply formatting to "View on RoyaleAPI" hyperlink text
        for row in ws.iter_rows(min_row=2, max_row=ws.max_row, min_col=5, max_col=5):
            for cell in row:
                cell.font = Font(color='31869B', underline='single')

        # Merge Player Page header
        ws.merge_cells('E1:E1')
        ws['E1'].alignment = Alignment(horizontal='center', vertical='center')

        # Adjust column widths
        for col, width in column_widths.items():
            ws.column_dimensions[col].width = width

        # Align all cells in column E to middle and center
        for cell in ws['E']:
            cell.alignment = Alignment(horizontal='center', vertical='center')

        # Align all cells in columns C and D to middle and center
        for col in ['C', 'D']:
            for cell in ws[col]:
                cell.alignment = Alignment(horizontal='center', vertical='center')

        # Apply alternating row fill colors
        for i, row in enumerate(ws.iter_rows(min_row=2, max_row=ws.max_row, min_col=1, max_col=5), start=2):
            if i % 2 == 0:
                fill_color = 'FFFFFF'  # White color
            else:
                fill_color = 'D9D9D9'  # Light gray color
            for cell in row:
                cell.fill = PatternFill(start_color=fill_color, end_color=fill_color, fill_type='solid')

        # Add borders to all cells with color #BFBFBF
        border = Border(left=Side(border_style='thin', color='BFBFBF'),
                        right=Side(border_style='thin', color='BFBFBF'),
                        top=Side(border_style='thin', color='BFBFBF'),
                        bottom=Side(border_style='thin', color='BFBFBF'))
        for row in ws.iter_rows():
            for cell in row:
                cell.border = border

        # Display clan information
        members_count = len(players_data)
        last_updated = datetime.now().strftime("%d/%m/%Y | %H:%M:%S")  # Corrected date and time format
        ws['G2'] = 'Clan: '
        ws['H2'] = f'{clan_name}'
        ws['G3'] = 'Members: '
        ws['H3'] = f'{members_count}'
        ws['G4'] = 'Last Updated: '
        ws['H4'] = f'{last_updated}'

        # Apply formatting to clan information
        info_cells = ['G2', 'H2', 'G3', 'H3', 'G4', 'H4']
        for cell in info_cells:
            ws[cell].font = Font(size=11)
            ws[cell].alignment = Alignment(horizontal='right', vertical='center')

        # Apply fill colors to clan information
        ws['G2'].fill = PatternFill(start_color='FFFFFF', end_color='FFFFFF', fill_type='solid')  # White fill
        ws['H2'].fill = PatternFill(start_color='FFFFFF', end_color='FFFFFF', fill_type='solid')  # White fill
        ws['G3'].fill = PatternFill(start_color='D9D9D9', end_color='D9D9D9', fill_type='solid')  # Light gray fill
        ws['H3'].fill = PatternFill(start_color='D9D9D9', end_color='D9D9D9', fill_type='solid')  # Light gray fill
        ws['G4'].fill = PatternFill(start_color='FFFFFF', end_color='FFFFFF', fill_type='solid')  # White fill
        ws['H4'].fill = PatternFill(start_color='FFFFFF', end_color='FFFFFF', fill_type='solid')  # White fill

        # Apply borders to clan information
        for cell in info_cells:
            ws[cell].border = border

        # Align text in Column G to the right and text in Column H to the left within the clan information
        for row in ws.iter_rows(min_row=2, max_row=4, min_col=7, max_col=8):
            for cell in row:
                if cell.column == 7:  # Column G
                    cell.alignment = Alignment(horizontal='right', vertical='center')
                else:  # Column H
                    cell.alignment = Alignment(horizontal='left', vertical='center')

# Run the asyncio event loop
asyncio.run(main())

# Remove the default sheet
default_sheet = wb['Sheet']
wb.remove(default_sheet)

# Save the workbook
wb.save('clan_data.xlsx')
print("Excel file generated successfully.")
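As an aside, two of the regex steps in clan_data_scraper.py are worth sanity-checking in isolation, since they quietly shape the output. The example link and name below are made up for illustration; the second pattern strips all punctuation, not just emoji:

```python
import re

# Tag extraction: the profile href ends with the player tag, so a
# trailing-word-characters match pulls it out. (Made-up example href.)
profile_link = "https://royaleapi.com/player/ABC123"
player_tag = re.findall(r"\w+$", profile_link)[0]
print(player_tag)  # ABC123

# "Emoji" cleanup: [^\w\s] matches ALL punctuation and symbols, so a
# name like "Mr. King!" loses its dot and exclamation mark, not just emoji.
name = "Mr. King! \U0001F451"
cleaned = re.sub(r"[^\w\s]", " ", name)
print(cleaned.strip())
```

If only emoji should be removed while keeping punctuation, the pattern would need to target specific Unicode ranges instead of `[^\w\s]`.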
Second script: player_war_scraper.py
import sys
import time
import io
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Placeholder value for the number of rows of information to generate
num_rows = 12

def scrape_clan_wars_info(player_tag):
    # Construct the URL with the player tag variable
    url = f"EXAMPLE URL"

    # Configure Chrome options for headless mode and disable image loading
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run Chrome in headless mode
    chrome_options.add_argument("--disable-gpu")  # Disable GPU acceleration
    chrome_options.add_argument("--disable-infobars")  # Disable info bars
    chrome_options.add_argument("--disable-dev-shm-usage")  # Disable /dev/shm usage
    chrome_options.add_argument("--no-sandbox")  # Disable sandbox mode
    chrome_options.add_argument("--disable-extensions")  # Disable extensions
    chrome_options.add_argument("--disable-browser-side-navigation")  # Disable browser side navigation
    chrome_options.add_argument("--disable-features=VizDisplayCompositor")  # Disable viz display compositor
    chrome_options.add_argument("--start-maximized")  # Start maximized
    chrome_options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # Disable image loading

    # Initialize Chrome WebDriver with configured options
    driver = webdriver.Chrome(options=chrome_options)

    # Set implicit wait time
    driver.implicitly_wait(10)

    # Open the webpage
    driver.get(url)

    try:
        # Switch to the iframe
        driver.switch_to.frame("sp_message_iframe_1104950")

        # Switch to the document within the iframe
        iframe_document = driver.find_element(By.TAG_NAME, "html")
        driver.switch_to.frame(iframe_document)

        # Locate the "Accept" button within the dropdown
        accept_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Accept')]")

        # Click the "Accept" button
        accept_button.click()

        # Switch back to the default content
        driver.switch_to.default_content()

        # Click the button using JavaScript
        load_button = driver.find_element(By.CLASS_NAME, "cw2_history_button")
        driver.execute_script("arguments[0].click();", load_button)

        # Wait for the table to be loaded
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "player__cw2_history_table")))

        # Get the table headers with their respective data content
        headers = [th.get_attribute("data-content") for th in table.find_elements(By.TAG_NAME, "th")]

        # Specify the columns to select
        selected_columns = ['Season ID', 'Date', 'Clan Name', 'Decks Used', 'Fame']

        # Get the table data
        table_html = table.get_attribute('outerHTML')

        # Read the HTML into a DataFrame
        df = pd.read_html(table_html)[0]

        # Rename the columns with their respective data content
        df.columns = headers

        # Select only the specified columns
        df_selected_columns = df[selected_columns]

        # Select only the specified number of rows
        df_selected_rows = df_selected_columns.head(num_rows)

        # Export the DataFrame to an Excel file
        file_name = f"Clan_Wars_2_History_Selected_{player_tag}.xlsx"
        df_selected_rows.to_excel(file_name, index=False)
        print(f"Selected {num_rows} rows of Clan Wars 2 history with specified columns successfully exported to {file_name}")
    finally:
        # Close the WebDriver
        driver.quit()

# Check if player tag is provided as command-line argument
if len(sys.argv) != 2:
    print("Please provide the player tag:")
    sys.exit(1)

player_tag = sys.argv[1]
scrape_clan_wars_info(player_tag)
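One small change that would make linking possible later: player_war_scraper.py currently runs its command-line handling at import time, so another script can't `import` it without triggering a scrape. Moving that code under a `__main__` guard fixes this. Below is a minimal sketch of the pattern, with the body of scrape_clan_wars_info stubbed out for illustration (the real Selenium logic would stay unchanged):

```python
import sys

def scrape_clan_wars_info(player_tag):
    # Stub standing in for the real Selenium logic; it shows that the
    # function can return its result instead of only saving its own file.
    return f"Clan_Wars_2_History_Selected_{player_tag}.xlsx"

# This block only runs when the file is executed directly,
# never when it is imported by clan_data_scraper.py.
if __name__ == "__main__":
    if len(sys.argv) == 2:
        print(scrape_clan_wars_info(sys.argv[1]))
    else:
        print("Usage: player_war_scraper.py <player_tag>")
```

With the guard in place, `from player_war_scraper import scrape_clan_wars_info` works cleanly from the other script.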
As I said before, these scripts are 90% ChatGPT-generated, and I'm aware the code could probably be shortened or that there are better ways to output some of this. For now, though, I'm just looking to solve the problem described above.
I've asked ChatGPT the same thing, but its answers output each player tag as a separate Excel file, without even sorting them by clan.
To extract the war stats with Selenium, first make sure you're targeting the right elements, i.e. the ones that only exist after the JavaScript has run. Use Selenium's WebDriverWait with expected conditions to wait for those elements to appear, then extract the data and append it to your existing Excel file structure. Since you're new to Python, start with basic element lookups such as driver.find_element(By.ID, ...) or driver.find_element(By.XPATH, ...) (note that the older find_element_by_id-style helpers were removed in Selenium 4, which your second script is already using). For the Excel side, pandas makes it very convenient to manage and export data. Break the problem into smaller tasks and tackle them one by one. Happy scraping!
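To sketch how the two scripts could be wired together: clan_data_scraper.py already extracts each clan's members and tags, so the glue is a loop that calls the war scraper per tag and groups results per clan. The sketch below is an assumption about the shapes involved, not the real scripts: it presumes scrape_clan_wars_info is refactored to return its rows instead of saving a file, and collect_war_stats is a made-up helper name. A stub stands in for the Selenium scraper in the demo:

```python
def collect_war_stats(players_by_clan, fetch_stats):
    """Build {clan: {sheet_title: rows}}, one dict per clan workbook.

    players_by_clan: {clan_name: [(player_name, player_tag), ...]},
        i.e. what clan_data_scraper.py already extracts per clan.
    fetch_stats: callable taking a player_tag and returning that
        player's war-stat rows (the refactored scrape_clan_wars_info).
    """
    stats = {}
    for clan, members in players_by_clan.items():
        # One sheet per member, titled "Name (TAG)"; the sheet count
        # automatically matches the clan's actual member count.
        stats[clan] = {
            f"{name} ({tag})": fetch_stats(tag) for name, tag in members
        }
    return stats

# Demo with a stub in place of the Selenium scraper:
demo = collect_war_stats(
    {"Aftermath": [("Alice", "ABC123"), ("Bob", "DEF456")]},
    lambda tag: [{"Season ID": 1, "Fame": 1600}],
)
print(sorted(demo["Aftermath"]))  # one sheet title per member
```

Each clan's dict could then be written out with pandas.ExcelWriter, one to_excel(writer, sheet_name=...) call per player, producing the seven per-clan files. Keep in mind Excel caps sheet names at 31 characters, so long "Name (TAG)" titles may need truncating.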