Python / web scraping | How do I use Selenium to pull information from different websites and compile it into an Excel file?


So, out of interest, I've written some code that scrapes the Clash Royale website, pulls information about the different clans in the game, and compiles it into an Excel spreadsheet. I'd like to up the ante and pull more information from each player within those clans, showing their war stats.

What I have right now is more or less a script that scrapes each clan and outputs an Excel spreadsheet listing every member of each clan, sorted accordingly.

My problem lies in collecting their war stats, because that requires Selenium to load the JavaScript on the site (by the way, I'm very new to Python, so even getting this far has been a headache, haha). Luckily I don't seem to be far from the main goal, but I'm genuinely stuck on how to wire it all together and display things the way I want.

So I currently have 2 scripts:

In the first script, let's call it clan_data_scraper.py, I scrape the clan data and list all the members within each clan. As for the second script, let's call it player_war_scraper.py, I've made it so a player_tag is supplied on the command line and the script outputs that player's war stats. The player_tag is the unique identifier assigned to each individual account in Clash Royale.
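
For reference, the second script is run from the command line with the player tag as its only argument (the tag below is just a placeholder, not a real account):

python player_war_scraper.py PLAYERTAG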

What I'm trying to do: how do I link these two scripts together? What I want is that when I run the clan_data_scraper.py script, it generates all the clan information, players, etc. that it already does, and in addition outputs 7 more Excel files labelled "Aftermath", "Aftershock", "Afterlife", "Afterglow", "Afterparty", "Afterburn" and "Aftertaste", basically the clan names. Taking the Aftermath Excel file as an example, I'd like the player_war_scraper.py logic to also run inside clan_data_scraper.py, so that the file contains one worksheet per player with all their relevant war stats, each sheet labelled with the player's name and player tag, for every member of the aforementioned clan Aftermath. I know the "Aftermath" file would have at most 50 sheets, since that's the maximum number of players a clan can have, but I'd like it to generate exactly as many sheets as there are players currently in the clan.
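
To illustrate what I mean, here is a rough sketch (not working code) of the structure I have in mind; it assumes scrape_clan_wars_info() is refactored to return a DataFrame instead of saving its own file, and build_clan_war_workbook is just a placeholder name:

from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

from player_war_scraper import scrape_clan_wars_info  # assumed to return a DataFrame

def build_clan_war_workbook(clan_name, players_data):
    """Create one workbook per clan, with one sheet of war stats per player.

    players_data is the list already built in clan_data_scraper.main():
    [player_tag, name, level, trophies, player_page_link] per player.
    """
    wb = Workbook()
    wb.remove(wb.active)  # drop the default empty sheet

    for player in players_data:
        player_tag, name = player[0], player[1]
        df = scrape_clan_wars_info(player_tag)  # DataFrame of that player's war history

        # Excel sheet names are capped at 31 characters and must be unique
        ws = wb.create_sheet(title=f"{name} {player_tag}"[:31])
        for row in dataframe_to_rows(df, index=False, header=True):
            ws.append(row)

    wb.save(f"{clan_name}.xlsx")

# Inside clan_data_scraper.main(), after players_data has been built for a clan:
#     build_clan_war_workbook(clan_name, players_data)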

Hopefully that makes sense. Below are the 2 scripts I've created; if you'd like me to elaborate on anything, just say so. By the way, 90% of this was generated with ChatGPT, sorry 🙃

First script: clan_data_scraper.py

import re
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment, PatternFill, Border, Side
from openpyxl.drawing.image import Image as xlImage  # Rename Image to xlImage
from datetime import datetime

# List of clan URLs with corresponding names
clan_urls = [
    ('EXAMPLE URL', 'Aftermath'),
    ('EXAMPLE URL', 'Aftershock'),
    ('EXAMPLE URL', 'Afterlife'),
    ('EXAMPLE URL', 'Afterglow'),
    ('EXAMPLE URL', 'Afterparty'),
    ('EXAMPLE URL', 'Aftertaste'),
    ('EXAMPLE URL', 'Afterburn'),
    # Add more clan URLs as needed
]

# Define column widths
column_widths = {'A': 13, 'B': 25, 'C': 5, 'D': 8, 'E': 20, 'G': 13, 'H': 20}

# Create a new workbook
wb = Workbook()

# Path to a button image (defined here but not actually used anywhere below)
img_path = "C:\\Users\\mmoor\\Desktop\\Python Script\\button.png"

async def fetch_clan_data(session, url, clan_name):
    async with session.get(url) as response:
        content = await response.text()
        return clan_name, content

async def fetch_all_clan_data():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_clan_data(session, url, clan_name) for url, clan_name in clan_urls]
        return await asyncio.gather(*tasks)

async def main():
    clan_data = await fetch_all_clan_data()

    # Iterate over each clan's data
    for clan_name, content in clan_data:
        soup = BeautifulSoup(content, 'lxml')

        # Find the divs containing headers and player rows (headers_div is currently unused)
        headers_div = soup.find('div', class_='clan__headers')
        rows_divs = soup.find_all('div', class_='clan__rowContainer')

        # Extract header names
        headers = ['Player Tag', 'Name', 'Level', 'Trophies', 'Player Page']

        # Extract player data for Name, Level, Trophies, and Player Page
        players_data = []
        for row_div in rows_divs:
            player_data = [data.get_text(strip=True) for index, data in enumerate(row_div.find_all('div')) if index in [1, 2]]
            trophies = row_div.find('div', class_='clan__cup')
            player_data.append(trophies.get_text(strip=True) if trophies else '')  # If trophies exist, append them, otherwise append an empty string
            # Extract player tag from the profile link
            profile_link = row_div.find('a', class_='ui__blueLink')['href']
            player_tag = re.findall(r'[\w]+$', profile_link)[0]  # Extract player tag from the href link
            player_data.insert(0, player_tag)  # Insert player tag
            # Construct the royaleapi.com player page link
            player_page_link = f'https://royaleapi.com/player/{player_tag}'
            player_data.append(player_page_link)  # Append player page link
            
            # Replace emojis with spaces in the name using regular expression
            player_data[1] = re.sub(r'[^\w\s]', ' ', player_data[1])
            
            players_data.append(player_data)

        # Create a new worksheet for the current clan with the clan name
        ws = wb.create_sheet(title=clan_name)

        # Write data to the worksheet
        ws.append(headers)
        for idx, player in enumerate(players_data, start=2):  # Start from the second row to account for headers
            # Apply formatting to Player Page column
            player[-1] = f'=HYPERLINK("{player[-1]}", "View on RoyaleAPI")'
            
            # Write row to worksheet
            ws.append(player)

        # Apply formatting to the header row (font size and alignment; no bold is actually set)
        for row in ws.iter_rows(min_row=1, max_row=1):
            for cell in row:
                cell.font = Font(size=9)
                cell.alignment = Alignment(horizontal='center', vertical='center')

        # Apply formatting to "View on RoyaleAPI" hyperlink text
        for row in ws.iter_rows(min_row=2, max_row=ws.max_row, min_col=5, max_col=5):
            for cell in row:
                cell.font = Font(color='31869B', underline='single')

        # Centre the Player Page header (merging the single cell E1:E1 is effectively a no-op)
        ws.merge_cells('E1:E1')
        ws['E1'].alignment = Alignment(horizontal='center', vertical='center')

        # Adjust column widths
        for col, width in column_widths.items():
            ws.column_dimensions[col].width = width

        # Align all cells in column E to middle and center
        for cell in ws['E']:
            cell.alignment = Alignment(horizontal='center', vertical='center')

        # Align all cells in column C and D to middle and center
        for col in ['C', 'D']:
            for cell in ws[col]:
                cell.alignment = Alignment(horizontal='center', vertical='center')

        # Apply alternating row fill colors
        for i, row in enumerate(ws.iter_rows(min_row=2, max_row=ws.max_row, min_col=1, max_col=5), start=2):
            if i % 2 == 0:
                fill_color = 'FFFFFF'  # White color
            else:
                fill_color = 'D9D9D9'  # Light gray color
            for cell in row:
                cell.fill = PatternFill(start_color=fill_color, end_color=fill_color, fill_type='solid')

        # Add borders to all cells with color #BFBFBF
        border = Border(left=Side(border_style='thin', color='BFBFBF'),
                        right=Side(border_style='thin', color='BFBFBF'),
                        top=Side(border_style='thin', color='BFBFBF'),
                        bottom=Side(border_style='thin', color='BFBFBF'))

        for row in ws.iter_rows():
            for cell in row:
                cell.border = border

        # Display clan information
        members_count = len(players_data)
        last_updated = datetime.now().strftime("%d/%m/%Y | %H:%M:%S")  # dd/mm/yyyy | HH:MM:SS
        ws['G2'] = 'Clan: '
        ws['H2'] = clan_name
        ws['G3'] = 'Members: '
        ws['H3'] = str(members_count)
        ws['G4'] = 'Last Updated: '
        ws['H4'] = last_updated

        # Apply formatting to clan information
        info_cells = ['G2', 'H2', 'G3', 'H3', 'G4', 'H4']
        for cell in info_cells:
            ws[cell].font = Font(size=11)
            ws[cell].alignment = Alignment(horizontal='right', vertical='center')

        # Apply fill colors to clan information
        ws['G2'].fill = PatternFill(start_color='FFFFFF', end_color='FFFFFF', fill_type='solid')  # White fill
        ws['H2'].fill = PatternFill(start_color='FFFFFF', end_color='FFFFFF', fill_type='solid')  # White fill
        ws['G3'].fill = PatternFill(start_color='D9D9D9', end_color='D9D9D9', fill_type='solid')  # Light gray fill
        ws['H3'].fill = PatternFill(start_color='D9D9D9', end_color='D9D9D9', fill_type='solid')  # Light gray fill
        ws['G4'].fill = PatternFill(start_color='FFFFFF', end_color='FFFFFF', fill_type='solid')  # White fill
        ws['H4'].fill = PatternFill(start_color='FFFFFF', end_color='FFFFFF', fill_type='solid')  # White fill

        # Apply borders to clan information
        for cell in info_cells:
            ws[cell].border = border

        # Align text in Column G to the right and text in Column H to the left within the clan information
        for row in ws.iter_rows(min_row=2, max_row=4, min_col=7, max_col=8):
            for cell in row:
                if cell.column == 7:  # Column G
                    cell.alignment = Alignment(horizontal='right', vertical='center')
                else:  # Column H
                    cell.alignment = Alignment(horizontal='left', vertical='center')

# Run the asyncio event loop
asyncio.run(main())

# Remove the default sheet
default_sheet = wb['Sheet']
wb.remove(default_sheet)

# Save the workbook
wb.save('clan_data.xlsx')

print("Excel file generated successfully.")

Second script: player_war_scraper.py

import sys
import time
import io
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Placeholder value for the number of rows of information to generate
num_rows = 12

def scrape_clan_wars_info(player_tag):
    # Construct the URL with the player tag variable (URL redacted here; the real one interpolates player_tag)
    url = f"EXAMPLE URL"

    # Configure Chrome options for headless mode and disable image loading
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run Chrome in headless mode
    chrome_options.add_argument("--disable-gpu")  # Disable GPU acceleration
    chrome_options.add_argument("--disable-infobars")  # Disable info bars
    chrome_options.add_argument("--disable-dev-shm-usage")  # Disable /dev/shm usage
    chrome_options.add_argument("--no-sandbox")  # Disable sandbox mode
    chrome_options.add_argument("--disable-extensions")  # Disable extensions
    chrome_options.add_argument("--disable-browser-side-navigation")  # Disable browser side navigation
    chrome_options.add_argument("--disable-features=VizDisplayCompositor")  # Disable viz display compositor
    chrome_options.add_argument("--start-maximized")  # Start maximized
    chrome_options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # Disable image loading

    # Initialize Chrome WebDriver with configured options
    driver = webdriver.Chrome(options=chrome_options)

    # Set implicit wait time
    driver.implicitly_wait(10)

    # Open the webpage
    driver.get(url)

    try:
        # Switch to the iframe
        driver.switch_to.frame("sp_message_iframe_1104950")

        # Switch to the document within the iframe
        iframe_document = driver.find_element(By.TAG_NAME, "html")
        driver.switch_to.frame(iframe_document)

        # Locate the "Accept" button within the dropdown
        accept_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Accept')]")

        # Click the "Accept" button
        accept_button.click()

        # Switch back to the default content
        driver.switch_to.default_content()

        # Click the button using JavaScript
        load_button = driver.find_element(By.CLASS_NAME, "cw2_history_button")
        driver.execute_script("arguments[0].click();", load_button)

        # Wait for the table to be loaded
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "player__cw2_history_table")))

        # Get the table headers with their respective data content
        headers = [th.get_attribute("data-content") for th in table.find_elements(By.TAG_NAME, "th")]

        # Specify the columns to select
        selected_columns = ['Season ID', 'Date', 'Clan Name', 'Decks Used', 'Fame']

        # Get the table data
        table_html = table.get_attribute('outerHTML')

        # Read the HTML into a DataFrame (wrapped in StringIO so pandas doesn't warn about literal HTML strings)
        df = pd.read_html(io.StringIO(table_html))[0]

        # Rename the columns with their respective data content
        df.columns = headers

        # Select only the specified columns
        df_selected_columns = df[selected_columns]

        # Select only the specified number of rows
        df_selected_rows = df_selected_columns.head(num_rows)

        # Export the DataFrame to an Excel file
        file_name = f"Clan_Wars_2_History_Selected_{player_tag}.xlsx"
        df_selected_rows.to_excel(file_name, index=False)

        print(f"Selected {num_rows} rows of Clan Wars 2 history with specified columns successfully exported to {file_name}")

    finally:
        # Close the WebDriver
        driver.quit()

# Check if the player tag is provided as a command-line argument
if len(sys.argv) != 2:
    print("Usage: python player_war_scraper.py <player_tag>")
    sys.exit(1)

player_tag = sys.argv[1]
scrape_clan_wars_info(player_tag)

As I said before, these scripts are 90% ChatGPT-generated, and I'm aware the code could probably be shortened or that there are better ways to output some of this. For now, though, I'm just trying to solve the problem I've described.

I've already asked ChatGPT the same thing, but it output each player tag as a separate Excel file, without even sorting them by clan or anything.

python selenium-webdriver screen-scraping export-to-excel
1 Answer

To extract the war stats with Selenium, first make sure you are targeting the right elements, i.e. the ones that only load after the JavaScript has run. Use Selenium's WebDriverWait together with expected conditions to wait for those elements to appear, then extract the data and append it to your existing Excel file structure. Since you're new to Python, start with the basic Selenium lookups such as find_element(By.ID, ...) and find_element(By.XPATH, ...) to locate the data you need. For the Excel part, pandas makes managing and exporting the data very convenient. Break the problem down into smaller tasks and solve them one at a time. Happy scraping!
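
As a rough sketch of those two pieces put together (the table class and the clan_data.xlsx file name come from your own scripts; the URL and sheet name are placeholders):

import io

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("EXAMPLE URL")  # placeholder, as in your script

# Wait until the JavaScript-rendered table is actually visible before reading it
table = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "player__cw2_history_table"))
)
df = pd.read_html(io.StringIO(table.get_attribute("outerHTML")))[0]
driver.quit()

# Append the stats as an extra sheet to the workbook your first script saves
with pd.ExcelWriter("clan_data.xlsx", mode="a", engine="openpyxl",
                    if_sheet_exists="new") as writer:
    df.to_excel(writer, sheet_name="war_stats", index=False)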
