Python/Selenium code just skips some parts of the code


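Hello everyone.

I want the script to go to the page entered in select_page, scroll it to the end, check that this really is the end, then download the images, and then either go to the next page or finish (if end_page matches the current page), with all of this running in a loop.

On the first page everything works fine, but on the second page it skips the scrolling part and starts downloading right away, or simply clicks the button that leads to the next page.

Please help.

Here is the code: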

from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import ImageReader
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from io import BytesIO
from time import sleep
from requests import get
from reportlab.pdfgen import canvas
from PIL import Image as PilImage

url = 'https://honey-manga.com.ua'

def download_images(name_manga, select_page, end_page):
    # Create a PDF file
    pdf_file = f'{name_manga}.pdf'
    c = canvas.Canvas(pdf_file, pagesize=letter)
    # Initialize the webdriver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Activate headless mode
    driver = webdriver.Chrome()

    # Navigate to the webpage
    driver.get(url)
    
    # Find and click the search button
    button_search = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//div[@class='flex items-center gap-x-4']"))
    )
    button_search = button_search.find_element(By.XPATH, "./div/button")
    button_search.click()
    
    # Enter the manga name into the search input field
    search = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "input"))
    )
    search.send_keys(name_manga)
    
    # Click on the search result
    button_res = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//a[@class='flex gap-x-4']"))
    )
    button_res.click()
    
    # Click on the read button
    button_read = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//div[@class='mt-6 md:flex md:items-center max-md:grid max-md:grid-cols-12 gap-x-2']"))
    )
    button_read = button_read.find_element(By.XPATH, "./button[2]")
    button_read.click()
    
    # Navigate to the selected page if specified
    if select_page != '':
        while True:
            list = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//div[@class='flex items-center justify-center md:gap-x-2']"))
            )
            page = list.find_element(By.XPATH, "./button")
            page = str(page.text)
            if page:
                page = page[page.index("-"):]
                page = int(''.join(filter(str.isdigit, page)))
                print(page)
            if page == int(select_page):
                print("Go")
                break
            else:
                try:
                    button = list.find_element(By.XPATH, "./a[2]")
                    button.click()
                except Exception:
                    # If the "Next" button is not found, exit the loop
                    break
    
    # Main loop to scroll through and capture images
    while True:
        step = 1000  # Set the desired scroll step size
        current_scroll_position = 0
        
        # Scroll through the page until reaching the bottom
        while True:
            page_height = driver.execute_script("return document.body.scrollHeight;")
            # Scroll the page
            for i in range(current_scroll_position, page_height, step):
                driver.execute_script(f"window.scrollTo(0, {i});")
                sleep(0.1)
                current_scroll_position = i
            # Check if the entire page has been scrolled
            new_page_height = driver.execute_script("return document.body.scrollHeight;")
            if new_page_height == page_height:
                break
        
        sleep(5)  # Wait for a few seconds for the page to load completely
        
        # Find all images on the page
        images = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//div[@class='md:container']"))
        )
        images = images.find_elements(By.XPATH, ".//img")
        
        # Download and add images to the PDF
        for img in images:
            img_url = img.get_attribute('src')
            if img_url:
                try:
                    response = get(img_url)
                    img_data = BytesIO(response.content)
                    pil_image = PilImage.open(img_data)
                    img_width, img_height = pil_image.size
                    pdf_width, pdf_height = img_width, img_height  # Set PDF page size based on image size
                    c.setPageSize((pdf_width, pdf_height))  # Set the PDF page size
                    c.drawImage(ImageReader(img_data), 0, 0, pdf_width, pdf_height)  # Add image to PDF
                    c.showPage()  # Add a new page to the PDF
                except Exception as e:
                    print(f"Error adding image to PDF: {e}")
        
        # Scroll back to the top of the page
        driver.execute_script(f"window.scrollTo(0, 0);")
        
        # Check if the current page matches the end_page
        list = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='flex items-center justify-center md:gap-x-2']"))
        )
        page = list.find_element(By.XPATH, "./button")
        page = page.text
        page = page[page.index("-"):]
        page = int(''.join(filter(str.isdigit, page)))
        print(page)
        if page == int(end_page):
            print("Session ended page")
            break
        
        # Click on the next page button
        list = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='flex items-center justify-center md:gap-x-2']"))
        )
        sleep(5)
        button = list.find_element(By.XPATH, "./a[2]")
        button.click()
    
    # Save the PDF file
    c.save()
    
    # Quit the webdriver
    driver.quit()

python selenium-webdriver web-scraping pycharm
1 Answer

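There are a few points and corrections in your code that may help improve its reliability and performance, especially around page navigation and handling dynamic elements. Here is how to adjust and tune the Python/Selenium script: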

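WebDriver initialization with options: You start configuring options for the Chrome driver but never apply them when initializing the driver. Pass them when creating the webdriver.Chrome() instance so that headless mode and any other browser options actually take effect.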

driver = webdriver.Chrome(options=options)

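Managing waits and delays: Prefer WebDriverWait with expected conditions, which wait until a specific condition is met (such as an element being visible or clickable), over sleep(), which imposes a fixed delay. This makes the script more efficient and less prone to timing errors.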
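As a sketch of condition-based waiting applied to the scroll step: the helper below keeps scrolling until the page height has stopped growing for several consecutive checks, instead of trusting a single height reading. It assumes (this is not confirmed by the question) that the scroll is skipped on the second page because `document.body.scrollHeight` is read before the new page's images have rendered. `scroll_to_bottom` and its parameters are hypothetical names; `driver` only needs an `execute_script` method, as Selenium's WebDriver has.

```python
import time


def scroll_to_bottom(driver, step=1000, pause=0.1, stable_rounds=3, timeout=60.0):
    """Scroll down in steps until the page height stops growing.

    Hypothetical helper: `driver` is anything with execute_script(js).
    Returns the last page height that was read, in pixels.
    """
    deadline = time.monotonic() + timeout
    position = 0
    unchanged = 0  # consecutive checks in which the height did not change
    height = driver.execute_script("return document.body.scrollHeight;")

    while unchanged < stable_rounds and time.monotonic() < deadline:
        # Scroll from the current position to the bottom known so far.
        while position < height:
            position = min(position + step, height)
            driver.execute_script(f"window.scrollTo(0, {position});")
            time.sleep(pause)

        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == height:
            unchanged += 1
            time.sleep(pause)  # give lazily loaded content a chance to appear
        else:
            unchanged = 0
            height = new_height

    return height
```

In the original script this would replace the inner `while True:` scroll loop, for example `scroll_to_bottom(driver)`. Slow pages may need a larger `pause` or `stable_rounds`.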

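XPath validity: Make sure the XPaths used to locate elements are correct and robust. Web elements can change over time or behave differently on different pages, so check and update these XPaths as needed.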
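One way to make class-based XPaths less brittle, shown here as a minimal sketch: an expression such as `//div[@class='flex items-center gap-x-4']` only matches when the class attribute is exactly that string, so it breaks if the site adds or reorders a class. The hypothetical helper `xpath_with_classes` (not part of Selenium) builds an expression that matches each class token independently, using the common `concat`/`normalize-space` XPath idiom.

```python
def xpath_with_classes(tag, *class_names):
    """Build an XPath that matches `tag` elements carrying all given classes.

    Hypothetical helper: the match ignores class order and extra classes.
    """
    conditions = " and ".join(
        f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
        for name in class_names
    )
    return f"//{tag}[{conditions}]" if conditions else f"//{tag}"


# Example: a locator for the search result link used in the question.
result_link_xpath = xpath_with_classes("a", "flex", "gap-x-4")
```

The resulting string can be passed to Selenium as usual, for example `driver.find_element(By.XPATH, result_link_xpath)`.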

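Image loading and PDF creation: The part that downloads images and adds them to the PDF appears to work as intended. Still, make sure exceptions are handled properly so that you can tell what went wrong when an image cannot be added to the PDF.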
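A possible way to make failures visible, sketched under assumptions: instead of only printing inside the loop, collect every failed URL together with its reason and inspect the list afterwards. `download_images_with_report` and `fetch_with_requests` are hypothetical names; the second one uses `requests.get` with a timeout and `raise_for_status()`, so HTTP errors are reported instead of being written into the PDF as broken data.

```python
def download_images_with_report(urls, fetch):
    """Download each URL with fetch(url) -> bytes and record failures.

    Hypothetical helper. Returns (images, failures), where images is a list of
    (url, bytes) pairs and failures is a list of (url, reason) pairs.
    """
    images, failures = [], []
    for url in urls:
        if not url:
            failures.append((url, "missing src attribute"))
            continue
        try:
            images.append((url, fetch(url)))
        except Exception as exc:  # keep going, but remember what failed
            failures.append((url, f"{type(exc).__name__}: {exc}"))
    return images, failures


def fetch_with_requests(url, timeout=30):
    """Fetch one image; raises on network errors and on HTTP error statuses."""
    import requests  # imported lazily so the helper above has no dependencies

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content
```

In the original loop, the URLs would come from `img.get_attribute('src')`, and each returned byte string could be wrapped in `BytesIO` before being handed to PIL and ReportLab as before.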

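Loop termination and driver shutdown: It is good practice to include cleanup code at the end of the script that quits the driver. Quitting the driver properly prevents memory leaks and ensures the browser closes correctly.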

# Save the PDF file
c.save()

# Quit the webdriver
driver.quit()
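The two cleanup calls above only run if nothing raises before them. A minimal sketch of one way to guarantee them, assuming it is acceptable to save whatever pages were collected before an error: wrap the work in a context manager with `try`/`finally`. `managed_session` is a hypothetical name; it only relies on the `save()` and `quit()` methods that the ReportLab canvas and the Selenium driver already provide.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(driver, pdf_canvas):
    """Yield the driver, then always save the PDF and quit the browser.

    Hypothetical helper: cleanup runs even if the body raises an exception.
    """
    try:
        yield driver
    finally:
        try:
            pdf_canvas.save()  # keep the pages collected so far
        finally:
            driver.quit()  # runs even if saving the PDF fails
```

Usage would look like `with managed_session(driver, c): ...`, with the navigation and download loop placed inside the `with` block.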

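Finally, verify that the base URL you are using (url = 'https://honey-manga.com.ua') is correct and that your script's flow matches the site's expected navigation and interaction patterns. If the site changes, the script may need updating to keep working.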
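One way to check that the script's flow really follows the site's navigation, sketched under assumptions: after clicking the next-page link, wait until the page indicator shows a different number before scrolling again. The parsing mirrors the question's code (take the text from the first "-" onward and keep the digits). `parse_page_number`, `wait_for_page_change`, and the `read_label` callback are hypothetical names, and the label text used in the usage note is invented, since the real button text is not shown in the question.

```python
import time


def parse_page_number(label):
    """Return the number found after the first '-' in the label, or None."""
    tail = label[label.index("-"):] if "-" in label else label
    digits = "".join(ch for ch in tail if ch.isdigit())
    return int(digits) if digits else None


def wait_for_page_change(read_label, previous_page, timeout=10.0, poll=0.2):
    """Poll read_label() until it reports a page other than previous_page.

    Hypothetical helper: read_label is a zero-argument callable returning the
    indicator text. Raises TimeoutError if the page number never changes.
    """
    deadline = time.monotonic() + timeout
    while True:
        page = parse_page_number(read_label())
        if page is not None and page != previous_page:
            return page
        if time.monotonic() >= deadline:
            raise TimeoutError(f"page indicator still shows {previous_page}")
        time.sleep(poll)
```

With Selenium, `read_label` could be a small function that re-locates the indicator button on every call and returns its `.text`, so a stale element from the previous page is never reused. For an invented label such as "Chapter 1 - 12", `parse_page_number` returns 12.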
