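Hello everyone.
I want the script to go to the page entered in select_page, scroll to the end of it, check that it really is the end, then download the images, and then either move on to the next page or finish (if end_page matches the current page), with all of this running in a loop.
On the first page everything works, but on the second page it skips the scrolling part and starts downloading right away, or it simply clicks the button that leads to the next page.
Please help.
Here is the code: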
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import ImageReader
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from io import BytesIO
from time import sleep
from requests import get
from reportlab.pdfgen import canvas
from PIL import Image as PilImage

url = 'https://honey-manga.com.ua'


def download_images(name_manga, select_page, end_page):
    # Create a PDF file
    pdf_file = f'{name_manga}.pdf'
    c = canvas.Canvas(pdf_file, pagesize=letter)
    # Initialize the webdriver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Activate headless mode
    driver = webdriver.Chrome()
    # Navigate to the webpage
    driver.get(url)
    # Find and click the search button
    button_search = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//div[@class='flex items-center gap-x-4']"))
    )
    button_search = button_search.find_element(By.XPATH, "./div/button")
    button_search.click()
    # Enter the manga name into the search input field
    search = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "input"))
    )
    search.send_keys(name_manga)
    # Click on the search result
    button_res = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//a[@class='flex gap-x-4']"))
    )
    button_res.click()
    # Click on the read button
    button_read = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//div[@class='mt-6 md:flex md:items-center max-md:grid max-md:grid-cols-12 gap-x-2']"))
    )
    button_read = button_read.find_element(By.XPATH, "./button[2]")
    button_read.click()
    # Navigate to the selected page if specified
    if select_page != '':
        while True:
            list = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//div[@class='flex items-center justify-center md:gap-x-2']"))
            )
            page = list.find_element(By.XPATH, "./button")
            page = str(page.text)
            if page:
                page = page[page.index("-"):]
                page = int(''.join(filter(str.isdigit, page)))
                print(page)
                if page == int(select_page):
                    print("Go")
                    break
                else:
                    try:
                        button = list.find_element(By.XPATH, "./a[2]")
                        button.click()
                    except Exception:
                        # If the "Next" button is not found, exit the loop
                        break
    # Main loop to scroll through and capture images
    while True:
        step = 1000  # Set the desired scroll step size
        current_scroll_position = 0
        # Scroll through the page until reaching the bottom
        while True:
            page_height = driver.execute_script("return document.body.scrollHeight;")
            # Scroll the page
            for i in range(current_scroll_position, page_height, step):
                driver.execute_script(f"window.scrollTo(0, {i});")
                sleep(0.1)
                current_scroll_position = i
            # Check if the entire page has been scrolled
            new_page_height = driver.execute_script("return document.body.scrollHeight;")
            if new_page_height == page_height:
                break
        sleep(5)  # Wait for a few seconds for the page to load completely
        # Find all images on the page
        images = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//div[@class='md:container']"))
        )
        images = images.find_elements(By.XPATH, ".//img")
        # Download and add images to the PDF
        for img in images:
            img_url = img.get_attribute('src')
            if img_url:
                try:
                    response = get(img_url)
                    img_data = BytesIO(response.content)
                    pil_image = PilImage.open(img_data)
                    img_width, img_height = pil_image.size
                    pdf_width, pdf_height = img_width, img_height  # Set PDF page size based on image size
                    c.setPageSize((pdf_width, pdf_height))  # Set the PDF page size
                    c.drawImage(ImageReader(img_data), 0, 0, pdf_width, pdf_height)  # Add image to PDF
                    c.showPage()  # Add a new page to the PDF
                except Exception as e:
                    print(f"Error adding image to PDF: {e}")
        # Scroll back to the top of the page
        driver.execute_script(f"window.scrollTo(0, 0);")
        # Check if the current page matches the end_page
        list = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='flex items-center justify-center md:gap-x-2']"))
        )
        page = list.find_element(By.XPATH, "./button")
        page = page.text
        page = page[page.index("-"):]
        page = int(''.join(filter(str.isdigit, page)))
        print(page)
        if page == int(end_page):
            print("Session ended page")
            break
        # Click on the next page button
        list = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='flex items-center justify-center md:gap-x-2']"))
        )
        sleep(5)
        button = list.find_element(By.XPATH, "./a[2]")
        button.click()
    # Save the PDF file
    c.save()
    # Quit the webdriver
    driver.quit()
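There are a few points in your code that could be corrected to improve its reliability and performance, especially around page navigation and dynamically loaded elements. Here is how the Python/Selenium script can be adjusted: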
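WebDriver initialization with options: you configure options for the Chrome driver but never apply them when the driver is created. Pass them to webdriver.Chrome() so that headless mode (or any other browser option) actually takes effect.
driver = webdriver.Chrome(options=options)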
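Managing waits and delays: prefer WebDriverWait with expected conditions (such as an element becoming visible or clickable) over sleep(), which always blocks for a fixed time. Explicit waits return as soon as the condition holds, so the script runs faster and is less prone to timing errors.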
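Applied to the symptom in the question, one option is to wait after clicking the next-page link until the page indicator actually shows a different number, and only then start scrolling. The following is a minimal sketch, not a tested fix for this site. parse_page_number and page_number_changed are hypothetical helper names, the XPath is copied from the question, and the label format (page number after a "-") is inferred from the question's slicing logic.

```python
# Hypothetical helpers; the XPath is taken from the question's code.
PAGER_BUTTON_XPATH = (
    "//div[@class='flex items-center justify-center md:gap-x-2']/button"
)


def parse_page_number(label):
    """Return the digits after the first '-' in the label as an int, or None."""
    _, dash, tail = label.partition("-")
    if not dash:
        return None  # label not rendered yet, or in an unexpected format
    digits = "".join(ch for ch in tail if ch.isdigit())
    return int(digits) if digits else None


class page_number_changed:
    """Wait condition: truthy once the pager shows a number other than old_page.

    Assumes page numbers start at 1, so a valid result is never falsy.
    """

    def __init__(self, old_page):
        self.old_page = old_page

    def __call__(self, driver):
        # "xpath" is the string value of selenium's By.XPATH.
        label = driver.find_element("xpath", PAGER_BUTTON_XPATH).text
        page = parse_page_number(label)
        if page is None or page == self.old_page:
            return False  # keep polling
        return page
```

In the question's script this would replace the sleep(5) around the click: call WebDriverWait(driver, 10).until(page_number_changed(page)) right after button.click(). WebDriverWait polls the callable until it returns a truthy value or the timeout expires. If the pager is re-rendered during navigation, passing ignored_exceptions=(StaleElementReferenceException,) to WebDriverWait may also be needed.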
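XPath validity: make sure the XPaths used to locate elements are correct and robust. Page elements can change over time or behave differently from page to page, so check these XPaths regularly and update them when needed.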
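The XPaths in the question compare the whole class attribute as one string (for example @class='flex gap-x-4'), which stops matching as soon as the site adds, removes, or reorders a class. A possible way to loosen this is to test for individual class tokens. The helper below, xpath_with_classes, is a hypothetical name used only for illustration, and whether these particular class names stay stable on the site is an assumption.

```python
def xpath_with_classes(tag, *classes):
    """Build an XPath for <tag> elements whose class attribute contains
    every given class token, in any order and alongside other classes.

    Class names must not contain a single quote.
    """
    if not classes:
        return f"//{tag}"
    tests = " and ".join(
        f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
        for name in classes
    )
    return f"//{tag}[{tests}]"


# The question locates the search result with //a[@class='flex gap-x-4'].
print(xpath_with_classes("a", "flex", "gap-x-4"))
```

Padding the attribute with spaces before calling contains() prevents "flex" from also matching a class such as "inline-flex". If the site offers stable ids or data attributes, those are usually a better anchor than styling classes.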
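Image loading and PDF creation: the part that downloads the images and adds them to the PDF appears to work as intended. Still, make sure exceptions are handled in a way that tells you which image failed and why when one cannot be added to the PDF.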
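As a sketch of what more informative error handling could look like, the loop below records which URL failed and with which exception type, then carries on with the remaining images. fetch_all is a hypothetical helper, and the fetch function is passed in as a parameter so that the pattern can be shown without network access.

```python
def fetch_all(urls, fetch):
    """Call fetch(url) for each URL and return (results, failures).

    results  -- list of (url, data) pairs for URLs that succeeded
    failures -- list of (url, "ExceptionType: message") pairs
    """
    results, failures = [], []
    for url in urls:
        try:
            results.append((url, fetch(url)))
        except Exception as exc:  # broad on purpose: record the failure, keep going
            failures.append((url, f"{type(exc).__name__}: {exc}"))
    return results, failures


# A fetch function for the question's script might look like this
# (assumes `from requests import get`, as in the question):
#
#     def fetch(url):
#         response = get(url, timeout=10)
#         response.raise_for_status()  # turn HTTP 4xx/5xx into an exception
#         return response.content
```

Printing the failures list at the end shows whether pages are missing from the PDF because of HTTP errors, timeouts, or images that PIL could not open.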
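Loop termination and driver shutdown: keep cleanup code at the end of the script that quits the driver. Quitting the driver properly closes the browser and avoids leaving orphaned browser processes behind.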
c.save()
driver.quit()
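Note that these two lines only run if nothing raises before them. One way to guarantee the cleanup, sketched here under the assumption that the driver is created through a factory function, is a small context manager built on try/finally. browser_session is a hypothetical name.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Yield a driver created by make_driver() and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the body raised


# Possible use in the question's script:
#
#     with browser_session(lambda: webdriver.Chrome(options=options)) as driver:
#         driver.get(url)
#         ...
```

Whether c.save() should also run after a failure is a design choice: saving in a finally block keeps the pages collected so far, at the cost of producing an incomplete PDF.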
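Finally, verify that the base URL you are using (url = 'https://honey-manga.com.ua') is correct and that the flow of your script matches the navigation and interaction pattern the site expects. If the site changes, the script may need to be updated to keep working.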
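On the interaction pattern specifically: a possible explanation for the skipped scrolling on the second page (a hypothesis, not verified against the site) is that the scroll loop compares document heights almost immediately, before the newly opened page has loaded any images. Both measurements are then equal and the loop exits at once. The sketch below waits for a settle period at the bottom before deciding the page is complete. scroll_to_bottom and its parameters are hypothetical, and the default timings are guesses.

```python
from time import sleep


def scroll_to_bottom(driver, step=1000, pause=0.1, settle=2.0, max_rounds=50):
    """Scroll down in steps until the document height stops growing.

    Returns the final document height. `settle` is how long to wait at the
    bottom before deciding that no more content is going to load.
    """
    position = 0
    height = driver.execute_script("return document.body.scrollHeight;")
    for _ in range(max_rounds):
        while position < height:
            position = min(position + step, height)
            driver.execute_script(f"window.scrollTo(0, {position});")
            sleep(pause)
        sleep(settle)  # give lazy-loaded content time to extend the page
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == height:
            break  # height stayed the same for the whole settle period
        height = new_height
    return height
```

Unlike the range() loop in the question, this always scrolls to the exact bottom of the page, and max_rounds bounds the loop on pages that keep growing.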