Beginner using Selenium and Python to scrape links, text, and images from multiple web pages and store them in Excel

Question (votes: 0, answers: 1)

This is the code I wrote:

from selenium import webdriver
import pandas as pd
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

# List of URLs to scrape
urls = ["https://www.monolithai.com/blog/4-ways-ai-is-changing-the-packaging-industry"
,"https://mitsubishisolutions.com/the-role-of-artificial-intelligence-in-smart-packaging-lines"
, "https://thedatascientist.com/how-artificial-intelligence-is-revolutionizing-the-packaging-industry/"
," https://packagingeurope.com/comment/ai-and-the-future-of-packaging/9665.article"]

# Initialize the WebDriver
driver = webdriver.Chrome()  # Use appropriate WebDriver for your browser
wait = WebDriverWait(driver,10)

# Initialize empty lists to store scraped data

all_text = []
all_images = []
all_links = []

# Iterate over each URL and scrape text, images, and links
for url in urls:
    driver.get(url)

    body= wait.until(EC.presence_of_element_located((By.TAG_NAME,'body')))
    
# Scrape text
    page_text = driver.find_element_by_tag_name('body').text
    all_text.append(page_text)
    
# Scrape images
    images = driver.find_elements_by_tag_name('img')
    image_urls = [img.get_attribute('src') for img in images]
    all_images.append(image_urls)
    
# Scrape links
    links = driver.find_elements_by_tag_name('a')
    link_urls = [link.get_attribute('href') for link in links]
    all_links.append(link_urls)

# Close the WebDriver when finished 
#driver.quit() 

# Create a DataFrame from the scraped data
data = {
    'URL': urls,
    'Text': all_text,
    'Images': all_images,
    'Links': all_links
}
df = pd.DataFrame(data)

# Save the DataFrame to an Excel file
df.to_excel('scraped_data.xlsx', index=False)

It produces the following error:

DevTools listening on ws://127.0.0.1:56991/devtools/browser/8be11b91-e7ec-4f18-949e-7319a4341af5
Traceback (most recent call last):
  File "c:\Users\PRADEEP BIRARE\Desktop\web3.py", line 29, in <module>
    page_text = driver.find_element_by_tag_name('body').text
AttributeError: 'WebDriver' object has no attribute 'find_element_by_tag_name'
PS C:\Users\PRADEEP BIRARE>

python pandas excel selenium-webdriver web-scraping
1 answer
0 votes

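The error occurs because the find_element_by_* and find_elements_by_* methods were deprecated and then removed in Selenium 4.3.0, so the WebDriver object no longer has them.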

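Fix: replace them with the generic locator calls. Use find_element(By.TAG_NAME, 'body') for the text, and likewise find_elements(By.TAG_NAME, 'img') and find_elements(By.TAG_NAME, 'a') for the images and links. The question's code already imports By, so no new import is needed.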

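Also consider calling driver.quit() once scraping is finished so the browser is closed. In the question's code that call is commented out.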
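One way to make sure the browser is closed even when a page fails is to wrap the loop in try/finally. This is only a sketch: scrape_all and scrape_one are hypothetical names, and scrape_one stands for any function that takes (driver, url) and returns the scraped data for that page.

```python
def scrape_all(driver, urls, scrape_one):
    """Call scrape_one(driver, url) for each URL and always quit the driver afterwards.

    scrape_all and scrape_one are hypothetical names used for illustration.
    """
    results = []
    try:
        for url in urls:
            results.append(scrape_one(driver, url))
    finally:
        # quit() closes all browser windows and ends the WebDriver session,
        # and the finally block runs even if one of the pages raised an exception
        driver.quit()
    return results
```

With the question's variables this might be called as scrape_all(driver, urls, some_function), after which driver should not be reused because its session has ended.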
