Selenium WebDriver cannot scroll a YouTube page for web scraping

Question

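I am using Selenium WebDriver to scrape data from a YouTube channel page, but I have run into a scrolling problem. YouTube appears to load more videos as the page is scrolled down, and after my script has processed the first 30 videos it does not scroll automatically to load the next batch. How can I make Selenium scroll the YouTube page correctly for scraping?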

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Function to scroll to the bottom of the page using JavaScript
def scroll_to_bottom(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Function to extract video info
def extract_video_info(driver, channel_url):
    # Maximize the browser window
    driver.maximize_window()

    # Open the YouTube channel's video page
    driver.get(channel_url)

    # Set implicit wait and page load timeout
    driver.implicitly_wait(30)
    driver.set_page_load_timeout(30)

    video_info_list = []

    # Counter to keep track of processed videos
    video_count = 0

    # Find all video elements
    video_links = driver.find_elements(By.XPATH, '//a[@id="video-title-link"]')

    # Iterate through each video and extract information
    for video_link in video_links:
        video_info = {}
        
        try:
            # Get the video link's href attribute
            video_href = video_link.get_attribute("href")
            
            # Open the video link in a new tab
            driver.execute_script(f"window.open('{video_href}','_blank');")
            
            # Switch to the new tab
            driver.switch_to.window(driver.window_handles[1])
            
            # Wait for the video description to load
            wait = WebDriverWait(driver, 20)
            wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="description"]/yt-formatted-string')))
            
            # Click on the video description to expand it
            description = driver.find_element(By.XPATH, '//*[@id="description"]/yt-formatted-string')
            driver.execute_script("arguments[0].click();", description)
            
            # Wait for the likes button and info container to load
            wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="segmented-like-button"]/ytd-toggle-button-renderer/yt-button-shape/button')))
            wait.until(EC.presence_of_element_located((By.XPATH, '//div[@id="info-container"]')))
            
            # Extract information
            likes = driver.find_element(By.XPATH, '//*[@id="segmented-like-button"]/ytd-toggle-button-renderer/yt-button-shape/button').text.strip()
            info_container = driver.find_element(By.XPATH, '//div[@id="info-container"]')
            info_text = info_container.text
            
            # Extract views
            views = info_text.split(" views")[0].strip()
            
            # Extract date if it exists, else set to "Not found"
            date = "Not found"
            date_split = info_text.split(" views ")
            if len(date_split) > 1:
                date = date_split[1].split("\n")[0].strip()
            
            # Add extracted information to the video_info dictionary
            video_info['Likes'] = likes
            video_info['Views'] = views
            video_info['Date'] = date
            
            # Append video_info to the video_info_list
            video_info_list.append(video_info)
            
            # Close the current tab and switch back to the original tab
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
            
            # Increment the video counter
            video_count += 1
            
            if video_count % 30 == 0:
                scroll_to_bottom(driver)
            
        except Exception as e:
            print(f"Error extracting video info: {e}")
    
    # Close the WebDriver when done (outside of the loop)
    driver.quit()

    return video_info_list

edge_driver_path = r'C:/Users/hp/msedgedriver.exe'

driver = webdriver.Edge(executable_path=edge_driver_path)

channel_url = "https://www.youtube.com/@Moosashi/videos"
video_info_list = extract_video_info(driver, channel_url)

for i, video_info in enumerate(video_info_list):
    print(f"Video {i+1}:")
    print(f"Likes: {video_info['Likes']}")
    print(f"Views: {video_info['Views']}")
    print(f"Date: {video_info['Date']}")
    print("\n---\n")

My output stops here:

Video 28:
Likes: 41
Views: 1.1K
Date: 5 months ago #sqlserver

Video 29:
Likes: 41
Views: 766
Date: 5 months ago #ai #BardVsChatgpt #bardai

Video 30:
Likes: 79
Views: 1.7K
Date: 5 months ago Amazon Web Services (AWS) for Data Science

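The code does not continue past 30 videos, and I cannot use the API. I would also appreciate correct code for clicking the video description to expand it: the date and views are extracted correctly, but the description click does not work. Failing that, I would be grateful just to have the scrolling problem solved so that the driver keeps collecting video data. My task is to read the data for every video on the channel published between September 10, 2021 and September 10, 2023.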

python selenium-webdriver web-scraping youtube webdriver
1 Answer

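I think your code will run into a few problems, but to answer your original question: on YouTube, the height of document.body appears to be 0, and the entire page content lives inside a child element, <ytd-app>, which is absolutely positioned and essentially plays the role of the body element. As a result, document.body.scrollHeight does not give window.scrollTo a useful target.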

To fix this, replace the scroll JS script in your scroll_to_bottom() function with

window.scrollTo(0, document.documentElement.scrollHeight)
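As a minimal sketch of that change, the function below keeps the name scroll_to_bottom from the question and only swaps the height source. The second function, page_heights, is a hypothetical helper added here for illustration; it may be handy for checking on your own machine whether the body height really is 0 on the page you are scraping.

```python
def scroll_to_bottom(driver):
    """Scroll the window to the bottom of the page.

    Uses the root element's scrollHeight instead of document.body's,
    since the answer above reports that document.body has height 0 on YouTube.
    """
    driver.execute_script(
        "window.scrollTo(0, document.documentElement.scrollHeight);"
    )


def page_heights(driver):
    """Hypothetical diagnostic helper (not part of the original question).

    Returns (body_height, root_height) so the two values can be compared.
    """
    body_height, root_height = driver.execute_script(
        "return [document.body.scrollHeight,"
        " document.documentElement.scrollHeight];"
    )
    return body_height, root_height
```

Both functions only need an object with an execute_script method, so they should work with the Edge driver created in the question.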

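Another problem I see: assuming you want to open every video on the channel page, you also need to loop over the videos that are loaded after scrolling. Right now, your code does the following:

  1. Get the channel page
  2. Get the first batch of 30 loaded video links
  3. Loop over each link and scrape its data
  4. If the total number of scraped videos % 30 == 0, scroll to the bottom of the page
  5. Exit the loop
  6. Quit the application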

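So you never do anything with the videos that are loaded by scrolling to the bottom: the list of links is collected once, before any scrolling happens, and is never refreshed.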
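One possible way to restructure the flow, offered as a rough sketch rather than a tested YouTube scraper: scroll first until the page height stops growing, then collect all the links, and only then visit each video. The helper names scroll_until_loaded and collect_video_hrefs, the pause length, and the max_rounds cap are all assumptions made for illustration; the XPath is the one from the question.

```python
import time

HEIGHT_JS = "return document.documentElement.scrollHeight"
SCROLL_JS = "window.scrollTo(0, document.documentElement.scrollHeight);"


def scroll_until_loaded(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    pause and max_rounds are assumed values; tune them for your connection.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script(HEIGHT_JS)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script(SCROLL_JS)
        time.sleep(pause)  # give the page time to load the next batch
        rounds += 1
        new_height = driver.execute_script(HEIGHT_JS)
        if new_height == last_height:
            break  # nothing new was loaded, assume we reached the end
        last_height = new_height
    return rounds


def collect_video_hrefs(driver, xpath='//a[@id="video-title-link"]'):
    """Return the unique video URLs currently on the page, in page order."""
    hrefs = []
    seen = set()
    # "xpath" is the string value of selenium's By.XPATH constant
    for link in driver.find_elements("xpath", xpath):
        href = link.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs
```

With these in place, extract_video_info could call scroll_until_loaded(driver) right after driver.get(channel_url), build its list with collect_video_hrefs(driver), and loop over the resulting URLs. Collecting plain href strings up front also avoids holding on to link elements while other tabs are opened and closed.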

Hope this helps.
