Selenium WebDriver cannot scroll a YouTube page for web scraping

Question

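I am using Selenium WebDriver to scrape data from a YouTube channel page, but I have run into a scrolling problem. After 30 videos have been processed, YouTube seems to load more videos as the page is scrolled down. However, my code does not scroll automatically to load more videos. How can I make Selenium scroll the YouTube page correctly for web scraping?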

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Function to scroll to the bottom of the page using JavaScript
def scroll_to_bottom(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Function to extract video info
def extract_video_info(driver, channel_url):
    # Maximize the browser window
    driver.maximize_window()

    # Open the YouTube channel's video page
    driver.get(channel_url)

    # Set implicit wait and page load timeout
    driver.implicitly_wait(30)
    driver.set_page_load_timeout(30)

    video_info_list = []

    # Counter to keep track of processed videos
    video_count = 0

    # Find all video elements
    video_links = driver.find_elements(By.XPATH, '//a[@id="video-title-link"]')

    # Iterate through each video and extract information
    for video_link in video_links:
        video_info = {}
        
        try:
            # Get the video link's href attribute
            video_href = video_link.get_attribute("href")
            
            # Open the video link in a new tab
            driver.execute_script(f"window.open('{video_href}','_blank');")
            
            # Switch to the new tab
            driver.switch_to.window(driver.window_handles[1])
            
            # Wait for the video description to load
            wait = WebDriverWait(driver, 20)
            wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="description"]/yt-formatted-string')))
            
            # Click on the video description to expand it
            description = driver.find_element(By.XPATH, '//*[@id="description"]/yt-formatted-string')
            driver.execute_script("arguments[0].click();", description)
            
            # Wait for the likes button and info container to load
            wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="segmented-like-button"]/ytd-toggle-button-renderer/yt-button-shape/button')))
            wait.until(EC.presence_of_element_located((By.XPATH, '//div[@id="info-container"]')))
            
            # Extract information
            likes = driver.find_element(By.XPATH, '//*[@id="segmented-like-button"]/ytd-toggle-button-renderer/yt-button-shape/button').text.strip()
            info_container = driver.find_element(By.XPATH, '//div[@id="info-container"]')
            info_text = info_container.text
            
            # Extract views
            views = info_text.split(" views")[0].strip()
            
            # Extract date if it exists, else set to "Not found"
            date = "Not found"
            date_split = info_text.split(" views ")
            if len(date_split) > 1:
                date = date_split[1].split("\n")[0].strip()
            
            # Add extracted information to the video_info dictionary
            video_info['Likes'] = likes
            video_info['Views'] = views
            video_info['Date'] = date
            
            # Append video_info to the video_info_list
            video_info_list.append(video_info)
            
            # Close the current tab and switch back to the original tab
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
            
            # Increment the video counter
            video_count += 1
            
            if video_count % 30 == 0:
                scroll_to_bottom(driver)
            
        except Exception as e:
            print(f"Error extracting video info: {e}")
    
    # Close the WebDriver when done (outside of the loop)
    driver.quit()

    return video_info_list

edge_driver_path = r'C:/Users/hp/msedgedriver.exe'

driver = webdriver.Edge(executable_path=edge_driver_path)

channel_url = "https://www.youtube.com/@Moosashi/videos"
video_info_list = extract_video_info(driver, channel_url)

for i, video_info in enumerate(video_info_list):
    print(f"Video {i+1}:")
    print(f"Likes: {video_info['Likes']}")
    print(f"Views: {video_info['Views']}")
    print(f"Date: {video_info['Date']}")
    print("\n---\n")

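My output stops here:

Video 28:
Likes: 41
Views: 1.1K
Date: 5 months ago #sqlserver

Video 29:
Likes: 41
Views: 766
Date: 5 months ago #ai #BardVsChatgpt #bardai

Video 30:
Likes: 79
Views: 1.7K
Date: 5 months ago Amazon Web Services (AWS) for Data Science

The code does not continue after 30 videos. I cannot use the API. Also, I would appreciate working code that clicks the video description and retrieves the full description text. The date and views are extracted correctly, but that part of the code does not work. Otherwise, I would be grateful if the scrolling problem could be solved so that the driver keeps fetching video data. My task is to read the data for all videos on the channel published between September 10, 2021 and September 10, 2023.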

Answer 1

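I think your code will run into several problems, but to answer your original question: on YouTube, the height of document.body appears to be 0, and the entire page content lives inside a child element, <ytd-app>, which is absolutely positioned and essentially acts as the body element.

To fix this, replace the scroll script in the scroll_to_bottom() function with:

window.scrollTo(0, document.documentElement.scrollHeight)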
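As a rough sketch of how the corrected function might look, the version below scrolls using document.documentElement.scrollHeight and also reports whether the page grew afterwards. The pause parameter, the two script constants, and the boolean return value are additions made here for illustration and are not part of the original code; the fixed sleep assumes new videos load within that time.

```python
import time

# Scroll using the root element's height, since document.body reports 0 on YouTube.
SCROLL_SCRIPT = "window.scrollTo(0, document.documentElement.scrollHeight);"
HEIGHT_SCRIPT = "return document.documentElement.scrollHeight;"


def scroll_to_bottom(driver, pause=2.0):
    """Scroll to the bottom once; return True if the page height grew.

    `pause` (hypothetical parameter) is how long to wait for lazy-loaded
    content before measuring the height again.
    """
    old_height = driver.execute_script(HEIGHT_SCRIPT)
    driver.execute_script(SCROLL_SCRIPT)
    time.sleep(pause)  # crude wait; assumes new items load within `pause` seconds
    new_height = driver.execute_script(HEIGHT_SCRIPT)
    return new_height > old_height
```

A return value of False suggests that no further content was loaded, which a caller could use as a stopping condition.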

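Another problem I see: assuming you are trying to open every video on the channel page, you need to loop over the videos that are loaded after scrolling. Right now your code does this:

Get the channel page
Get the first batch of 30 loaded video links
Loop over each link and scrape the data
If the number of videos scraped is a multiple of 30, scroll to the bottom of the page
Exit the loop
Exit the application

So you never do anything with the data that is loaded by scrolling to the bottom.

Hope this helps.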
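One possible way to restructure the flow described above is to collect every video link first, scrolling and re-querying the page until no new links appear, and only then visit each video. The sketch below is an assumption about how that could be done, not the asker's or answerer's actual code: the function name collect_video_hrefs and the pause and max_scrolls parameters are hypothetical, and the fixed sleep assumes each new batch loads within that time.

```python
import time

VIDEO_LINK_XPATH = '//a[@id="video-title-link"]'  # same locator as in the question
SCROLL_SCRIPT = "window.scrollTo(0, document.documentElement.scrollHeight);"


def collect_video_hrefs(driver, pause=2.0, max_scrolls=200):
    """Scroll the channel page until no new video links show up.

    Returns the unique hrefs in the order they were first seen.
    `pause` and `max_scrolls` are hypothetical tuning knobs.
    """
    hrefs = []
    seen = set()
    for _ in range(max_scrolls):
        # "xpath" is the string value of Selenium's By.XPATH constant.
        links = driver.find_elements("xpath", VIDEO_LINK_XPATH)
        new_found = False
        for link in links:
            href = link.get_attribute("href")
            if href and href not in seen:
                seen.add(href)
                hrefs.append(href)
                new_found = True
        if not new_found:
            break  # the last scroll loaded nothing new
        driver.execute_script(SCROLL_SCRIPT)
        time.sleep(pause)  # crude wait for the next batch to load
    return hrefs
```

With the links collected up front as plain strings, the scraping loop can iterate over the returned list and open each href, instead of iterating over the element list captured before any scrolling happened.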
