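I am using Selenium WebDriver to scrape data from a YouTube channel page, but I have run into a scrolling problem. YouTube seems to load more videos as the page is scrolled down, and my code stops after processing the first 30 videos because it never scrolls far enough to load more. How can I make Selenium scroll a YouTube page correctly for web scraping?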
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Function to scroll to the bottom of the page using JavaScript
def scroll_to_bottom(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Function to extract video info
def extract_video_info(driver, channel_url):
    # Maximize the browser window
    driver.maximize_window()

    # Open the YouTube channel's video page
    driver.get(channel_url)

    # Set implicit wait and page load timeout
    driver.implicitly_wait(30)
    driver.set_page_load_timeout(30)

    video_info_list = []

    # Counter to keep track of processed videos
    video_count = 0

    # Find all video elements
    video_links = driver.find_elements(By.XPATH, '//a[@id="video-title-link"]')

    # Iterate through each video and extract information
    for video_link in video_links:
        video_info = {}

        try:
            # Get the video link's href attribute
            video_href = video_link.get_attribute("href")

            # Open the video link in a new tab
            driver.execute_script(f"window.open('{video_href}','_blank');")

            # Switch to the new tab
            driver.switch_to.window(driver.window_handles[1])

            # Wait for the video description to load
            wait = WebDriverWait(driver, 20)
            wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="description"]/yt-formatted-string')))

            # Click on the video description to expand it
            description = driver.find_element(By.XPATH, '//*[@id="description"]/yt-formatted-string')
            driver.execute_script("arguments[0].click();", description)

            # Wait for the likes button and info container to load
            wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="segmented-like-button"]/ytd-toggle-button-renderer/yt-button-shape/button')))
            wait.until(EC.presence_of_element_located((By.XPATH, '//div[@id="info-container"]')))

            # Extract information
            likes = driver.find_element(By.XPATH, '//*[@id="segmented-like-button"]/ytd-toggle-button-renderer/yt-button-shape/button').text.strip()
            info_container = driver.find_element(By.XPATH, '//div[@id="info-container"]')
            info_text = info_container.text

            # Extract views
            views = info_text.split(" views")[0].strip()

            # Extract date if it exists, else set to "Not found"
            date = "Not found"
            date_split = info_text.split(" views ")
            if len(date_split) > 1:
                date = date_split[1].split("\n")[0].strip()

            # Add extracted information to the video_info dictionary
            video_info['Likes'] = likes
            video_info['Views'] = views
            video_info['Date'] = date

            # Append video_info to the video_info_list
            video_info_list.append(video_info)

            # Close the current tab and switch back to the original tab
            driver.close()
            driver.switch_to.window(driver.window_handles[0])

            # Increment the video counter
            video_count += 1

            if video_count % 30 == 0:
                scroll_to_bottom(driver)

        except Exception as e:
            print(f"Error extracting video info: {e}")

    # Close the WebDriver when done (outside of the loop)
    driver.quit()

    return video_info_list

edge_driver_path = r'C:/Users/hp/msedgedriver.exe'
driver = webdriver.Edge(executable_path=edge_driver_path)
channel_url = "https://www.youtube.com/@Moosashi/videos"
video_info_list = extract_video_info(driver, channel_url)

for i, video_info in enumerate(video_info_list):
    print(f"Video {i+1}:")
    print(f"Likes: {video_info['Likes']}")
    print(f"Views: {video_info['Views']}")
    print(f"Date: {video_info['Date']}")
    print("\n---\n")
My output stops here:
Video 28: Likes: 41 Views: 1.1K Date: 5 months ago #sqlserver
Video 29: Likes: 41 Views: 766 Date: 5 months ago #ai #BardVsChatgpt #bardai
Video 30: Likes: 79 Views: 1.7K Date: 5 months ago Amazon Web Services (AWS) for Data Science
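The code does not continue past 30 videos, and I cannot use the API. I would also appreciate working code for getting the full video description by clicking on it: the date and views come out correctly, but the description part does not work. Failing that, I would be grateful just to have the scrolling problem solved so that the driver keeps fetching video data. My task is to read the data for every video on the channel from September 10, 2023 back to September 10, 2021.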
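I think you will run into a few problems with your code, but to answer your original question: on YouTube the height of document.body appears to be 0, and the entire page content sits inside a child element, <ytd-app>, which is absolutely positioned and essentially acts as the body element.
To fix this, replace the scroll script in your scroll_to_bottom() function with window.scrollTo(0, document.documentElement.scrollHeight)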
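For reference, this is roughly what the changed function could look like. It is a minimal sketch that keeps the scroll_to_bottom name from the question and only swaps the element whose scrollHeight is read. It assumes driver is any Selenium WebDriver instance.

```python
def scroll_to_bottom(driver):
    """Scroll the window to the bottom of the page.

    document.documentElement is the root <html> element. Its scrollHeight
    is used here because document.body reportedly has a height of 0 on
    YouTube, which would make the scroll a no-op.
    """
    driver.execute_script(
        "window.scrollTo(0, document.documentElement.scrollHeight);"
    )
```

Note that this only scrolls once. YouTube needs a moment to load the next batch of videos after each scroll, so the caller still has to wait and re-query the page.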
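Another problem I see: assuming you want to open every video on the channel page, you need to loop over the videos that get loaded after scrolling. Right now your code calls find_elements once, before the loop, iterates over that fixed list, and only scrolls inside the loop when the counter reaches a multiple of 30.
So you never do anything with the data that gets loaded by scrolling to the bottom.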
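One way to restructure this is to scroll first, until no new videos appear, and only then collect the links. The sketch below is an illustration under assumptions, not a tested YouTube scraper: load_all_video_links, pause, and max_rounds are hypothetical names, the XPath is the one from the question, and the stop condition (link count unchanged after one pause) can stop too early on a slow connection.

```python
import time

# Same locator as in the question. The string "xpath" is the value of
# selenium.webdriver.common.by.By.XPATH, so By.XPATH works here as well.
VIDEO_LINK_XPATH = '//a[@id="video-title-link"]'
SCROLL_SCRIPT = "window.scrollTo(0, document.documentElement.scrollHeight);"


def load_all_video_links(driver, pause=2.0, max_rounds=100):
    """Scroll until the number of video links stops growing, then return hrefs.

    pause      -- seconds to wait after each scroll for new videos to load
    max_rounds -- safety cap on the number of scrolls (hypothetical default)
    """
    seen = 0
    for _ in range(max_rounds):
        driver.execute_script(SCROLL_SCRIPT)
        time.sleep(pause)
        count = len(driver.find_elements("xpath", VIDEO_LINK_XPATH))
        if count == seen:
            # Nothing new was loaded by the last scroll; assume we hit the end.
            break
        seen = count

    # Re-query once at the end and keep plain strings rather than elements.
    links = driver.find_elements("xpath", VIDEO_LINK_XPATH)
    return [link.get_attribute("href") for link in links]
```

Returning the href strings instead of the elements means the per-video loop can call driver.get(href) or open a new tab for each one without depending on elements that were located before the page changed.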
Hope this helps.