Python > Selenium: Web scraping in a "logged-in" environment based on links in a text file

Question

Compatible with ChromeDriver

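The program attempts to do the following:

Log in to the website automatically;
Visit one or more links listed in a text file;
Scrape data from each page visited this way; and
Output all scraped data via print().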

Please skip to Part 2 for the problem area, since Part 1 has been tested and works for step 1. :)

Code:

Part 1

from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("website1.com/home")

main_page = driver.current_window_handle 
time.sleep(5) 

##cookies
driver.find_element_by_xpath('//*[@id="CybotCookiebotDialogBodyButtonAccept"]').click() 
time.sleep(5)

driver.find_element_by_xpath('//*[@id ="google-login"]/span').click() 
for handle in driver.window_handles: 
    if handle != main_page: 
        login_page = handle 

driver.switch_to.window(login_page) 

with open('logindetails.txt', 'r') as file:
    for details in file:
        email, password = details.split(':')

        driver.find_element_by_xpath('//*[@id ="identifierId"]').send_keys(email)
driver.find_element_by_xpath('//span[text()="Next"]').click()

time.sleep(5)
driver.find_element_by_xpath('//input[@type="password"]').send_keys(password) 

driver.find_element_by_xpath('//span[text()="Next"]').click() 
driver.switch_to.window(main_page) 
time.sleep(5)

Part 2

alllinks.txt contains the following links:
• website1.com/otherpage/page1
• website1.com/otherpage/page2
• website1.com/otherpage/page3

with open('alllinks.txt', 'r') as directory:
    for items in directory:
        driver.get(items)
        time.sleep(2)
        elements = driver.find_elements_by_class_name('data-xl')
        for element in elements:
            print([element])
        time.sleep(5)


driver.quit()

Result:

[Done] exited with code=0 in 53.463 seconds

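...and zero output.

Problem:

The element locators have been verified. I suspect the window handling has something to do with why the driver is not scraping anything.

All comments are welcome and appreciated. :)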

Answer 1

The URL passed to driver.get() must include the protocol (i.e. https://).

driver.get('website1.com/otherpage/page1') will only raise an exception.
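
A minimal sketch of one way to apply this to the links file, assuming every entry in alllinks.txt is a bare host/path that should be loaded over https. The helper names normalize_url and read_links are hypothetical and not part of Selenium. The sketch also strips the trailing newline that each line carries when the file is iterated.

```python
def normalize_url(raw, default_scheme="https"):
    """Return a URL that includes a protocol, or None for a blank line.

    Assumption: entries without "://" are meant to be loaded over
    `default_scheme` (https here).
    """
    url = raw.strip()  # drop the trailing newline and surrounding spaces
    if not url:
        return None
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    return url


def read_links(path):
    """Read one link per line from `path`, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as fh:
        candidates = (normalize_url(line) for line in fh)
        return [url for url in candidates if url]
```

With helpers like these, the loop in Part 2 could iterate over read_links('alllinks.txt') and pass each returned url to driver.get(url).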
