How to run `selenium-chromedriver` in multiple threads

Question

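I am using `selenium` and `chrome-driver` to scrape data from some pages and then run some additional tasks with that information (for example, typing comments on some pages).

My program has a button. Every time it is pressed, it calls `thread_(self)` (below), which starts a new thread. The target function `self.main` contains the code that runs all the selenium work on a `chrome-driver`.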

def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()

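My problem appears after the user presses the button the first time. The `th` thread opens browser A and does some work. While browser A is busy, the user presses the button again, which opens browser B running the same `self.main`. I want every opened browser to run simultaneously. The problem I am facing is that when I run that thread function, the first browser stops and the second browser opens.

I know my code can create threads without limit, and I know that will affect PC performance, but I am OK with that. I want to speed up the work done by `self.main`!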

Answer 1

Threading for `selenium` speed-up

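Consider the following functions, which illustrate how threads with selenium give some speed-up compared to a single-driver approach. The code below uses `BeautifulSoup` to scrape the HTML title from pages opened by selenium. The list of pages is `links`.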

import time
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def create_driver():
   """returns a new chrome webdriver"""
   chromeOptions = webdriver.ChromeOptions()
   chromeOptions.add_argument("--headless") # make it not visible, just comment if you like seeing opened browsers
   return webdriver.Chrome(options=chromeOptions)  

def get_title(url, driver=None):
   """Print the HTML title of url, parsed with BeautifulSoup.
   If driver is None, create a new chrome-driver and quit() it afterwards;
   otherwise use the driver provided and do not quit() it."""
   def print_title(drv):
      drv.get(url)
      soup = BeautifulSoup(drv.page_source, "lxml")
      item = soup.find('title')
      print(item.string.strip())

   # The parameter is named `driver` so it no longer shadows the `webdriver` module
   if driver:
      print_title(driver)
   else:
      driver = create_driver()
      try:
         print_title(driver)
      finally:
         driver.quit()  # always close the browser this call opened

links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/", "https://www.facebook.com/", "https://www.wikipedia.org/", 
"https://us.yahoo.com/?p=us", "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]

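Now let's call `get_title` on the `links` above.

Sequential approach

A single Chrome driver, passing all the links to it sequentially. This takes 22.3 s on my machine (note: Windows).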

start_time = time.time()
driver = create_driver()

for link in links: # could be 'like' clicks 
  get_title(link, driver)  

driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")

Multiple threads approach

Using one thread per link. The results arrive in 10.5 s, more than 2x faster.

start_time = time.time()    
threads = [] 
for link in links: # each thread could be like a new 'click' 
    th = threading.Thread(target=get_title, args=(link,))    
    th.start() # could `time.sleep` between 'clicks' to see whats'up without headless option
    threads.append(th)        
for th in threads:
    th.join() # Main thread wait for threads finish
print("multiple threads took ", (time.time() - start_time), " seconds")

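There are other working examples of this approach. The better of the two I found uses a fixed number of threads in a `ThreadPool`, and suggests that storing the `chrome-driver` instance initialized on each thread is faster than creating and starting one every time.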

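I am still not sure this is the optimal approach for getting a considerable speed-up with Selenium, since with `threading`, code that is not I/O-bound ends up being executed sequentially (one thread after another). Because of the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (that is, make use of multiple CPU cores).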

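Processes for `selenium` speed-up

To try to overcome the Python GIL limitation, I wrote the following code using the `multiprocessing` package and its `Process` class, and ran multiple tests. I even added clicks on random page hyperlinks to the `get_title` function above. (The additional code is not reproduced here.)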

import multiprocessing

def run_with_processes():
    """Run get_title in one process per link and return the elapsed seconds."""
    start_time = time.time()

    processes = []
    for link in links: # each process a new 'click'
        ps = multiprocessing.Process(target=get_title, args=(link,))
        ps.start() # could sleep 1 between 'clicks' with `time.sleep(1)`
        processes.append(ps)
    for ps in processes:
        ps.join() # Main waits for the processes to finish

    return (time.time() - start_time)

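Contrary to what I expected, `multiprocessing.Process`-based parallelism for `selenium` was on average around 8% slower than `threading.Thread`. But obviously both were on average more than twice as fast as the sequential approach. I just found out that `selenium` `chrome-driver` commands use HTTP requests (such as `POST` and `GET`), so the work is I/O-bound. It therefore releases the Python GIL, which does make it run in parallel across threads.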
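The effect of blocking I/O on threads can be checked without a browser. In the sketch below, `time.sleep` stands in for the wait on chromedriver's HTTP response. That substitution is an assumption made for illustration, and the function names are hypothetical. Like a blocking socket read, `time.sleep` releases the GIL, so the waits overlap when run in threads.

```python
import threading
import time


def fake_driver_command(seconds):
    """Stand-in for one blocking WebDriver command (assumption: sleep models the HTTP wait)."""
    time.sleep(seconds)


def run_sequential(n, seconds):
    """Issue n blocking commands one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fake_driver_command(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    """Issue n blocking commands, one per thread; return elapsed seconds."""
    start = time.perf_counter()
    threads = [
        threading.Thread(target=fake_driver_command, args=(seconds,))
        for _ in range(n)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return time.perf_counter() - start
```

Comparing `run_sequential(4, 0.2)` with `run_threaded(4, 0.2)` should show the threaded run finishing in roughly the time of a single wait, while the sequential run takes about four times as long. CPU-bound work in place of the sleep would not overlap this way.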

Threading: a good start for `selenium` speed-up

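This is not a definitive answer, as my tests were only a tiny example. Also, I am using Windows, where `multiprocessing` has many limitations. Each new `Process` is not a fork as it is on Linux, which means, among other downsides, that a lot of memory is wasted.

Taking all that into account: depending on the use case, threads may well be as good as or better than the heavier approach of using processes (especially for Windows users).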

Answer 2

Try this:

def thread_(self):
    th = threading.Thread(target=self.main)
    self.jobs.append(th)
    th.start()

More information:

https://pymotw.com/2/threading/
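For context, here is a slightly fuller sketch of how `thread_` and `self.jobs` might sit inside a class. The question does not show `self.main`, so this rests on an assumption: each call to `main` creates its own driver as a local variable rather than sharing one driver attribute across threads. The class name `Scraper`, the `driver_factory` parameter and the `wait_all` helper are hypothetical.

```python
import threading


class Scraper:
    def __init__(self, driver_factory):
        # Hypothetical: a callable returning a new driver, e.g. a function
        # that builds webdriver.Chrome(...) as create_driver() does above.
        self.driver_factory = driver_factory
        self.jobs = []  # keeps a reference to every started thread

    def main(self):
        # The driver is local to this call, so every thread gets its own browser.
        driver = self.driver_factory()
        try:
            driver.get("https://www.wikipedia.org/")
        finally:
            driver.quit()

    def thread_(self):
        th = threading.Thread(target=self.main)
        self.jobs.append(th)
        th.start()
        return th

    def wait_all(self):
        """Block until every started thread has finished."""
        for th in self.jobs:
            th.join()
```

Keeping the threads in `self.jobs` makes it possible to join them before the program exits, or to check `th.is_alive()` when deciding whether to start more.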
