Parallelization with Selenium in Python

Question

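I am trying to parallelize a loop that uses Selenium to retrieve data from a website. The loop iterates over URLlist, a list of URLs that I built beforehand.

First, I log in to the page, which creates the webdriver instance.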

from selenium import webdriver

browser = webdriver.Chrome(executable_path='chromedriver.exe')
browser.get('https://somepage.com')
username = browser.find_element_by_id("email")
password = browser.find_element_by_id("password")
username.send_keys("[email protected]")
password.send_keys("pwd123")
browser.find_element_by_id("login-button").click()

Then my loop starts and calls some functions that operate on the page.

for url in URLlist:
   browser.get(url)
   data1 = do_stuff()
   data2 = do_other_stuff()

I don't know where to start, because I imagine I need one webdriver instance per thread.

What is the right (and perhaps the simplest) way to do this?

Answer 1

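You need to put your test methods in a separate .py file, install the pytest package, and run that .py file with pytest. The -n option comes from the pytest-xdist plugin and --html from the pytest-html plugin, so both plugins need to be installed as well. From cmd, try something along these lines:

python -m pytest -n 3 C:\test_file.py --html=C:\Report.html

In this case, up to 3 test methods run in parallel, one per worker process.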

Answer 2

To make parallelizing the web scraping simpler, you need to install numpy.

python -m pip install numpy

Once that is done, you can easily achieve what you want. Here is a simple example:

import threading
import numpy as np
from selenium import webdriver

# List that keeps track of the threads
threads = []

threadCount = 5  # Number of threads you want

# Custom Thread class
class doStuffThread(threading.Thread):
    def __init__(self, partLinks):
        threading.Thread.__init__(self)
        self.partLinks = partLinks

    def run(self):
        # New browser instance for each thread
        browser = webdriver.Chrome(executable_path='chromedriver.exe')
        try:
            for link in self.partLinks:
                browser.get(link)
                doStuff(link)
                doOtherStuff(link)
        finally:
            # Close this thread's browser even if a page fails
            browser.quit()

# Split the links so that each thread gets a part of them
# (links is the list of URLs, i.e. URLlist in the question)
for partLinks in np.array_split(links, threadCount):
    t = doStuffThread(partLinks)
    threads.append(t)
    t.start()

# Wait until all threads are finished
for x in threads:
    x.join()
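
As an alternative to subclassing threading.Thread and splitting the list by hand, the same idea can be sketched with the standard library's ThreadPoolExecutor, which hands URLs to whichever worker is free and needs no numpy. This is only a sketch under assumptions: the names scrape_all and make_chrome are made up for illustration, returning the page title stands in for whatever do_stuff() and do_other_stuff() compute, and make_chrome assumes Selenium can locate chromedriver itself. Each thread lazily creates its own driver, so every driver has its own session; the login steps from the question would go inside make_chrome.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_chrome():
    # Hypothetical factory; imported lazily so selenium loads only when needed.
    from selenium import webdriver
    return webdriver.Chrome()


def scrape_all(urls, make_driver=make_chrome, workers=5):
    """Visit every URL using at most `workers` browsers, one per thread."""
    local = threading.local()   # holds the driver that belongs to the current thread
    drivers = []                # every driver created, so all can be closed at the end
    lock = threading.Lock()

    def scrape(url):
        driver = getattr(local, "driver", None)
        if driver is None:
            # First task on this thread: create its private browser.
            driver = make_driver()
            local.driver = driver
            with lock:
                drivers.append(driver)
        driver.get(url)
        # Placeholder result; the real scraping functions would be called here.
        return url, driver.title

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map returns results in the same order as `urls`.
            return list(pool.map(scrape, urls))
    finally:
        for driver in drivers:
            driver.quit()
```

A call such as scrape_all(URLlist, workers=5) would then return a list of (url, title) pairs in the order of URLlist. Passing the driver factory as a parameter keeps the threading logic separate from the browser setup, which also makes it possible to try the function with a stand-in driver.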
