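I'm trying to use a shared list that gets updated with information scraped by Selenium, so that I can export it later or use it however I choose. For some reason it gives me this error: NameError: name 'scrapedinfo' is not defined...
This is really strange to me, because I declared the list as global and then created it with multiprocessing.Manager(). I have double-checked my code several times, and it is not a case-sensitivity error. I also tried passing the list into the functions as an argument, but that caused other problems and did not work. Any help is greatly appreciated!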
from selenium import webdriver
from multiprocessing import Pool, Manager

def browser():
    driver = webdriver.Chrome()
    return driver

def test_func(link):
    driver = browser()
    driver.get(link)

def scrape_stuff(driver):
    # Scrape things
    scrapedinfo.append(...)  # Scraped stuff

def multip():
    manager = Manager()
    # Declare list here
    global scrapedinfo
    scrapedinfo = manager.list()
    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
    chunks = [links[i::3] for i in range(3)]
    pool = Pool(processes=3)
    pool.map(test_func, chunks)
    print(scrapedinfo)

multip()
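On Windows, multiprocessing starts a fresh Python process for each worker and then pickles/unpickles only part of the parent's state for that child. Global variables that are not passed through the map call are not included, so scrapedinfo is never created in the child and you get the NameError.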
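To see this behavior in isolation, here is a minimal sketch that needs no Selenium. It forces the "spawn" start method (the one Windows uses) so the result should be the same on other platforms. The helper name global_is_defined is made up for this illustration.

```python
import multiprocessing as mp


def global_is_defined(name):
    # Report whether `name` exists as a module-level global
    # in whichever process this function happens to run in.
    return name in globals()


def main():
    # Assigned only when main() runs, i.e. only in the parent process.
    global scrapedinfo
    scrapedinfo = []

    # Assumption for the demo: force the Windows-style start method.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        in_children = pool.map(global_is_defined, ["scrapedinfo", "scrapedinfo"])

    print("defined in parent:", global_is_defined("scrapedinfo"))
    print("defined in children:", in_children)


if __name__ == "__main__":
    main()
```

Run as a script, the parent reports the name as defined while the spawned workers do not, because each worker re-imports the module without executing main().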
One solution is to pass scrapedinfo in the map call. Putting together a simple example:
from multiprocessing import Pool, Manager

def test_func(param):
    # Each work item carries the shared list proxy along with its chunk of links
    scrapedinfo, link = param
    scrapedinfo.append("i scraped stuff from " + str(link))

def multip():
    manager = Manager()
    global scrapedinfo
    scrapedinfo = manager.list()
    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
    chunks = [links[i::3] for i in range(3)]
    pool = Pool(processes=3)
    pool.map(test_func, [(scrapedinfo, chunk) for chunk in chunks])
    print(scrapedinfo)

if __name__ == "__main__":
    multip()
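But you are doing more work with the Manager than you need to. map passes the workers' return values back to the parent process (and handles the chunking), so you can do: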
from multiprocessing import Pool

def test_func(link):
    # Return the result; map collects the return values in the parent
    return "i scraped stuff from " + link

def multip():
    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
    pool = Pool(processes=3)
    scrapedinfo = pool.map(test_func, links)
    print(scrapedinfo)

if __name__ == "__main__":
    multip()
and avoid the extra overhead of the clunky list proxy.
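Building on the return-value approach, one thing worth considering is that an exception raised in a worker is re-raised by map in the parent, which loses the results of the other links. A possible sketch is to have the worker return a (link, result, error) tuple instead. The scrape function below is a hypothetical stand-in for the real Selenium work, and the scheme check is only there to simulate a failure for a link such as "www.example.com".

```python
from multiprocessing import Pool


def scrape(link):
    # Hypothetical stand-in for the Selenium work.
    # Returns (link, result, error) so one bad link does not abort the whole map.
    try:
        if not link.startswith(("http://", "https://")):
            raise ValueError("missing URL scheme")
        return link, "i scraped stuff from " + link, None
    except Exception as exc:
        return link, None, repr(exc)


def multip(links):
    # The with-block cleans up the worker processes when the map is done.
    with Pool(processes=3) as pool:
        return pool.map(scrape, links)


if __name__ == "__main__":
    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
    for link, result, error in multip(links):
        print(link, "->", result if error is None else "failed: " + error)
```

map returns the results in the same order as links, so each tuple can be matched back to its input without any shared state.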