I'm running into a problem using multiprocessing in Python. I have two pieces of code. The first works well, but it lives outside a class, and I need to move it into a class because that class is part of a larger program.
The working code (without a class) is:
import time
import spacy
from multiprocessing import Pool

nlp = spacy.load("en_core_web_md")
start_time = time.time()
doc1 = nlp("Data Scientist")

def get_paralel_similarity(item):
    return doc1.similarity(nlp(item))

if __name__ == '__main__':
    pool = Pool()  # Create a multiprocessing Pool
    # df is a DataFrame with a "jobs" column, loaded elsewhere
    similarities = pool.map(get_paralel_similarity, list(df["jobs"]))
    print("--- %s seconds ---" % (time.time() - start_time))
--- 10.971235990524292 seconds ---
As you can see, it runs in under 11 seconds. Without multiprocessing, the same job takes about a minute.
The problem is that doc1 is dynamic and I need to run this code many times, which is why I need to put it inside a class. The code I wrote for that is:
import time
import spacy
import warnings
import operator
from multiprocessing import Pool, set_start_method
from functools import partial

warnings.filterwarnings("ignore")
nlp = spacy.load("en_core_web_md")

def get_paralel_similarity(doc1, item):
    return doc1.similarity(nlp(item))

class Matcher(object):
    def __init__(self, **kwargs):
        self.word = kwargs.get('word')
        self.word_list = kwargs.get('word_list')
        self.n = kwargs.get('n')
        self.nlp = kwargs.get('nlp')

    def get_top_similarities(self):
        start_time = time.time()
        pool = Pool()
        similarities = {}
        doc1 = nlp(str(self.word))
        func = partial(get_paralel_similarity, doc1)
        print("finished partial and started mapping")
        similarities = pool.map(func, self.word_list)
        pool.close()
        pool.join()
        print("--- %s seconds ---" % (time.time() - start_time))
        return similarities
When I do:
import pandas as pd

df = pd.read_pickle("complete.pkl")
matcher = Matcher(word="Data Scientist", word_list=list(df["jobs"]), n=5, nlp=nlp)
similarity = matcher.get_top_similarities()
it takes forever and never finishes. I would appreciate any help in understanding where the problem is.
After some experimenting, I realized the problem lies in the functools.partial step, but I have not solved it yet.
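For what it's worth, functools.partial over a module-level function is itself picklable, which is what multiprocessing requires of anything passed to pool.map; it is locally defined functions that are not. A minimal check, using plain equality as a stand-in for the similarity call (similarity and make_worker are illustrative names, not spaCy API):

```python
import pickle
from functools import partial

def similarity(doc1, item):
    # Stand-in for doc1.similarity(nlp(item))
    return doc1 == item

# partial over a module-level function pickles fine
func = partial(similarity, "data scientist")
payload = pickle.dumps(func)

def make_worker(doc1):
    def worker(item):  # defined inside a function: a "local object"
        return doc1 == item
    return worker

# A locally defined function cannot be pickled, so sending it to
# worker processes fails
try:
    pickle.dumps(make_worker("data scientist"))
    local_picklable = True
except (AttributeError, pickle.PicklingError):
    local_picklable = False
```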
@mkrieger suggested switching to a function instead of a class, so I created:
import time
import spacy
import pandas as pd
from multiprocessing import Pool

nlp = spacy.load("en_core_web_md")
start_time = time.time()

def get_similarity(doc1, df):
    def get_paralel_similarity(item):
        return doc1.similarity(nlp(item))
    if __name__ == '__main__':
        pool = Pool()  # Create a multiprocessing Pool
        similarities = pool.map(get_paralel_similarity, list(df["jobs"]))
        print("--- %s seconds ---" % (time.time() - start_time))
df = pd.read_pickle("/Users/gadgethub/ontology/src/api_v2/others/emsi_ontology_complete.pkl")
get_similarity("data scientist",df)
The problem is that now I get a different error:
AttributeError: Can't pickle local object 'get_similarity.<locals>.get_paralel_similarity'
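This error means multiprocessing cannot pickle a function that is defined inside another function. A minimal sketch of the usual fix is to move the worker to module level and bind doc1 with partial; here a word-overlap count stands in for the spaCy similarity call, and the function names are illustrative:

```python
from functools import partial
from multiprocessing import Pool

# The worker lives at module level, so multiprocessing can pickle it
def get_parallel_similarity(doc1, item):
    # Stand-in for doc1.similarity(nlp(item)): counts shared words
    return len(set(doc1.split()) & set(item.split()))

def get_similarity(word, items):
    # partial over a module-level function is picklable
    func = partial(get_parallel_similarity, word)
    with Pool() as pool:
        return pool.map(func, items)

if __name__ == '__main__':
    print(get_similarity("data scientist", ["data engineer", "data scientist"]))
```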