Multiprocessing with textacy or spaCy


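I am trying to speed up the processing of a large number of texts by parallelizing textacy. When I use Pool from multiprocessing, the resulting text corpus ends up empty. I am not sure whether the problem lies in the way I use textacy or in the way I use the multiprocessing paradigm. Here is an example that illustrates my problem: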

import spacy
import textacy
from multiprocessing import Pool

texts_dict={
"key1":"First text 1."
,"key2":"Second text 2."
,"key3":"Third text 3."
,"key4":"Fourth text 4."
}

model=spacy.load('en_core_web_lg')

# this works

corpus = textacy.corpus.Corpus(lang=model)

corpus.add(tuple([value, {'key':key}],) for key,value in texts_dict.items())

print(corpus) # prints Corpus(4 docs, 8 tokens)
print([doc for doc in corpus])

# now the same thing with a worker pool returns empty corpus

corpus2 = textacy.corpus.Corpus(lang=model)

pool = Pool(processes=2) 
pool.map( corpus2.add, (tuple([value, {'key':key}],) for key,value in texts_dict.items()) )

print(corpus2) # prints Corpus(0 docs, 0 tokens)
print([doc for doc in corpus2])

# to make sure we get the right data into corpus.add
pool.map( print, (tuple([value, {'key':key}],) for key,value in texts_dict.items()) )

textacy is built on spaCy. spaCy does not support multithreading, but it should be possible to run it in multiple processes: https://github.com/explosion/spaCy/issues/2075

python multiprocessing spacy pool textacy
1 Answer

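Because Python processes run in separate memory spaces, the corpus object has to be shared between the processes in the pool. To do that, wrap the corpus object in a shareable class and register that class with BaseManager. Here is one way to refactor the code so that it works: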

#!/usr/bin/python3
from multiprocessing import Pool
from multiprocessing.managers import BaseManager

import spacy
import textacy


texts = {
    'key1': 'First text 1.',
    'key2': 'Second text 2.',
    'key3': 'Third text 3.',
    'key4': 'Fourth text 4.',
}


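class PoolCorpus(object):
    """Wraps a textacy Corpus so that it can be hosted by a manager process."""

    def __init__(self):
        # Runs inside the manager process, so the model is loaded there.
        model = spacy.load('en_core_web_sm')
        self.corpus = textacy.corpus.Corpus(lang=model)

    def add(self, data):
        # Called through a proxy; the call is executed in the manager process.
        self.corpus.add(data)

    def get(self):
        # Sends a pickled copy of the corpus back to the caller.
        return self.corpus


# Make PoolCorpus available as manager.PoolCorpus().
BaseManager.register('PoolCorpus', PoolCorpus)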


if __name__ == '__main__':

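    with BaseManager() as manager:
        # Proxy to the single PoolCorpus instance held by the manager process.
        corpus = manager.PoolCorpus()

        with Pool(processes=2) as pool:
            # Each worker forwards its (text, metadata) record to the shared corpus.
            pool.map(corpus.add, ((v, {'key': k}) for k, v in texts.items()))

        print(corpus.get())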

Output:

Corpus(4 docs, 16 tokens)
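
To see why the version in the question ended up with an empty corpus, the same effect can be reproduced without spaCy or textacy. The following is only a minimal sketch using the standard library: `Collector`, `DemoManager`, `add_local` and `run_demo` are hypothetical names made up for this illustration, and plain strings stand in for the documents. It compares a module-level list mutated inside pool workers with an object hosted by a manager.

```python
from multiprocessing import Pool
from multiprocessing.managers import BaseManager

ITEMS = ["First text 1.", "Second text 2.", "Third text 3.", "Fourth text 4."]

# Module-level list: every worker process works on its own copy of it.
local_items = []


def add_local(item):
    # Runs in a worker process, so only that worker's copy is changed.
    local_items.append(item)


class Collector:
    """Hypothetical stand-in for PoolCorpus; it just stores strings."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def get(self):
        return self.items


class DemoManager(BaseManager):
    """Hypothetical manager subclass that keeps the registration local."""


DemoManager.register("Collector", Collector)


def run_demo():
    # 1) Workers mutate their own copies; the parent's list stays empty.
    with Pool(processes=2) as pool:
        pool.map(add_local, ITEMS)
    seen_by_parent = list(local_items)

    # 2) The Collector lives in the manager process; workers only hold a proxy.
    with DemoManager() as manager:
        collector = manager.Collector()
        with Pool(processes=2) as pool:
            pool.map(collector.add, ITEMS)
        shared = collector.get()  # a copy of the list is sent back to the parent

    return seen_by_parent, shared


if __name__ == "__main__":
    seen_by_parent, shared = run_demo()
    print(len(seen_by_parent), len(shared))
```

One thing to keep in mind with this pattern: methods called through a proxy are executed in the manager process, so in this sketch the pool workers only forward the calls. If the goal is to spread the CPU-heavy work itself across processes, it may be worth having the workers return their results and collecting them in the parent instead.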