Python Gensim LdaMallet CalledProcessError with a large corpus (small corpus runs fine)


When I run a Gensim LdaMallet model on my full corpus of about 16 million documents, I get a CalledProcessError "returned non-zero exit status 1" error. Interestingly, if I run the exact same code on a test corpus of about 160,000 documents, it runs perfectly fine. Since it works on my small corpus, I'm inclined to think the code is fine, but I'm not sure what else could be causing this error...

I've already tried editing the mallet.bat file as suggested here, to no avail. I've also double-checked the paths, but that shouldn't be the issue since everything works on the smaller corpus.

import os
import gensim
from gensim import corpora

id2word = corpora.Dictionary(lists_of_words)
corpus = [id2word.doc2bow(doc) for doc in lists_of_words]
num_topics = 30
os.environ.update({'MALLET_HOME': r'C:/mallet-2.0.8/'})
mallet_path = r'C:/mallet-2.0.8/bin/mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)

Here is the full traceback and error:

  File "<ipython-input-57-f0e794e174a6>", line 8, in <module>
    ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 132, in __init__
    self.train(corpus)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 273, in train
    self.convert_input(corpus, infer=False)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 262, in convert_input
    check_output(args=cmd, shell=True)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\utils.py", line 1918, in check_output
    raise error

CalledProcessError: Command 'C:/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.txt --output C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.mallet' returned non-zero exit status 1.
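
One way to surface Mallet's actual error (my suggestion, not from the thread) is to re-run the failing import-file command yourself and capture its output, so you can read Mallet's own error message directly. Note the temp-file paths below are the ones from the traceback and will differ on every run:

import subprocess

# re-run the exact command from the traceback to see Mallet's own error output
cmd = (r'C:/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence '
       r'--remove-stopwords --token-regex "\S+" '
       r'--input C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.txt '
       r'--output C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.mallet')
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.returncode)
print(result.stderr)  # on large corpora this often reveals e.g. a Java OutOfMemoryError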
1 Answer

I'm glad you found my post, and I'm sorry it didn't work for you. In my experience, the main causes of that error are Java not being installed properly and the environment variables not pointing at the right paths.
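
As a quick sanity check along those lines (my addition, using only the standard library), you can confirm from Python that Java is reachable and that MALLET_HOME is set:

import os
import shutil

# Mallet is a Java program, so java must be reachable on the PATH
print(shutil.which('java'))           # should print a path to java, not None
# gensim's wrapper relies on this environment variable
print(os.environ.get('MALLET_HOME'))  # should print e.g. C:/mallet-2.0.8/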

Since your code runs on the smaller dataset, I would start by looking at your data. Mallet is picky in that it only accepts the cleanest data, and it may be hitting a null, punctuation, or a float.

Did you build your dictionary from a sample, or did you pass it the entire dataset?

This is basically what it is doing: translating sentences into words, turning the words into numbers, and then counting frequencies, like this:

[(3, 1), (13, 1), (37, 1)]

Word 3 ("assist") appears 1 time. Word 13 ("payment") appears 1 time. Word 37 ("account") appears 1 time.
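
As a minimal sketch of that mapping (the toy documents here are invented for illustration), Dictionary.doc2bow is what produces those (word_id, count) pairs:

from gensim import corpora

# toy documents; the words are placeholders for illustration only
docs = [['assist', 'payment', 'account'],
        ['payment', 'account', 'balance']]

dictionary = corpora.Dictionary(docs)        # maps each word to an integer id
bow = dictionary.doc2bow(docs[0])            # [(word_id, count), ...]
print(bow)                                   # e.g. [(0, 1), (1, 1), (2, 1)]
print([(dictionary[i], n) for i, n in bow])  # map the ids back to words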

Your LDA then takes each word and scores it against how frequently every other word in the dictionary occurs, and it does this across the entire dictionary, so if you let it see millions of unique words it will crash real fast.

This is how I implemented Mallet and shrank my dictionary, excluding the stemming and other preprocessing steps:

import gensim

# build a dictionary of all the words in the csv by iterating through the rows;
# it records how many times each word appears in the training set
dictionary = gensim.corpora.Dictionary(processed_docs)

# preview the first few (id, word) pairs
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

# throw out words so frequent that they tell us little about the topic, as well
# as words that are too infrequent: drop anything appearing in fewer than 15
# documents or in more than half of them, then keep the 100,000 most frequent

dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
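
If the 16-million-document corpus is producing a huge vocabulary, it is worth confirming the cap actually took effect (a quick check of my own, not part of the original recipe):

# after filter_extremes the vocabulary holds at most 100,000 entries
print(len(dictionary))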

# the words become numbers and each document's word frequencies are counted;
# e.g. a random row, 4310: it has 27 words, and the word indexed 2 shows up 4 times
# preview the bag of words

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

import os

os.environ['MALLET_HOME'] = 'C:\\mallet\\mallet-2.0.8'
mallet_path = 'C:\\mallet\\mallet-2.0.8\\bin\\mallet'

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus, num_topics=20, alpha=0.1,
                                             id2word=dictionary, iterations=1000, random_seed=569356958)

Also, I split your ldamallet call into its own cell, because compile time is slow, especially on datasets that size. I hope this helps; if you still get errors, let me know :)
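
Once training succeeds, one quick sanity check (my addition, using gensim's standard LdaMallet API) is to print the top words of each discovered topic:

# print the top 10 words for each of the 20 topics
for topic_id, topic in ldamallet.show_topics(num_topics=20, num_words=10, formatted=True):
    print(topic_id, topic)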
