MemoryError when using Tokenizer from keras.preprocessing.text

Problem description · votes: 1 · answers: 1

I want to build an RNN model with Keras to classify sentences.

I tried the following code:

from keras.preprocessing.text import Tokenizer

docs = []
with open('all_dga.txt', 'r') as f:
    for line in f.readlines():
        # Each line holds two space-separated fields; keep the first (the domain).
        dga_domain, _ = line.split(' ')
        docs.append(dga_domain)

t = Tokenizer()
t.fit_on_texts(docs)
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

But I got a MemoryError. It seems I cannot load all the data into memory. This is the output:

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    encoded_docs = t.texts_to_matrix(docs, mode='count')
  File "/home/yurzho/anaconda3/envs/deepdga/lib/python3.6/site-packages/keras/preprocessing/text.py", line 273, in texts_to_matrix
    return self.sequences_to_matrix(sequences, mode=mode)
  File "/home/yurzho/anaconda3/envs/deepdga/lib/python3.6/site-packages/keras/preprocessing/text.py", line 303, in sequences_to_matrix
    x = np.zeros((len(sequences), num_words))
MemoryError

If anyone is familiar with Keras, please tell me how to preprocess this dataset.

Thanks in advance!

python nlp keras classification rnn
1 Answer

2 votes

Since the error is raised at t.texts_to_matrix(docs, mode='count'), fitting the documents is not the problem; the vocabulary is built by t.fit_on_texts(docs) without issue. What fails is the allocation np.zeros((len(sequences), num_words)), i.e. one dense row per document across the entire vocabulary.
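To see why that allocation fails, a rough back-of-the-envelope estimate helps (the corpus and vocabulary sizes below are hypothetical placeholders; substitute your own values):

# Hypothetical sizes -- replace with len(docs) and len(t.word_index) + 1:
num_docs = 1_000_000   # number of documents in all_dga.txt
num_words = 100_000    # vocabulary size discovered by fit_on_texts

# sequences_to_matrix allocates np.zeros((num_docs, num_words)),
# which defaults to float64 (8 bytes per entry):
bytes_needed = num_docs * num_words * 8
print(f"~{bytes_needed / 1e9:.0f} GB")  # ~800 GB -- far beyond typical RAM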

So you can convert the documents to matrices in batches instead:

from keras.preprocessing.text import Tokenizer

t = Tokenizer()

with open('/Users/liling.tan/test.txt') as fin:
    for line in fin:
        t.fit_on_texts(line.split())  # Fitting the tokenizer line-by-line.

M = []

with open('/Users/liling.tan/test.txt') as fin:
    for line in fin:
        # Converting the lines into matrix, line-by-line.
        m = t.texts_to_matrix([line], mode='count')[0]
        M.append(m)

But if your machine cannot handle that amount of data in memory, you will see the MemoryError again at a later stage, once the accumulated M approaches the size of the full matrix. One way around that is to never materialize the full matrix at all, as sketched below.
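A minimal sketch of that idea, assuming your downstream code can consume one batch at a time (the function name matrix_batches and the batch size are illustrative, not part of the original answer):

def matrix_batches(path, tokenizer, batch_size=1024):
    # Yield count matrices batch by batch instead of holding the full M.
    batch = []
    with open(path) as fin:
        for line in fin:
            batch.append(line)
            if len(batch) == batch_size:
                yield tokenizer.texts_to_matrix(batch, mode='count')
                batch = []
    if batch:  # flush the last partial batch
        yield tokenizer.texts_to_matrix(batch, mode='count')

# Train or write to disk one batch at a time, e.g. with the fitted t above:
# for m in matrix_batches('/Users/liling.tan/test.txt', t):
#     ...  # feed m to the model or save it incrementally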
