Memory error when preprocessing a large dataset with tf.keras.layers.TextVectorization

Question · votes: 0 · answers: 1

I have roughly 300k files, about 9 GB of medical literature.

My goal is to determine the frequency of every token in the dataset and serialize the result to a CSV file as (token, frequency) pairs.

To achieve this, I am using layers.TextVectorization with output_mode='count'.
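For context, here is a minimal sketch (toy sentences, not my real data) of what I understand output_mode='count' to produce, namely one count vector per input text, aligned with the adapted vocabulary:

import tensorflow as tf
from tensorflow.keras import layers

# toy corpus for illustration only; my real input is the 300k files
vec = layers.TextVectorization(output_mode='count')
vec.adapt(["the cat sat", "the cat ate the fish"])
print(vec.get_vocabulary())             # e.g. ['[UNK]', 'the', 'cat', ...]
print(vec(["the cat sat on the cat"]))  # per-token counts, same order as the vocabulary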

Adapting on the huge dataset goes well, but when retrieving the vocabulary I get:

2024-04-20 22:38:56.832518: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-20 22:38:56.833545: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Discoverd 323719 files
Finished adapting
Traceback (most recent call last):
  File "(file_location_of_source_code)", line 55, in <module>
    inverse_vocab = vectorize_layer.get_vocabulary()
  File "D:\Anaconda\envs\src\lib\site-packages\keras\layers\preprocessing\text_vectorization.py", line 487, in get_vocabulary
    return self._lookup_layer.get_vocabulary(include_special_tokens)
  File "D:\Anaconda\envs\src\lib\site-packages\keras\layers\preprocessing\index_lookup.py", line 385, in get_vocabulary
    self._tensor_vocab_to_numpy(vocab),
  File "D:\Anaconda\envs\src\lib\site-packages\keras\layers\preprocessing\string_lookup.py", line 416, in _tensor_vocab_to_numpy
    [tf.compat.as_text(x, self.encoding) for x in vocabulary]
  File "D:\Anaconda\envs\src\lib\site-packages\keras\layers\preprocessing\string_lookup.py", line 416, in <listcomp>
    [tf.compat.as_text(x, self.encoding) for x in vocabulary]
MemoryError

Process finished with exit code 1

The relevant part of my code:

import pathlib
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

AUTOTUNE = tf.data.AUTOTUNE

files_root = pathlib.Path(r"directoryname")
files = tf.data.TextLineDataset.list_files(str(files_root/'*'))
text_ds = tf.data.TextLineDataset(files).filter(lambda x: tf.cast(tf.strings.length(x), bool))
# TextLineDataset yields one element per line of each file; the filter drops empty lines
# we use the tf.data.Dataset API because it streams large amounts of data

# Converting the vocab to integer indexes, with respect to their frequency
# (custom_standardization is defined elsewhere in my script)
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    output_mode='count')
print(f"Discoverd {len(files)} files")
vectorize_layer.adapt(text_ds.batch(1024))
print("Finished adapting")
inverse_vocab = vectorize_layer.get_vocabulary()  # <---- ERROR
print("Vocabulary Retrieved, with the len:")
size_vocab = len(inverse_vocab)
print(size_vocab)

In addition, I plan to combine the frequency arrays of all sequences; this approach works for considerably smaller datasets:

text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()
print("Finished vectorization")
freq_arr = None
# each entry is one count vector of length vocab_size, so summing all of them
# gives the total frequency of every token in the vocabulary
for i, entry in enumerate(text_vector_ds.as_numpy_iterator()):
    if i == 0:
        freq_arr = np.zeros(len(entry))
    freq_arr += entry.astype(int)

What is the solution to this problem? I need the vocabulary in order to map the actual tokens to their frequencies.

I am fairly new to TensorFlow and Keras, and any suggestions or criticism of my approach are welcome. I could really use some guidance on handling large datasets. My end goal is to feed them into a neural network (more specifically, some skip-grams). Thanks a lot.

python tensorflow keras nlp vectorization
1 Answer

0 votes

Unfortunately, you will most likely need more memory to fully process your dataset.

As far as I can tell, vectorize_layer.get_vocabulary() is not particularly well optimized and at some point keeps multiple copies of the vocabulary in memory. And it looks like your vocabulary is very large.
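If you want to confirm how large it is without materializing every token as a Python string, something along these lines should report the size (assuming your Keras version exposes vocabulary_size()):

# sanity check: report the vocabulary size without building the full string list
print(vectorize_layer.vocabulary_size())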

For the first part of the problem, consider using the max_tokens argument of TextVectorization. This will keep only the most frequent tokens.
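A minimal sketch of what I mean (the 100_000 cap is just a placeholder; pick whatever fits your memory budget):

# keep only the 100k most frequent tokens (placeholder value);
# everything rarer is folded into the [UNK] bucket
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=100_000,
    output_mode='count')
vectorize_layer.adapt(text_ds.batch(1024))
inverse_vocab = vectorize_layer.get_vocabulary()  # now bounded by max_tokens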

For the second part, I would suggest avoiding unbatch and processing the mapped data batch by batch. Maybe something like this:

text_count = text_ds.batch(1024).prefetch(1).map(vectorize_layer)
freq_arr = None
for batch in text_count:
    # sum the count vectors of all documents in this batch
    freq_batch = batch.numpy().sum(axis=0)
    if freq_arr is None:
        freq_arr = np.zeros(len(freq_batch))
    freq_arr += freq_batch.astype(int)
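Once freq_arr is filled, writing the (token, frequency) CSV you mentioned could look roughly like this (tokens.csv is just an example filename; with max_tokens set, get_vocabulary() stays bounded):

import csv

vocab = vectorize_layer.get_vocabulary()  # index i in vocab matches freq_arr[i]
with open("tokens.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["token", "frequency"])
    for token, freq in zip(vocab, freq_arr):
        writer.writerow([token, int(freq)])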