Bitfusion Ubuntu 14 TensorFlow AMI因OOM错误而失败

问题描述 投票:0回答:1

使用"Bitfusion Ubuntu 14 TensorFlow" AMI,任何尝试使用大型Tensors执行操作,例如

sess.run(tf.argmax(y, 1), feed_dict={x: use_x})

use_x是一个28,000 tf.Tensor的浮子,结果

“资源耗尽:OOM”

错误。这使得AMI对我无法使用。

我有什么设置可以防止这种情况发生吗?

——————————

I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16384):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (32768):     Total Chunks: 1, Chunks in use: 0 56.8KiB allocated for chunks. 3.1KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (65536):     Total Chunks: 1, Chunks in use: 0 111.2KiB allocated for chunks. 4B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (131072):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (262144):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (524288):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1048576):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2097152):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4194304):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8388608):   Total Chunks: 2, Chunks in use: 0 23.73MiB allocated for chunks. 440.3KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16777216):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (33554432):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (67108864):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (134217728):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (268435456):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 83.74MiB was 64.00MiB, Chunk State: 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0200 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0300 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0400 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a2400 of size 6144
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a3c00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a3d00 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a4a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a4b00 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023d6b00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023d6c00 of size 25088000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703bc3c00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703bc5c00 of size 12000000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704737700 of size 6144
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704738f00 of size 60160
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704747a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704747b00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749b00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749c00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749d00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749e00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749f00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70474a000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70474a100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70474a200 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704758600 of size 60160
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704767100 of size 76288
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779b00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779c00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779d00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779e00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779f00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a200 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a300 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a400 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a900 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477aa00 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477b700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477b800 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7047ad800 of size 12000000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x705f67a00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x705f69a00 of size 25088000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x707756a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082c8600 of size 6144
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082c9e00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082c9f00 of size 6144
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082e7400 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082e7500 of size 25088000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x709ad4500 of size 12000000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70a646000 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70a646d00 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70a678d00 of size 87808000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70fa36500 of size 3703905024
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x70474a300 of size 58112
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x70531f300 of size 12879616
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x707756b00 of size 12000000
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x7082cb700 of size 113920
I tensorflow/core/common_runtime/bfc_allocator.cc:689]      Summary of in-use Chunks by size: 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 35 Chunks of size 256 totalling 8.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 3328 totalling 9.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 4 Chunks of size 6144 totalling 24.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 4 Chunks of size 8192 totalling 32.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 2 Chunks of size 60160 totalling 117.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 76288 totalling 74.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 204800 totalling 600.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 12000000 totalling 34.33MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 25088000 totalling 71.78MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 87808000 totalling 83.74MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 3703905024 totalling 3.45GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 3.64GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats: 
Limit:                  3928915968
InUse:                  3903864320
MaxInUse:               3903864320
NumAllocs:                  418794
MaxAllocSize:           3703905024

W tensorflow/core/common_runtime/bfc_allocator.cc:270] ******************************************************************************xxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 83.74MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:907] Resource exhausted: OOM when allocating tensor with shape[28000,1,28,28]

Traceback (most recent call last):
  File "tf_simple.py", line 173, in <module>
    evals = sess.run(tf.argmax(y, 1), feed_dict={x: use_x})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 343, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 567, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 640, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 662, in _do_call
    e.code)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[28000,1,28,28]
     [[Node: 1_conv_layer/kernel_logits/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](as_grid, 1_conv_layer/kernel_weights/W1/read)]]
     [[Node: ArgMax/_2316 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1481_ArgMax", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'1_conv_layer/kernel_logits/Conv2D', defined at:
  File "tf_simple.py", line 47, in <module>
    final_dropout=final_dropout)
  File "/home/ubuntu/mlcode/tf_utils.py", line 150, in make_ff_network
    layer_name)
  File "/home/ubuntu/mlcode/tf_utils.py", line 86, in _add_conv_layer
    kernel_logits = tf.nn.conv2d(input_tensor, weights, strides=[1, 1, 1, 1], padding='SAME') + biases
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 295, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 694, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
    self._traceback = _extract_stack()
amazon-web-services machine-learning gpu tensorflow ami
1个回答
1
投票

问题是AWS GPU上的内存限制大约为4GB,这对AMI来说不是问题:

Limit:                  3928915968

InUse:                  3903864320

MaxInUse:               3903864320

NumAllocs:                  418794

MaxAllocSize:           3703905024

内存限制为3.928GB,使用的内存为3.903GB,分配请求为0.083GB,超出了内存限制。在AWS上,您可以选择重新编写代码,使其可以在4GB限制内运行,在该代码段的仅CPU模式下运行,并使用系统内存(这当然会破坏使用GPU的目的),或者等待AWS引入具有更大内存的新GPU实例。

或者,您可以寻找另一个提供更多最新GPU的云提供商,例如Nimbix。

© www.soinside.com 2019 - 2024. All rights reserved.