Reading random elements from an .h5 file without loading the whole matrix

Question

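I have a huge training dataset that does not fit in RAM. I am trying to load random batches of images without loading the whole .h5 file. My approach is to create a list of indices and shuffle that, rather than shuffling the whole .h5 file. For example: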

import h5py
import numpy as np

a = np.arange(2000*2000*2000).reshape(2000, 2000, 2000)
# random indices, so that only idx needs reshuffling at the end of each epoch
idx = np.random.randint(2000, size=800)

# create the huge dataset: 32 GB, more than my RAM
with h5py.File('./tmp.h5', 'w') as f:
    tmp = f.create_dataset('a', (2000, 2000, 2000))
    tmp[:] = a

# read it
with h5py.File('./tmp.h5', 'r') as f:
    # without [:] this raises an error; with [:] the whole file is loaded, which I don't want
    tensor = f['a'][:][idx]

Does anyone have a solution?

Answer 1

Thanks to @max9111, here is how I suggest solving it:

batch_size = 100
idx = np.arange(2000)
# shuffle in place (np.random.shuffle returns None, so its result must not be assigned)
np.random.shuffle(idx)

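Because of a constraint of h5py:

Selection coordinates must be given in increasing order

the indices of each batch should be sorted before reading: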

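def batches(path, idx, batch_size):
    # Yield (img, label) batches for one epoch; a trailing partial batch is dropped.
    epoch_len = len(idx)
    with h5py.File(path, 'r') as f:
        for step in range(epoch_len // batch_size):
            # take one slice of the shuffled indices and sort it, as h5py requires
            sel = np.sort(idx[step * batch_size:(step + 1) * batch_size])
            # the same selection is used for both datasets, so images and labels stay aligned
            yield f['img'][sel], f['label'][sel]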
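
Sorting each batch works when the indices are distinct, as they are for a shuffled np.arange. The indices in the question come from np.random.randint, which may repeat values, and as far as I know h5py also rejects repeated indices in such a selection. The following is a minimal sketch of one way to handle both cases, not a definitive implementation: read_rows is a hypothetical helper (not part of h5py), and the file name, dataset name and shape are made up so the example stays small. It reads the sorted distinct indices from the file and then restores the requested order in memory.

```python
import os
import tempfile

import h5py
import numpy as np


def read_rows(dataset, indices):
    """Read dataset[indices] for a 1-D array of unsorted, possibly repeated indices."""
    indices = np.asarray(indices)
    # Sorted distinct indices, plus the position of each original index among them.
    unique_sorted, inverse = np.unique(indices, return_inverse=True)
    # Only this selection is loaded into memory, not the whole dataset.
    rows = dataset[unique_sorted]
    # Undo the sort and re-expand repeated indices in memory.
    return rows[inverse]


# Small stand-in for the real file (hypothetical name and shape).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
data = np.arange(20 * 3 * 4, dtype=np.float32).reshape(20, 3, 4)
with h5py.File(path, "w") as f:
    f.create_dataset("a", data=data)

rng = np.random.default_rng(0)
idx = rng.integers(0, 20, size=8)  # unsorted, may contain repeats

with h5py.File(path, "r") as f:
    batch = read_rows(f["a"], idx)
```

Here batch has one row per entry of idx, in the order given. If the order inside a batch does not matter, the simpler np.sort approach above is enough.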
