Data pipeline in tf.keras with TFRecords or NumPy

Question

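I want to train a model with tf.keras in TensorFlow 2.0 on data that is larger than my RAM, but the tutorials only show examples with predefined datasets.

I followed this tutorial:

Load Images with tf.data, but I could not get it to work for data stored in NumPy arrays or TFRecords.

Here is an example that converts arrays into a TensorFlow dataset. What I want is to make this work with multiple NumPy array files or multiple TFRecord files.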

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# Shuffle and slice the dataset.
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)

# Since the dataset already takes care of batching,
# we don't pass a `batch_size` argument.
model.fit(train_dataset, epochs=3)

Answer 1

If you have TFRecord files:

path = ['file1.tfrecords', 'file2.tfrecords', ..., 'fileN.tfrecords']
dataset = tf.data.Dataset.list_files(path, shuffle=True).repeat()
dataset = dataset.interleave(lambda filename: tf.data.TFRecordDataset(filename), cycle_length=len(path))
dataset = dataset.map(parse_function).batch(64)  # batch() requires a batch size; 64 matches the question

parse_function handles decoding and any kind of augmentation.
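The answer does not show what parse_function looks like for TFRecords, so here is a minimal sketch. The record layout (an "image" feature holding raw float32 bytes and an integer "label" feature), the IMAGE_SHAPE value, and the helper names write_example_file and make_dataset are all assumptions made for illustration; your files may be laid out differently.

```python
import os

import numpy as np
import tensorflow as tf

# Hypothetical record layout: raw float32 bytes plus an integer label.
FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
IMAGE_SHAPE = (4, 4, 1)  # assumed; replace with your real H, W, C


def write_example_file(path, images, labels):
    """Write (image, label) pairs to a single TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in zip(images, labels):
            raw = np.asarray(image, dtype=np.float32).tobytes()
            feature = {
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[raw])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())


def parse_function(serialized):
    """Decode one serialized tf.train.Example into (image, label)."""
    parsed = tf.io.parse_single_example(serialized, FEATURES)
    image = tf.io.decode_raw(parsed["image"], tf.float32)
    image = tf.reshape(image, IMAGE_SHAPE)
    return image, parsed["label"]


def make_dataset(paths, batch_size):
    """Interleave records from several TFRecord files and batch them."""
    files = tf.data.Dataset.list_files(paths, shuffle=True)
    records = files.interleave(tf.data.TFRecordDataset, cycle_length=len(paths))
    return records.map(parse_function).batch(batch_size)
```

Only the file names are held in memory; records are read from disk as the dataset is consumed, which is what makes this usable when the data does not fit in RAM.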

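For NumPy arrays, you can build the dataset either from a list of file names or from a list of arrays. The labels are just a list, or they can be taken from the file itself when parsing a single example.

path = [...]  # list of NumPy arrays

or

path = os.listdir(path_to_files)  # returns bare file names; join them with the directory to get full paths

dataset = tf.data.Dataset.from_tensor_slices((path, labels))
dataset = dataset.map(parse_function).batch(64)  # batch() requires a batch size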
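As a rough illustration of building the two lists, the sketch below collects full file paths and derives labels from the file names. The naming convention (an integer label before the first underscore, as in "3_sample.npy") and the helper name list_files_with_labels are hypothetical, not something the answer specifies.

```python
import os


def list_files_with_labels(directory, extension=".npy"):
    """Return (paths, labels) for files named like '<label>_<anything>.npy'."""
    paths, labels = [], []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(extension):
            continue  # skip unrelated files
        # os.listdir returns bare file names, so join with the directory.
        paths.append(os.path.join(directory, name))
        # Hypothetical convention: the integer before the first underscore is the label.
        labels.append(int(name.split("_", 1)[0]))
    return paths, labels
```

The two returned lists have the same length and order, so they can be passed together as the (path, labels) tuple to tf.data.Dataset.from_tensor_slices.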

parse_function handles the decoding:

def parse_function(filename, label):  # both filename and label are passed if you provided both to from_tensor_slices
    f = tf.io.read_file(filename)  # tf.read_file in TF 1.x
    image = tf.image.decode_image(f)
    image = tf.reshape(image, [H, W, C])
    label = label  # or it could be extracted from, for example, the filename or the file itself
    # do any augmentations here
    return image, label

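To decode .npy files, the best way is to use reshape without read_file or decode_raw, but first load the arrays with np.load:

paths = [np.load(i) for i in ["x1.npy", "x2.npy"]]
image = tf.reshape(filename, [2])  # inside parse_function; filename now holds the loaded array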

Or try using decode_raw:

f = tf.io.read_file(filename)
image = tf.io.decode_raw(f, tf.float32)  # note: the raw bytes include the .npy header, not only the array data
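Loading every array with np.load up front keeps all of them in memory, which may not help when the data is larger than RAM. One alternative, not taken from the answer itself, is to keep only the file names in the dataset and call np.load lazily through tf.numpy_function. The sketch below assumes one example per .npy file and a known EXAMPLE_SHAPE; the helper names are made up for illustration.

```python
import numpy as np
import tensorflow as tf

EXAMPLE_SHAPE = (2,)  # assumed shape of the array stored in each file


def _load_npy(path):
    # tf.numpy_function passes string tensors in as bytes.
    return np.load(path.decode("utf-8")).astype(np.float32)


def parse_function(filename, label):
    """Load one .npy file on demand and return (array, label)."""
    array = tf.numpy_function(_load_npy, [filename], tf.float32)
    # numpy_function drops static shape information, so restore it.
    array = tf.ensure_shape(array, EXAMPLE_SHAPE)
    return array, label


def make_npy_dataset(paths, labels, batch_size):
    """Build a batched dataset from .npy file paths and a list of labels."""
    dataset = tf.data.Dataset.from_tensor_slices((paths, labels))
    return dataset.map(parse_function).batch(batch_size)
```

Because np.load handles the .npy header itself, this avoids the header bytes that a raw byte decode would pick up. The trade-off is that tf.numpy_function runs ordinary Python code, so the dataset cannot be serialized into a graph for use elsewhere.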

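Then pass the batched dataset to model.fit(dataset). TensorFlow 2.0 allows simple iteration over a dataset, so there is no need for an iterator. Even in later versions of the 1.x API you can pass a dataset directly to the .fit method.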

for example in dataset:
    func(example)
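One detail worth sketching: the TFRecord pipeline in this answer calls .repeat(), which makes the dataset infinite, and model.fit then needs steps_per_epoch to know where an epoch ends. The example below uses random toy arrays and a single Dense layer purely as stand-ins (both are assumptions for illustration, not part of the original answer).

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for a real pipeline (assumption for illustration).
x = np.random.rand(256, 8).astype(np.float32)
y = np.random.rand(256, 1).astype(np.float32)

batch_size = 64
dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .shuffle(buffer_size=256)
    .repeat()  # infinite dataset, as in the TFRecord pipeline
    .batch(batch_size)
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# An infinite dataset has no natural epoch boundary, so state how many
# batches make up one epoch.
steps_per_epoch = len(x) // batch_size
history = model.fit(dataset, epochs=2, steps_per_epoch=steps_per_epoch, verbose=0)
```

If the dataset is finite (no .repeat()), steps_per_epoch can be left out and each epoch runs until the dataset is exhausted.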
