I am trying to run the official Keras code example "Image classification with Swin Transformers". The code works fine at first, but after I added a ModelCheckpoint to save an hdf5 model via the callbacks argument of model.fit (i.e. model.fit(..., callbacks=[ModelCheckpoint(...)], ...)), I get the following error: [ValueError: Unable to create dataset (name already exists)]. What does "name" refer to here, and how can I fix this?
I ran the code both on my local machine (Windows 10, TensorFlow 2.8.0) and on Google Colab (TensorFlow 2.8.2), and the same error appears in both environments.
The full code example is available at [https://keras.io/examples/vision/swin_transformers/]. The only difference between my code and the example is the single line I added for ModelCheckpoint. The location of the added line and the error message are shown below.
Code snippet:
model = keras.Model(input, output)
model.compile(
    loss=keras.losses.CategoricalCrossentropy(label_smoothing=label_smoothing),
    optimizer=tfa.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    ),
    metrics=[
        keras.metrics.CategoricalAccuracy(name="accuracy"),
        keras.metrics.TopKCategoricalAccuracy(5, name="top-5-accuracy"),
    ],
)
history = model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=num_epochs,
    validation_split=validation_split,
    # 👇 I added one line of code
    callbacks=keras.callbacks.ModelCheckpoint('lowest_loss.hdf5', monitor='loss', verbose=0, save_best_only=True, save_weights_only=True)
)
Here is the error I get:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-c96b13609516> in <module>()
18 validation_split=validation_split,
19 # 👇 I added one line of code
---> 20 callbacks = keras.callbacks.ModelCheckpoint('lowest_loss.hdf5', monitor='loss', verbose=0, save_best_only=True, save_weights_only=True)
21 )
2 frames
/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
/usr/local/lib/python3.7/dist-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
146 group = self.require_group(parent_path)
147
--> 148 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
149 dset = dataset.Dataset(dsid)
150 return dset
/usr/local/lib/python3.7/dist-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
135
136
--> 137 dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl)
138
139 if (data is not None) and (not isinstance(data, Empty)):
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.create()
ValueError: Unable to create dataset (name already exists)
Most likely, this error occurs when a layer name is duplicated between the namespace of the pretrained model and that of the downstream-task network. It can help to give every layer of the downstream-task network a unique name. Add
name='some_unique_name'
to each layer you create to resolve the problem.
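To see what the "name" in the error refers to: when saving weights to HDF5, each weight is written as one dataset keyed by the weight's name, and HDF5 rejects a duplicate key. A toy stand-in (plain Python, not real h5py, with made-up weight names) reproduces the same behavior:

```python
class FakeGroup:
    """Toy stand-in for an HDF5 group: dataset names must be unique."""

    def __init__(self):
        self._datasets = {}

    def create_dataset(self, name, data):
        if name in self._datasets:
            # Duplicate weight names from colliding layer names trigger
            # exactly this kind of error when Keras writes the hdf5 file.
            raise ValueError("Unable to create dataset (name already exists)")
        self._datasets[name] = data


g = FakeGroup()
g.create_dataset("dense/kernel:0", [1.0, 2.0])
try:
    g.create_dataset("dense/kernel:0", [3.0])  # second weight with same name
except ValueError as e:
    print(e)  # Unable to create dataset (name already exists)
```

Giving every layer a unique name makes every weight name unique, so each dataset key is written only once.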
Try this, it worked for me:
for i in range(len(model.weights)):
    model.weights[i]._handle_name = model.weights[i].name + "_" + str(i)
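The loop above simply appends each weight's index to its name, which is enough to break any ties. In pure Python terms (the weight names below are made-up examples, no TensorFlow needed for the illustration), the renaming scheme works like this:

```python
# Simulated weight names with a duplicate, as can occur when a pretrained
# backbone and a new head both contain a layer called "dense".
weight_names = ["dense/kernel:0", "dense/kernel:0", "dense_1/bias:0"]

# Append each weight's index, exactly as the loop over model.weights does,
# so every name becomes unique and the HDF5 save no longer collides.
unique_names = [f"{name}_{i}" for i, name in enumerate(weight_names)]

print(unique_names)
# ['dense/kernel:0_0', 'dense/kernel:0_1', 'dense_1/bias:0_2']
```

Note that _handle_name is a private attribute, so this workaround may break in other TensorFlow versions.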
If you are using TensorFlow 2.0 or later, you can try changing the ".hdf5" file extension to ".tf". I ran into the same problem and changed the extension as follows:
save_dir = os.path.join(os.getcwd(), "save_models")
filepath = "cnn_cnn_weights.{epoch:02d}-{val_loss:.4f}--0fold.tf"
checkpoint = ModelCheckpoint(os.path.join(save_dir, filepath),
                             monitor="val_loss", verbose=1, save_best_only=False, mode='min')
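As a side note on the filepath template above: ModelCheckpoint fills the {epoch:02d} and {val_loss:.4f} placeholders with standard Python string formatting, using the current epoch number and the logged metrics at the end of each epoch. A quick illustration (the epoch and loss values here are made up):

```python
filepath = "cnn_cnn_weights.{epoch:02d}-{val_loss:.4f}--0fold.tf"

# ModelCheckpoint substitutes the epoch and any logged metric named in the
# template; e.g. after an epoch with val_loss 0.123456 the file would be:
print(filepath.format(epoch=3, val_loss=0.123456))
# cnn_cnn_weights.03-0.1235--0fold.tf
```

Because the name changes each epoch, no file is overwritten even with save_best_only=False.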