Training pipeline fails after uploading model artifacts to Google Cloud Storage


Here is a snippet of my training code:

# Hyperparameter grid (a single value per parameter here).
param_grid = {
    "max_tokens": [100],
    "max_len": [10],
    "dropout": [0.1],
}
gs_model = GridSearchCV(KerasClassifier(build_model), param_grid, cv=3, scoring='accuracy')
gs_model.fit(x_train, y_train, verbose=1)
best_params = gs_model.best_params_

# Rebuild and retrain the model with the best parameters found.
optimized_model = build_model(
    max_tokens=best_params["max_tokens"],
    max_len=best_params["max_len"],
    dropout=best_params["dropout"],
)
optimized_model.fit(
    x_train,
    y_train,
    epochs=3,
    validation_split=0.2,
    callbacks=[tensorflow.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=1)],
)

# Save locally, then upload the saved directory and requirements.txt to the bucket.
model_name = "/tmp/custom-model-test"
optimized_model.save(model_name)
print('saved model to ', model_name)
upload_from_directory(model_name, "[redacted Bucket name]", "custom-model-test")
try:
    upload_blob("[redacted Bucket name]", "goback-custom-train/requirements.txt", "custom-model-test/requirements.txt")
except Exception:
    print(traceback.format_exc())
    print('Upload failed')

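This uploads successfully to Google Cloud Storage. It uses model.save from Keras and saves the resulting directory to my bucket, along with a requirements.txt file. To be clear, after the code block above runs, a directory custom-model-test/ is created in gs://[redacted Bucket name] containing:

requirements.txt
tmp/

and tmp/ contains:

keras-metadata.pb
saved_model.pb
variables/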

I run this container from the following code block in my Kubeflow pipeline:

training_job_run_op = gcc_aip.CustomContainerTrainingJobRunOp(
    project = project,
    display_name = display_name,
    container_uri=training_container_uri,
    model_serving_container_image_uri=model_serving_container_image_uri,
    model_serving_container_predict_route = model_serving_container_predict_route,
    model_serving_container_health_route = model_serving_container_health_route,
    model_serving_container_ports = [8080],
    service_account = "[redacted service account]",
    machine_type = "n1-highmem-2",
    accelerator_type ="NVIDIA_TESLA_V100",
    staging_bucket = BUCKET_NAME)

For some reason, after the model is trained and the artifacts are saved (the training logs indicate that it completed successfully), the pipeline fails with the following log:

  File "/opt/python3.7/lib/python3.7/site-packages/google/cloud/aiplatform/training_jobs.py", line 905, in _raise_failure
    raise RuntimeError("Training failed with:\n%s" % self._gca_resource.error)
RuntimeError: Training failed with:
code: 5
message: "There are no files under \"gs://[redacted Bucket name]/aiplatform-custom-training-2022-04-21-14:04:46.151/model\" to copy."

What is happening here, and how do I fix it?

google-cloud-platform google-cloud-storage google-cloud-ml google-cloud-vertex-ai
1 Answer

As the comments also hint, there seem to be two buckets involved, or possibly two locations within a single bucket.

The error refers to a very specific path, including a timestamp that is not mentioned anywhere in your description.
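To see whether anything was ever written under that path, the objects beneath it can be listed. The following is a minimal sketch, assuming the google-cloud-storage client library is installed and credentials are available in the environment. The helper names split_gs_uri and list_objects_under are hypothetical, and the URI in the usage comment is a placeholder for the path in the error message.

```python
def split_gs_uri(uri):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    return bucket, prefix


def list_objects_under(uri, client=None):
    """Return the names of all objects stored under a gs:// prefix."""
    if client is None:
        # Imported lazily so split_gs_uri works without the library installed.
        from google.cloud import storage
        client = storage.Client()
    bucket, prefix = split_gs_uri(uri)
    return [blob.name for blob in client.list_blobs(bucket, prefix=prefix)]


# Usage (placeholder URI; substitute the exact path from the error message):
# print(list_objects_under("gs://<bucket>/<job-directory>/model"))
```

An empty list would be consistent with the "no files to copy" message, while a permission error raised by the call would point to an access problem instead.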

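In general, there are two possible problems:

  1. You are trying to read from the wrong place.
  2. You do not have read permission (you may have no permissions at all, or write-only access).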
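If the first problem is the cause, a likely explanation is that Vertex AI looks for the model under the job's own output directory (the timestamped path in the error), not under the custom-model-test/ path that the training code uploads to. For custom training jobs, Vertex AI passes that location to the container in the AIP_MODEL_DIR environment variable. The following is a minimal sketch, assuming that variable is set in the training container. resolve_model_dir is a hypothetical helper, and the fallback is the local path from the question.

```python
import os


def resolve_model_dir(fallback="/tmp/custom-model-test"):
    """Return the directory the trained model should be saved to.

    Prefers AIP_MODEL_DIR (assumed to be set by Vertex AI inside the
    training container); falls back to a local path otherwise.
    """
    return os.environ.get("AIP_MODEL_DIR") or fallback


model_dir = resolve_model_dir()
print("model will be saved to", model_dir)

# In the training script, the model would then be saved to that location
# instead of a hard-coded path, for example:
# optimized_model.save(model_dir)
```

Saving there would put saved_model.pb and variables/ directly under the .../model path that the job tries to copy from, rather than under a tmp/ subdirectory of a different prefix.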