I am using MLflow to track my experiments, with an S3 bucket as the artifact store. To access it I want to use proxied artifact access as described in the docs, but it does not work for me because the credentials are looked up locally (even though the server is supposed to handle this).
As described in the docs, I expect that I do not need to specify my AWS credentials locally, because the server handles this for me. From the docs:
This eliminates the need to allow end users to have direct path access to a remote object store (e.g., s3, adls, gcs, hdfs) for artifact handling and eliminates the need for an end-user to provide access credentials to interact with an underlying object store.
Whenever I run an experiment on my machine, I get the following error:
botocore.exceptions.NoCredentialsError: Unable to locate credentials
So the error occurs locally. But that should not happen: the server is supposed to handle authentication, not me storing my credentials locally. In fact, I would expect not to even need the boto3 library installed locally.
I know that I need to create a new experiment, because existing experiments may still use a different artifact location, as suggested in this SO answer and in a note in the docs. Creating a new experiment did not fix my error, though. Whenever I run the experiment, I get an explicit log line in the console confirming that a new experiment was created:
INFO mlflow.tracking.fluent: Experiment with name 'test' does not exist. Creating a new experiment.
The server runs in a Kubernetes pod and is started as follows:
mlflow server \
--host 0.0.0.0 \
--port 5000 \
--backend-store-uri postgresql://user:pw@endpoint \
--artifacts-destination s3://my_bucket/artifacts \
--serve-artifacts \
--default-artifact-root s3://my_bucket/artifacts
If I port-forward the server to my local machine, I can see the MLflow UI. There I also see the failed experiment runs, caused by the error above.
The relevant part of my code that fails is the logging of the model:
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("test2")
...
# this works
mlflow.log_params(hyperparameters)
model = self._train(model_name, hyperparameters, X_train, y_train)
y_pred = model.predict(X_test)
self._evaluate(y_test, y_pred)
# this fails with the error from above
mlflow.sklearn.log_model(model, "artifacts")
I am probably overlooking something. Do I need to indicate locally that I want to use proxied artifact access? If so, how do I do that? Is there something I have missed?
Full traceback:
  File "/dir/venv/lib/python3.9/site-packages/mlflow/models/model.py", line 295, in log
    mlflow.tracking.fluent.log_artifacts(local_path, artifact_path)
  File "/dir/venv/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 726, in log_artifacts
    MlflowClient().log_artifacts(run_id, local_dir, artifact_path)
  File "/dir/venv/lib/python3.9/site-packages/mlflow/tracking/client.py", line 1001, in log_artifacts
    self._tracking_client.log_artifacts(run_id, local_dir, artifact_path)
  File "/dir/venv/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 346, in log_artifacts
    self._get_artifact_repo(run_id).log_artifacts(local_dir, artifact_path)
  File "/dir/venv/lib/python3.9/site-packages/mlflow/store/artifact/s3_artifact_repo.py", line 141, in log_artifacts
    self._upload_file(
  File "/dir/venv/lib/python3.9/site-packages/mlflow/store/artifact/s3_artifact_repo.py", line 117, in _upload_file
    s3_client.upload_file(Filename=local_file, Bucket=bucket, Key=key, ExtraArgs=extra_args)
  File "/dir/venv/lib/python3.9/site-packages/boto3/s3/inject.py", line 143, in upload_file
    return transfer.upload_file(
  File "/dir/venv/lib/python3.9/site-packages/boto3/s3/transfer.py", line 288, in upload_file
    future.result()
  File "/dir/venv/lib/python3.9/site-packages/s3transfer/futures.py", line 103, in result
    return self._coordinator.result()
  File "/dir/venv/lib/python3.9/site-packages/s3transfer/futures.py", line 266, in result
    raise self._exception
  File "/dir/venv/lib/python3.9/site-packages/s3transfer/tasks.py", line 139, in __call__
    return self._execute_main(kwargs)
  File "/dir/venv/lib/python3.9/site-packages/s3transfer/tasks.py", line 162, in _execute_main
    return_value = self._main(**kwargs)
  File "/dir/venv/lib/python3.9/site-packages/s3transfer/upload.py", line 758, in _main
    client.put_object(Bucket=bucket, Key=key, Body=body, **extra_args)
  File "/dir/venv/lib/python3.9/site-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/dir/venv/lib/python3.9/site-packages/botocore/client.py", line 898, in _make_api_call
    http, parsed_response = self._make_request(
  File "/dir/venv/lib/python3.9/site-packages/botocore/client.py", line 921, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/dir/venv/lib/python3.9/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/dir/venv/lib/python3.9/site-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/dir/venv/lib/python3.9/site-packages/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/dir/venv/lib/python3.9/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/dir/venv/lib/python3.9/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/dir/venv/lib/python3.9/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/dir/venv/lib/python3.9/site-packages/botocore/signers.py", line 103, in handler
    return self.sign(operation_name, request)
  File "/dir/venv/lib/python3.9/site-packages/botocore/signers.py", line 187, in sign
    auth.add_auth(request)
  File "/dir/venv/lib/python3.9/site-packages/botocore/auth.py", line 407, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
The problem was that the server was started with the wrong arguments: --default-artifact-root either needs to be removed, or set to mlflow-artifacts:/.
From mlflow server --help:
--default-artifact-root URI Directory in which to store artifacts for any
new experiments created. For tracking server
backends that rely on SQL, this option is
required in order to store artifacts. Note that
this flag does not impact already-created
experiments with any previous configuration of
an MLflow server instance. By default, data
will be logged to the mlflow-artifacts:/ uri
proxy if the --serve-artifacts option is
enabled. Otherwise, the default location will
be ./mlruns.
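The key detail from the help text is that the artifact location is baked into each experiment at creation time, and the client decides how to upload based on the scheme of that URI. The following is an illustrative sketch only, not MLflow's actual source (the function name is made up): an s3:// location makes the client go to S3 directly, which is why boto3 looks for credentials on your machine.

```python
# Illustrative dispatch, assuming the documented behavior: the scheme of an
# experiment's artifact_location selects the artifact repository the client uses.
from urllib.parse import urlparse

def repo_for(artifact_location: str) -> str:
    scheme = urlparse(artifact_location).scheme
    if scheme == "s3":
        # Client talks to S3 directly -> needs boto3 and local AWS credentials
        return "S3 repository (direct access, needs local credentials)"
    if scheme == "mlflow-artifacts":
        # Client uploads over HTTP to the tracking server, which holds the credentials
        return "HTTP repository (proxied through the tracking server)"
    return "local filesystem repository"

print(repo_for("s3://my_bucket/artifacts/1"))     # what the broken config produced
print(repo_for("mlflow-artifacts:/artifacts/1"))  # what proxied access produces
```

This is why creating a new experiment only helps after the server flags are fixed: an experiment created while --default-artifact-root pointed at s3://... keeps that direct S3 location forever.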
@bk_'s answer helped me. I ended up using the following command to get my tracking server running with a proxied connection for artifact storage:
mlflow server \
--backend-store-uri postgresql://postgres:postgres@postgres:5432/mlflow \
--default-artifact-root mlflow-artifacts:/ \
--serve-artifacts \
--host 0.0.0.0
I ran into the same problem, and the accepted answer did not seem to solve it for me.
Neither removing --default-artifact-root nor setting it to mlflow-artifacts:/ instead of an s3:// URI worked for me. On top of that, I got an error because I have a remote --backend-store-uri, which requires --default-artifact-root to be set when running mlflow server.
How I solved it: the error is actually self-explanatory. It says the credentials cannot be located because, under the hood, MLflow uses boto3 for all S3 transactions. Since I had already set the environment variables in my .env file, simply loading that file was enough to fix the problem. If you are in a similar situation, just run the following before starting the MLflow server:
set -a
source .env
set +a
This loads the environment variables and you are good to go.
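For reference, such a .env file would carry the standard variables that botocore looks up. This is a hypothetical sketch with placeholder values; the region and endpoint lines are assumptions and only needed if they apply to your setup:

```shell
# Hypothetical .env -- replace the placeholder values with your own.
AWS_ACCESS_KEY_ID=changeme
AWS_SECRET_ACCESS_KEY=changeme
AWS_DEFAULT_REGION=changeme
# Only for S3-compatible stores such as MinIO:
# MLFLOW_S3_ENDPOINT_URL=http://minio:9000
```

Note that with this approach it is the server process that holds the credentials; the clients still talk to the server only.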
I was wondering how the connection to the S3 service is established in the example @Roby provided. So if the artifacts are not served from local storage, you still need to add --artifacts-destination, combined with --default-artifact-root:
mlflow server \
--host 0.0.0.0 \
--port 5000 \
--serve-artifacts \
--backend-store-uri postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB} \
--artifacts-destination s3://bucket \
--default-artifact-root mlflow-artifacts:/artifacts
The client now only needs to set the tracking URI (e.g. mlflow.set_tracking_uri("http://<workstation_name>:5000")) to track parameters and log artifacts to the MLflow server and to S3 (MinIO). Both the tracking service and the S3 service run on a remote machine.
Also: there is no need to set the MLFLOW_S3_ENDPOINT_URL environment variable.
Note: when changing these settings, you must create new experiments. Existing experiments will not pick up the new configuration.
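Putting the client side together: with a server like the one above running, a minimal logging sketch looks like this. The tracking URI and experiment name are placeholders, mlflow must be installed locally, and no boto3 or AWS credentials are needed on the client:

```python
TRACKING_URI = "http://localhost:5000"  # placeholder: your MLflow server address

def log_through_proxy(tracking_uri: str = TRACKING_URI) -> None:
    """Log a parameter and a small artifact via the tracking server proxy."""
    import mlflow  # local import so the sketch can be read without mlflow installed

    mlflow.set_tracking_uri(tracking_uri)
    # Must be an experiment created AFTER the server was reconfigured, so that
    # its artifact location uses the mlflow-artifacts:/ scheme.
    mlflow.set_experiment("proxy-test")
    with mlflow.start_run():
        mlflow.log_param("alpha", 0.1)
        # The upload goes over HTTP to the server, which writes to the bucket
        # configured via --artifacts-destination.
        mlflow.log_dict({"hello": "world"}, "artifacts/hello.json")
```

Call log_through_proxy() once the server is reachable; the run's artifacts should then appear in the configured bucket without any credentials on the client machine.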