I am trying to log a Spark model with the code snippet below. The model metrics and parameters are saved to the MLflow run, but the model itself is not saved under Artifacts. However, logging a Scikit-learn model with mlflow.sklearn.log_model() in the same environment succeeds.
Environment: Databricks 10.4 LTS ML cluster
import mlflow
import mlflow.spark
import numpy as np
from mlflow.tracking.artifact_utils import get_artifact_uri
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train, test = train_test_random_split(conf, data)
experiment_name = "/mlflow_experiments/debug_spark_model"
mlflow.set_experiment(experiment_name)
evaluator = BinaryClassificationEvaluator()
rf = RandomForestClassifier()
param_grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [15])
    .addGrid(rf.maxDepth, [6])
    .addGrid(rf.minInstancesPerNode, [7])
    .build()
)
cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=param_grid,
    evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
    numFolds=10,
)
cv_model = cv.fit(train)
# best model
model = cv_model.bestModel
# Param maps are keyed by the estimator's Param objects (rf.numTrees, etc.),
# so index with those rather than with the fitted model's Param copies.
best_param_map = cv_model.getEstimatorParamMaps()[np.argmax(cv_model.avgMetrics)]
model_params_best = {
    "numTrees": best_param_map[rf.numTrees],
    "maxDepth": best_param_map[rf.maxDepth],
    "minInstancesPerNode": best_param_map[rf.minInstancesPerNode],
}
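The best-parameter lookup above follows a common pattern: take the index of the highest cross-validation metric and use it to select the matching entry in the param-map list. A minimal plain-Python sketch of that selection (no Spark required; the metric values and param maps below are made up for illustration):

```python
# Stand-ins for cv_model.avgMetrics and cv_model.getEstimatorParamMaps():
# one average metric per candidate param map, in the same order.
avg_metrics = [0.71, 0.84, 0.78]
param_maps = [
    {"numTrees": 10, "maxDepth": 4, "minInstancesPerNode": 5},
    {"numTrees": 15, "maxDepth": 6, "minInstancesPerNode": 7},
    {"numTrees": 20, "maxDepth": 8, "minInstancesPerNode": 9},
]

# Equivalent of np.argmax(cv_model.avgMetrics) in plain Python:
# the index whose metric value is largest.
best_idx = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)

best_params = param_maps[best_idx]
print(best_params)  # → {'numTrees': 15, 'maxDepth': 6, 'minInstancesPerNode': 7}
```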
model_metrics_best, artifacts_best, predicted_df_best = train_model(
    model, train, test, evaluator
)
with mlflow.start_run(run_name="debug_run_1"):
    run_id = mlflow.active_run().info.run_id
    mlflow.log_params(model_params_best)
    mlflow.log_metrics(model_metrics_best)
    # debug 1
    artifact_path = "best_model"
    mlflow.spark.log_model(spark_model=model, artifact_path=artifact_path)
    source = get_artifact_uri(run_id=run_id, artifact_path=artifact_path)
It gives the following error:
com.databricks.mlflowdbfs.MlflowHttpException: statusCode=404 ReasonPhrase=[Not Found] bodyMessage=[{"error_code":"RESOURCE_DOES_NOT_EXIST","message":"Run 'bfe90fd5074f49c39a475b613d020cbf' not found."}]
I would appreciate any direction on debugging this error, or a solution.
Found a workaround for this error, and for most mlflowdbfs-related errors: disabling mlflowdbfs on the Databricks ML Runtime cluster resolves the error above. Another option is to use a standard (non-ML) Databricks Runtime cluster.
import os

# Disable the mlflowdbfs artifact-upload path; set this before logging the model.
os.environ["DISABLE_MLFLOWDBFS"] = "true"
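Since the variable presumably only takes effect if it is in the environment before mlflow.spark.log_model() uploads artifacts, a quick sanity check at the top of the notebook can confirm the flag is visible to the Python process. The helper name below is my own, not part of MLflow or Databricks:

```python
import os

# Set the kill switch as early as possible in the notebook.
os.environ["DISABLE_MLFLOWDBFS"] = "true"

def mlflowdbfs_disabled() -> bool:
    # Hypothetical helper: True once the flag is set in this process.
    return os.environ.get("DISABLE_MLFLOWDBFS", "").lower() == "true"

print(mlflowdbfs_disabled())  # → True
```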