How does Spark's Catalyst Optimizer choose a physical plan?

Question (votes: 0, answers: 1)

I am trying to understand how Spark's Catalyst optimizer chooses the best physical plan, and which cost function it uses in that process.


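I do understand what the optimizer does and how it is used. What I want to know is how it performs the optimization, generates different physical plans from it, and picks the best one, because I could not find any resources on how plans are costed.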

apache-spark apache-spark-sql query-optimization
1 Answer
0 votes

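If you are asking about the optimizer in general, there are many useful search results that explain it (for example, on the optimizer phases and on experimental extensions). There are also examples of very specific custom rules, and of registering them through SparkExtension (Spark's `SparkSessionExtensions` mechanism) for general use.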

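The Quality custom rule tells Spark that a search for a uuid string, where the uuid was created by joining two longs, can be sped up by converting the uuid first (row ids are handled similarly). This avoids a full table scan and allows predicate pushdown.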
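
To make the idea concrete, here is a minimal sketch in plain Python of what such a rewrite does. It is not the Quality library's implementation and it does not use Spark's API. The names `UuidEquals`, `LongEquals`, and `rewrite_uuid_predicate` are hypothetical, and the two longs are treated as unsigned 64-bit halves for simplicity (an assumption, since Spark longs are signed).

```python
import uuid
from dataclasses import dataclass

MASK64 = (1 << 64) - 1  # keep each half within 64 bits


def longs_to_uuid(hi: int, lo: int) -> str:
    """Build the uuid string from two unsigned 64-bit halves."""
    return str(uuid.UUID(int=((hi & MASK64) << 64) | (lo & MASK64)))


@dataclass(frozen=True)
class UuidEquals:
    """Hypothetical predicate: uuid_string(hi_col, lo_col) == literal.

    A data source cannot evaluate this without building the string per row.
    """
    hi_col: str
    lo_col: str
    literal: str


@dataclass(frozen=True)
class LongEquals:
    """Hypothetical predicate: column == 64-bit value (simple enough to push down)."""
    column: str
    value: int


def rewrite_uuid_predicate(pred):
    """Rule: replace one uuid-string comparison with two long comparisons."""
    if not isinstance(pred, UuidEquals):
        return [pred]  # leave every other predicate untouched
    bits = uuid.UUID(pred.literal).int  # convert the literal once, at planning time
    return [
        LongEquals(pred.hi_col, bits >> 64),
        LongEquals(pred.lo_col, bits & MASK64),
    ]


target = longs_to_uuid(1, 2)
print(rewrite_uuid_predicate(UuidEquals("id_hi", "id_lo", target)))
```

The conversion happens once on the literal rather than once per row, which is why the rewritten predicates can be handed to the data source.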

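Rules like these do not need the query to run. They only need the structure of the query itself, just like predicate pushdown and similar optimizations. Which phase to choose depends on how much optimization the built-in planning has already performed. The custom rule above is placed before the predicate pushdown logic runs, so that its results can be pushed down.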
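
The effect of rule ordering can be sketched with a toy optimizer in plain Python. This is an illustration under simplifying assumptions, not Spark's rule executor: the plan is a single scan, predicates are plain strings, and the names `Plan`, `simplify`, `push_down`, and `optimize` are hypothetical. Note that every rule works on the plan structure alone and never touches data.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Plan:
    """Toy logical plan: one table scan plus the filters around it."""
    table: str
    pushed: tuple = ()    # predicates evaluated by the data source
    residual: tuple = ()  # predicates evaluated row by row after the scan


def simplify(plan):
    """Hypothetical custom rule: rewrite 'opaque:<col>' into a pushable 'col:<col>'."""
    rewritten = tuple(
        p.replace("opaque:", "col:", 1) if p.startswith("opaque:") else p
        for p in plan.residual
    )
    return replace(plan, residual=rewritten)


def push_down(plan):
    """Toy pushdown rule: only plain column predicates can move into the scan."""
    pushable = tuple(p for p in plan.residual if p.startswith("col:"))
    rest = tuple(p for p in plan.residual if not p.startswith("col:"))
    return replace(plan, pushed=plan.pushed + pushable, residual=rest)


def optimize(plan, rules):
    """Apply each rule once, in the given order."""
    for rule in rules:
        plan = rule(plan)
    return plan


start = Plan("events", residual=("opaque:id", "col:day"))
print(optimize(start, [simplify, push_down]))  # both predicates reach the scan
print(optimize(start, [push_down, simplify]))  # the rewritten predicate arrives too late
```

With the custom rule first, both predicates end up in `pushed`. With the order reversed, the rewritten predicate stays in `residual` because pushdown has already run.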

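AQE (Adaptive Query Execution), on the other hand, does need the query to run. The description of how AQE works, taken from the Spark source, is copied below: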

/**
 * A root node to execute the query plan adaptively. It splits the query plan into independent
 * stages and executes them in order according to their dependencies. The query stage
 * materializes its output at the end. When one stage completes, the data statistics of the
 * materialized output will be used to optimize the remainder of the query.
 *
 * To create query stages, we traverse the query tree bottom up. When we hit an exchange node,
 * and if all the child query stages of this exchange node are materialized, we create a new
 * query stage for this exchange node. The new stage is then materialized asynchronously once it
 * is created.
 *
 * When one query stage finishes materialization, the rest query is re-optimized and planned based
 * on the latest statistics provided by all materialized stages. Then we traverse the query plan
 * again and create more stages if possible. After all stages have been materialized, we execute
 * the rest of the plan.
 */

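
The loop described in that comment can be imitated with a small, self-contained Python sketch. It is a toy model, not Spark's code: stages are materialized one after another instead of asynchronously, the only decision re-planned is the join strategy, and `Stage`, `choose_join`, `plan_join`, `run_adaptively`, and the 10 MiB threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # illustrative threshold: 10 MiB


@dataclass
class Stage:
    """Toy query stage with a planning-time estimate and a runtime size."""
    name: str
    estimated_bytes: int
    actual_bytes: int          # only trusted once the stage has run
    materialized: bool = False


def choose_join(left_bytes, right_bytes):
    """Pick a join strategy from whatever size information is available."""
    if min(left_bytes, right_bytes) <= BROADCAST_THRESHOLD:
        return "broadcast_hash_join"
    return "sort_merge_join"


def plan_join(left, right):
    """Use runtime sizes for materialized stages and estimates otherwise."""
    def size(stage):
        return stage.actual_bytes if stage.materialized else stage.estimated_bytes
    return choose_join(size(left), size(right))


def run_adaptively(left, right):
    """Re-plan the join after each child stage materializes and record each choice."""
    history = [plan_join(left, right)]  # initial plan, estimates only
    for stage in (left, right):
        stage.materialized = True       # stands in for actually running the stage
        history.append(plan_join(left, right))
    return history


GIB = 1024 ** 3
orders = Stage("orders", estimated_bytes=5 * GIB, actual_bytes=4 * GIB)
customers = Stage("filtered_customers", estimated_bytes=2 * GIB, actual_bytes=1024 ** 2)
print(run_adaptively(orders, customers))
```

In this run the plan starts as a sort-merge join and switches to a broadcast join only after the second stage reports its real size. A decision like this cannot be made from the query structure alone.
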
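Which optimizations you "plan" to implement will depend on what you need to do. That said, I would suggest thinking seriously about whether the effort is worth it.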
