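I am trying to understand how Spark's Catalyst optimizer selects the best physical plan, and what cost function it uses in the process.
I understand what the optimizer does and how it is used, but I would like to know how it actually performs the optimization: how it generates the different physical plans and then chooses the best one. I have not been able to find any resources on how plans are costed.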
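If you are asking about the optimizer in general, there are many useful search results that explain it (for example, phases and experimental extensions). Some very specific custom rules can also be found, together with how to register them for general use with SparkExtension (the `SparkSessionExtensions` API).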
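The Quality custom rule tells Spark that a uuid string created by concatenating two longs can be searched by converting the uuid first (row ids are handled similarly), which speeds up the computation. This avoids a table scan and allows predicate pushdown.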
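As a rough illustration of what writing and registering a custom optimizer rule can look like, here is a minimal sketch. It is not the uuid rule described above: `RemoveMultiplyByOne`, `ExampleExtensions` and `CustomRuleDemo` are hypothetical names made up for this example, and the rule itself (rewriting `x * 1` to `x`) is only a stand-in. It assumes Spark 3.x on the classpath.

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: rewrites `x * 1` to `x`.
// It only inspects the plan tree; no data is read.
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  // True only for integral literals equal to one.
  private def isOne(e: Expression): Boolean = e match {
    case Literal(v: Int, _)  => v == 1
    case Literal(v: Long, _) => v == 1L
    case _                   => false
  }

  override def apply(plan: LogicalPlan): LogicalPlan =
    plan.transformAllExpressions {
      case m: Multiply if isOne(m.right) => m.left
    }
}

// Hypothetical extension class that adds the rule to the optimizer.
class ExampleExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectOptimizerRule(_ => RemoveMultiplyByOne)
}

object CustomRuleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("custom-rule-demo")
      .withExtensions(new ExampleExtensions) // register programmatically
      .getOrCreate()

    val df = spark.range(3).selectExpr("id * 1 AS id")
    // Print the optimized logical plan; this does not run a job.
    println(df.queryExecution.optimizedPlan)
    spark.stop()
  }
}
```

The same class can instead be registered by setting the `spark.sql.extensions` configuration to its fully qualified class name, which is the usual route when the rule should apply to sessions you do not build yourself.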
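Those rules do not need the query to run. They only need the structure of the query itself, the same as predicate pushdown and similar optimizations. The exact phase to choose depends on how much optimization the built-in planning has already performed. The custom rule above is placed before the predicate pushdown logic runs, so that its result can be pushed down.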
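To see what the optimizer and planner produce from the query structure alone, the plans can be inspected before any action is called. The following is a small sketch, assuming Spark 3.0 or later (for `explain` with a mode argument); the object name `InspectPlans` and the query are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object InspectPlans {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("inspect-plans")
      .getOrCreate()

    // Illustrative query; no action (collect, count, ...) is called on it.
    val df = spark.range(0, 1000)
      .filter("id % 2 = 0")
      .selectExpr("id + 1 AS next")

    val qe = df.queryExecution
    println(qe.analyzed)      // logical plan after analysis
    println(qe.optimizedPlan) // logical plan after the optimizer rules
    println(qe.sparkPlan)     // physical plan chosen by the planner
    println(qe.executedPlan)  // physical plan after preparation rules

    // Prints the optimized logical plan with statistics, where available.
    df.explain("cost")

    spark.stop()
  }
}
```

The statistics printed by `explain("cost")` are the inputs to any cost-based decisions, so this is a reasonable starting point for seeing what the optimizer knows about a query. Cost-based optimization over table statistics is controlled by `spark.sql.cbo.enabled`.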
AQE, in contrast, does need the query to run. See here for a description of how AQE works (copied below):
/**
* A root node to execute the query plan adaptively. It splits the query plan into independent
* stages and executes them in order according to their dependencies. The query stage
* materializes its output at the end. When one stage completes, the data statistics of the
* materialized output will be used to optimize the remainder of the query.
*
* To create query stages, we traverse the query tree bottom up. When we hit an exchange node,
* and if all the child query stages of this exchange node are materialized, we create a new
* query stage for this exchange node. The new stage is then materialized asynchronously once it
* is created.
*
* When one query stage finishes materialization, the rest query is re-optimized and planned based
* on the latest statistics provided by all materialized stages. Then we traverse the query plan
* again and create more stages if possible. After all stages have been materialized, we execute
* the rest of the plan.
*/
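A minimal way to observe this behaviour is to enable AQE and compare the plan printed before and after an action. This is only a sketch, assuming Spark 3.x; `AqeDemo` and the query are made up for illustration, and the exact plan text varies by version.

```scala
import org.apache.spark.sql.SparkSession

object AqeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("aqe-demo")
      .config("spark.sql.adaptive.enabled", "true") // turn AQE on explicitly
      .getOrCreate()

    // Illustrative join plus aggregation, so the plan contains an exchange.
    val left = spark.range(0, 100000).withColumnRenamed("id", "k")
    val right = spark.range(0, 1000).withColumnRenamed("id", "k")
    val joined = left.join(right, "k").groupBy("k").count()

    joined.explain() // initial plan, before any stage has been materialized
    joined.collect() // runs the query; stages are materialized and re-planned
    joined.explain() // plan after execution, reflecting runtime statistics

    spark.stop()
  }
}
```

In Spark 3.x the root of the printed plan is typically an `AdaptiveSparkPlan` node carrying an `isFinalPlan` flag, which is how you can tell whether the plan shown is the initial one or the one that resulted from re-optimization.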
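The optimizations you "plan" to perform will depend on what you need to do. That said, I would suggest thinking carefully about whether the effort is worth it.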