EMR cluster shows too many executors when Spark dynamic allocation is true

Question

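I am running a Spark job in cluster mode on EMR 5.27.0. EMR's Spark dynamic allocation property is set to true.

Now, when I start a Spark job, or even just launch the Spark shell, I can see many executors being launched in the Spark UI.

Why does this happen even when I am only using spark-shell?

I have tried several things, such as setting properties like spark.dynamicAllocation.initialExecutors = 1, but without success.

Please help me understand this behavior.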

Answer 1

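When Spark reads a file from HDFS, it creates one partition per input split. The input splits are determined by the Hadoop InputFormat used to read the file. For example, if you use textFile(), the InputFormat is Hadoop's TextInputFormat, which returns one partition per HDFS block (although the boundary between partitions falls on a line break rather than on the exact block boundary), unless the text file is compressed. For a compressed file, the number of partitions depends on the compression type.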
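To make the block-to-partition relationship concrete, here is a rough sketch of the arithmetic. The function name, the 128 MB block size, and the set of splittable codecs are assumptions for illustration; the estimate also ignores Hadoop's split slop and the minPartitions argument of textFile(), so treat it as an approximation rather than the exact number Spark will report.

```python
import math

MB = 1024 * 1024

# Assumed for illustration: uncompressed text and bzip2 can be split,
# while gzip cannot and therefore ends up in a single partition per file.
SPLITTABLE_CODECS = {None, "bzip2"}


def estimate_partitions(file_size_bytes, block_size_bytes=128 * MB, codec=None):
    """Rough estimate of the partitions produced when reading one file."""
    if file_size_bytes <= 0:
        raise ValueError("file_size_bytes must be positive")
    if codec not in SPLITTABLE_CODECS:
        # A non-splittable file is read as a whole by one task.
        return 1
    # One partition per block, rounding up for the final partial block.
    return math.ceil(file_size_bytes / block_size_bytes)


print(estimate_partitions(1024 * MB))                # 1 GiB of plain text
print(estimate_partitions(1024 * MB, codec="gzip"))  # same size, gzip-compressed
```

With these assumptions, a 1 GiB plain text file maps to 8 partitions, while the same file compressed with gzip maps to 1.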

A few other parameters also affect the number of partitions. The following one generally applies to RDDs rather than DataFrames.

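spark.default.parallelism - default number of partitions in RDDs returned by transformations such as join, reduceByKey, and parallelize when not set by the user. Its default value depends on the operation: for distributed shuffle operations like reduceByKey and join, it is the largest number of partitions in a parent RDD; for operations like parallelize with no parent RDDs, it depends on the cluster manager: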
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

and

spark.sql.shuffle.partitions - controls the number of partitions for operations on DataFrames (default is 200)

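Once the number of partitions is defined, each partition is processed by one task, and each task runs on an executor instance. With dynamic allocation, the number of executor instances is driven by the number of partitions (that is, the number of tasks waiting to run), which can change at every stage of the DAG execution.

If you want to control the number of executors while dynamic allocation is turned on, you can set the following properties in the Spark defaults configuration file.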
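The relationship between tasks and executors can be approximated with a little arithmetic. This is a simplified model, not Spark's actual scheduler code: the function name and the default of 4 cores per executor are assumptions, and the model ignores the gradual ramp-up and idle timeouts that real dynamic allocation applies.

```python
import math


def target_executors(pending_and_running_tasks, executor_cores=4, task_cpus=1,
                     min_executors=0, max_executors=None):
    """Approximate how many executors dynamic allocation will aim for."""
    # Each executor can run this many tasks at the same time.
    tasks_per_executor = max(1, executor_cores // task_cpus)
    needed = math.ceil(pending_and_running_tasks / tasks_per_executor)
    # Clamp to the configured bounds; no upper bound means "infinity".
    if max_executors is not None:
        needed = min(needed, max_executors)
    return max(needed, min_executors)


# A stage with 200 partitions (for example after a DataFrame shuffle).
print(target_executors(200))
# The same stage with spark.dynamicAllocation.maxExecutors set to 10.
print(target_executors(200, max_executors=10))
```

Under these assumptions, a 200-partition stage asks for 50 executors when no upper bound is set, and for 10 once the upper bound is 10.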

Property                                 | Default                              | Meaning
spark.dynamicAllocation.initialExecutors | spark.dynamicAllocation.minExecutors | Initial number of executors to run if dynamic allocation is enabled.
spark.dynamicAllocation.maxExecutors     | infinity                             | Upper bound for the number of executors if dynamic allocation is enabled.
spark.dynamicAllocation.minExecutors     | 0                                    | Lower bound for the number of executors if dynamic allocation is enabled.

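You should set spark.dynamicAllocation.maxExecutors to cap the number of executors that can be provisioned on the EMR cluster.

For the default configuration of an EMR cluster, see the documentation here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html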
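If you would rather pass these properties per session than edit the defaults file, they can be supplied as --conf arguments to spark-shell or spark-submit. The helper below is a hypothetical sketch that only assembles the command line; the values 1, 1 and 10 are placeholders, and the ordering check is a sanity check assumed here for illustration.

```python
def dynamic_allocation_args(min_executors, initial_executors, max_executors):
    """Build --conf arguments that bound dynamic allocation (hypothetical helper)."""
    if not 0 <= min_executors <= initial_executors <= max_executors:
        raise ValueError("expected 0 <= min <= initial <= max")
    settings = {
        "spark.dynamicAllocation.minExecutors": min_executors,
        "spark.dynamicAllocation.initialExecutors": initial_executors,
        "spark.dynamicAllocation.maxExecutors": max_executors,
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Print a spark-shell command line capped at 10 executors (placeholder values).
print(" ".join(["spark-shell"] + dynamic_allocation_args(1, 1, 10)))
```

The same three key=value pairs can be placed in the Spark defaults configuration file instead, in which case they apply to every job on the cluster.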
