Spark in YARN mode ends with "Exit status: -100. Diagnostics: Container released on a *lost* node"

Question (votes: 0, answers: 8)

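I am trying to load a database with 1 TB of data into Spark on AWS using the latest EMR. The job runs for a very long time: it had not finished even after 6 hours, and after about 6h30m I get errors saying that a container was released on a *lost* node, and then the job fails. The log looks like this: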

16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node


I know that someone posted the same question 6 months ago (spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released), but I still have to ask because nobody answered it.

apache-spark hadoop-yarn emr


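If you are using Spot instances, be aware that a Spot instance is shut down when the price rises above the price you bid, and then you will run into this problem. This happens even if you only use Spot instances for the worker (slave) nodes. So my solution is not to use any Spot instances for long-running jobs.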

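Another idea is to split the job into many independent steps, so that you can save the result of each step as a file on S3. If an error occurs, just restart from the failed step, using the files already saved by the earlier steps.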
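The step-splitting idea could look roughly like the sketch below. It is only an illustration under assumptions: `run_steps` and the step names are hypothetical, a local directory stands in for the S3 location, and an empty `_SUCCESS` marker file (the same convention Hadoop and Spark use for finished output) records that a step completed.

```python
import os


def run_steps(steps, out_root):
    """Run (name, func) steps in order, skipping steps that already finished.

    Each func receives the directory where it should write its output.
    Returns the names of the steps that were actually executed in this run.
    """
    executed = []
    for name, func in steps:
        step_dir = os.path.join(out_root, name)
        marker = os.path.join(step_dir, "_SUCCESS")
        if os.path.exists(marker):
            # Output from an earlier run is still there, so reuse it.
            continue
        os.makedirs(step_dir, exist_ok=True)
        func(step_dir)
        # The marker is written only after the step returned without raising.
        with open(marker, "w"):
            pass
        executed.append(name)
    return executed
```

If a step raises (for example because its nodes were lost), rerunning `run_steps` with the same `out_root` skips every step that already has a marker and resumes at the step that failed.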


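Are you using dynamic allocation? I had a similar problem and fixed it by switching to static allocation, calculating the executor memory, the executor cores, and the number of executors myself. Try static allocation for huge workloads in Spark.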
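For what it's worth, the calculation can be sketched like this. It follows a common rule of thumb, not anything specific to this answer: keep 1 core and 1 GiB per node for the OS and daemons, use about 5 cores per executor, leave one executor slot for the YARN application master, and reserve roughly 10% (at least 384 MiB) of each container as memory overhead. The function name and the node sizes in the usage note are hypothetical.

```python
def static_allocation(nodes, cores_per_node, mem_gib_per_node,
                      cores_per_executor=5, overhead_percent=10):
    """Derive static Spark executor settings from the worker node shape."""
    usable_cores = cores_per_node - 1        # leave 1 core for OS/daemons
    usable_mem_mib = (mem_gib_per_node - 1) * 1024  # leave 1 GiB for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node too small for the requested cores per executor")
    # One executor slot is left free for the YARN application master.
    instances = nodes * executors_per_node - 1
    container_mib = usable_mem_mib // executors_per_node
    # Split the container into heap plus overhead (overhead is at least 384 MiB).
    heap_mib = container_mib * 100 // (100 + overhead_percent)
    overhead_mib = max(384, container_mib - heap_mib)
    heap_mib = container_mib - overhead_mib
    return {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(instances),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{heap_mib}m",
        # On Spark versions before 2.3 the key is
        # spark.yarn.executor.memoryOverhead instead.
        "spark.executor.memoryOverhead": f"{overhead_mib}m",
    }
```

For example, `static_allocation(10, 16, 64)` gives 29 executors with 5 cores, a 19549m heap, and 1955m overhead each. Each entry can be passed to `spark-submit` as `--conf key=value`. Treat the numbers as a starting point to tune, not a definitive answer.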


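This means your YARN container went down. To debug what happened, you have to read the YARN logs, using the official CLI:

yarn logs -applicationId <application ID>
or feel free to use and contribute to my project, a YARN viewer as a web app.

You should see a lot of worker errors.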
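If you only have the driver log, the application ID can be derived from any container ID in it, since YARN container IDs embed the cluster timestamp and application number. A small sketch (the helper names are hypothetical, and the optional `e<N>_` epoch part of newer container IDs is handled as an assumption about their format):

```python
import re

# container_<clusterTimestamp>_<appNumber>_<attempt>_<containerNumber>,
# optionally with an "e<epoch>_" part right after "container_".
CONTAINER_RE = re.compile(r"container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+")


def application_id(text):
    """Return the YARN application ID for the first container ID in text."""
    match = CONTAINER_RE.search(text)
    if match is None:
        return None
    return f"application_{match.group(1)}_{match.group(2)}"


def yarn_logs_command(text):
    """Build the `yarn logs` command line for the application found in text."""
    app_id = application_id(text)
    if app_id is None:
        raise ValueError("no container ID found")
    return f"yarn logs -applicationId {app_id}"
```

Applied to a line from the question's log, which mentions `container_1467389397754_0001_01_000006`, this gives `yarn logs -applicationId application_1467389397754_0001`.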



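"The issue was resolved by increasing the number of DataFrame partitions (in this case, from 1,024 to 2,048). That reduced the memory needed per partition."

So I tried increasing the number of DataFrame partitions, and that solved my problem.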
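To see why this helps, here is the arithmetic as a rough sketch. It assumes the data is spread evenly across partitions and uses the 1 TB figure from the question (taken as 1 TiB) purely for illustration; the quoted numbers of 1,024 and 2,048 partitions come from a different job.

```python
def avg_partition_mib(total_bytes, num_partitions):
    """Average partition size in MiB, assuming evenly distributed data."""
    return total_bytes / num_partitions / 2**20


total_bytes = 2**40  # 1 TiB, an assumed stand-in for "1 TB of data"
for partitions in (1024, 2048):
    size = avg_partition_mib(total_bytes, partitions)
    print(f"{partitions} partitions -> {size:.0f} MiB per partition")
```

Doubling the partition count halves the average partition size (here from 1024 MiB to 512 MiB), so each task holds less data in memory at once. In PySpark the change itself is typically a call such as `df.repartition(2048)`.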




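In my case, we were using a GCP Dataproc cluster with 2 preemptible (the default) secondary workers.

Because of their preemptible nature, the containers for tasks assigned to the secondary workers were lost after running for 3 hours, which resulted in the "Container lost" errors.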



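Check the CloudWatch metrics and the instance-state logs for the node that hosted the container: the node was either marked as unhealthy due to high disk utilization, or it had a hardware problem.

In the former case you should see a non-zero value in the "MR unhealthy nodes" metric in the AWS EMR UI, and in the latter case a non-zero value in the "MR lost nodes" metric. Note that the disk utilization threshold at which a node is marked unhealthy is a configurable YARN setting with a built-in default. Like the container logs, AWS EMR exports logs with instance-state snapshots to S3. They contain a lot of useful information, such as disk utilization, CPU utilization, memory utilization, and stack traces, so take a look at them. To find the EC2 instance ID of a node, match the IP address from the container logs against the IDs in the AWS EMR UI.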

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/
                           PRE containers/
                           PRE node/
                           PRE steps/

                           PRE applications/
                           PRE daemons/
                           PRE provision-node/
                           PRE setup-devices/

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/daemons/instance-state/
2023-09-24 13:13:33        748 console.log-2023-09-24-12-08.gz
2023-09-24 13:18:34      55742 instance-state.log-2023-09-24-12-15.gz
2023-09-24 17:33:58      60087 instance-state.log-2023-09-24-16-30.gz
2023-09-24 17:54:00      66614 instance-state.log-2023-09-24-16-45.gz
2023-09-24 18:09:01      60932 instance-state.log-2023-09-24-17-00.gz

zcat /tmp/instance-state.log-2023-09-24-16-30.gz
# amount of disk free
df -h
Filesystem        Size  Used Avail Use% Mounted on
/dev/nvme0n1p1     10G  5.7G  4.4G  57% /
/dev/nvme0n1p128   10M  3.8M  6.2M  38% /boot/efi
/dev/nvme1n1p1    5.0G   83M  5.0G   2% /emr
/dev/nvme1n1p2    1.8T  1.7T  121G  94% /mnt
/dev/nvme2n1      1.8T  1.7T  120G  94% /mnt1
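Once you have the `df -h` section of an instance-state log, a quick way to spot the offending disks is to flag every mount above the threshold. The sketch below is only an illustration: the function name is hypothetical, and the 90% default is an assumption (I believe it matches the default of YARN's `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage`, but check your cluster's configuration).

```python
def mounts_over_threshold(df_output, threshold_pct=90.0):
    """Return (mount point, use %) pairs from `df -h` output above threshold.

    threshold_pct defaults to 90.0, which is an assumed value; pass the
    threshold your cluster is actually configured with.
    """
    flagged = []
    for line in df_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        try:
            use_pct = int(fields[4].rstrip("%"))
        except ValueError:
            continue  # header row ("Use%") or some other non-data line
        if use_pct > threshold_pct:
            flagged.append((fields[5], use_pct))
    return flagged


# The df -h section from the instance-state log shown above.
SAMPLE = """
Filesystem        Size  Used Avail Use% Mounted on
/dev/nvme0n1p1     10G  5.7G  4.4G  57% /
/dev/nvme0n1p128   10M  3.8M  6.2M  38% /boot/efi
/dev/nvme1n1p1    5.0G   83M  5.0G   2% /emr
/dev/nvme1n1p2    1.8T  1.7T  121G  94% /mnt
/dev/nvme2n1      1.8T  1.7T  120G  94% /mnt1
"""

print(mounts_over_threshold(SAMPLE))
```

With the sample above, `/mnt` and `/mnt1` are flagged at 94%, which would be consistent with the node being marked unhealthy because of disk utilization.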

