I am trying to load a database containing 1 TB of data into Spark on AWS, using the latest EMR release. The job runs far too long: it had not finished after 6 hours, and after about 6h30m I got errors announcing that containers were released on *lost* nodes, and then the job failed. The log looks like this:
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
I am fairly sure my network setup works, because I have run the same script in the same environment on smaller tables.
Also, I know someone posted a question asking the same thing six months ago: spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released. But I still have to ask, because nobody answered that question.
It looks like other people have the same problem, so I am posting an answer rather than writing a comment. I am not sure this will solve the problem, but it should give you an idea.
If you use Spot Instances, you should know that a Spot Instance is shut down whenever the Spot price rises above your bid, and then you hit this problem — even if you only use Spot Instances for the worker nodes. So my solution is to not use any Spot Instances for long-running jobs.
Another idea is to split the job into many independent steps, saving each step's result as a file on S3. If any error occurs, you can restart from the step whose output was last saved.
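That step-by-step idea can be sketched as follows (a minimal sketch with a hypothetical `run_step` helper; here the checkpoint markers are local files, whereas on EMR each step would write its DataFrame to S3 and later steps would read it back):

```python
import os

def run_step(name, work, out_dir="/tmp/job-checkpoints"):
    """Run a pipeline step only if its checkpoint marker is missing."""
    os.makedirs(out_dir, exist_ok=True)
    marker = os.path.join(out_dir, name + ".done")
    if os.path.exists(marker):
        print("skipping " + name + ": checkpoint found")
        return
    work()                       # the real work, e.g. df.write.parquet(...)
    open(marker, "w").close()    # mark the step complete
    print("finished " + name)

# After a failure, rerunning the script skips the completed steps
# and resumes from the first step whose marker is missing.
run_step("step1-extract", lambda: None)
run_step("step2-join", lambda: None)
```

On EMR this maps naturally to submitting each stage as a separate cluster step, so a lost node only costs you the stage that was running, not the whole 6-hour job.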
Are you using dynamic allocation? I ran into a similar problem and fixed it by switching to static allocation: I calculated the executor memory, executor cores, and number of executors myself. Try static allocation for huge workloads in Spark.
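A rough sketch of that sizing arithmetic (the helper and the reserved-resource numbers are illustrative assumptions, following the common rule of thumb of ~5 cores per executor and one core plus 1 GB per node reserved for the OS and Hadoop daemons):

```python
def static_allocation(nodes, cores_per_node, mem_per_node_gb,
                      cores_per_executor=5, overhead_fraction=0.10):
    """Illustrative static-allocation sizing (not an official formula).

    Reserves 1 core and 1 GB per node for the OS and Hadoop daemons,
    gives each executor `cores_per_executor` cores, leaves one executor
    slot for the YARN application master, and subtracts the
    memory-overhead fraction from each executor's heap.
    """
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_per_node_gb - 1
    executors_per_node = usable_cores // cores_per_executor
    num_executors = nodes * executors_per_node - 1
    mem_per_executor = usable_mem_gb / executors_per_node
    executor_memory_gb = int(mem_per_executor * (1 - overhead_fraction))
    return num_executors, cores_per_executor, executor_memory_gb

# e.g. 10 nodes with 16 vCPUs and 122 GB RAM each
print(static_allocation(10, 16, 122))  # → (29, 5, 36)
```

With these example numbers the result would translate to `spark-submit --num-executors 29 --executor-cores 5 --executor-memory 36g`, together with `spark.dynamicAllocation.enabled=false`.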
This means your YARN container went down. To debug what happened, you have to read the YARN logs, using the official CLI:
yarn logs -applicationId <application_id>
or feel free to use (and contribute to) my project https://github.com/ebuildy/yoga, a YARN viewer as a web application.
You should see plenty of worker errors there.
I had the same problem. I found some clues in this article on DZone:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr
That problem was solved by increasing the number of DataFrame partitions (in that case, from 1,024 to 2,048), which reduces the memory needed per partition.
So I tried increasing the number of DataFrame partitions, and it solved my problem.
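The arithmetic behind that fix is simple (the numbers below are illustrative; in code the change is a `df.repartition(...)` call, or a higher `spark.sql.shuffle.partitions` setting): doubling the partition count roughly halves the data each task must hold in memory at once.

```python
def gb_per_partition(total_gb, partitions):
    """Approximate data per task, assuming a perfectly even split."""
    return total_gb / partitions

# ~1 TB of input, as in the question above
print(gb_per_partition(1024, 1024))  # → 1.0
print(gb_per_partition(1024, 2048))  # → 0.5, half the memory per task
```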
Amazon's own answer is that this is handled through resource allocation; there is no way to deal with it from the user's side.
In my case, we use a GCP Dataproc cluster with 2 preemptible (the default) secondary workers.
This is not a problem for short jobs, since both the primary and secondary workers finish quickly.
For long-running jobs, however, we observed that all the primary workers finished their assigned tasks faster than the secondary ones.
Because of their preemptible nature, the secondary workers lose their containers after their tasks have been running for 3 hours, which produces the
Container lost
errors.
I would suggest not using secondary workers for any long-running jobs.
Check the CloudWatch metrics and the instance-state logs of the node that hosted the container: either the node was marked unhealthy because of high disk utilization, or it had a hardware problem.
In the former case you should see non-zero values for the "MR unhealthy nodes" metric in the AWS EMR UI; in the latter case, non-zero values for the "MR lost nodes" metric. Note that the disk-utilization threshold is configured in YARN via
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
which defaults to 90%. As with container logs, AWS EMR exports instance-state snapshot logs to S3; they contain plenty of useful information, such as disk utilization, CPU utilization, memory utilization, and stack traces, so take a look at them. To find a node's EC2 instance ID, match the IP address from the container logs against the instances shown in the AWS EMR UI.
aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/
PRE containers/
PRE node/
PRE steps/
aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/
PRE applications/
PRE daemons/
PRE provision-node/
PRE setup-devices/
aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/daemons/instance-state/
2023-09-24 13:13:33 748 console.log-2023-09-24-12-08.gz
2023-09-24 13:18:34 55742 instance-state.log-2023-09-24-12-15.gz
...
2023-09-24 17:33:58 60087 instance-state.log-2023-09-24-16-30.gz
2023-09-24 17:54:00 66614 instance-state.log-2023-09-24-16-45.gz
2023-09-24 18:09:01 60932 instance-state.log-2023-09-24-17-00.gz
zcat /tmp/instance-state.log-2023-09-24-16-30.gz
...
# amount of disk free
df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/nvme0n1p1 10G 5.7G 4.4G 57% /
/dev/nvme0n1p128 10M 3.8M 6.2M 38% /boot/efi
/dev/nvme1n1p1 5.0G 83M 5.0G 2% /emr
/dev/nvme1n1p2 1.8T 1.7T 121G 94% /mnt
/dev/nvme2n1 1.8T 1.7T 120G 94% /mnt1
...
For more information, see the following resources.