AWS EMR - Terminated with errors: On the master instance, application provisioning failed

Question (votes: 0, answers: 4)

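I am provisioning an EMR cluster on emr-5.30.0. I run this through Terraform, and the AWS Console shows the following error when the cluster fails.

Amazon EMR cluster j-11I5FOBxxxxxx has terminated with errors at 2020-10-26 19:51 UTC with a reason of BOOTSTRAP_FAILURE.

I have no bootstrap actions. I also cannot see any logs to find out what happened: the log URI is empty, and I cannot SSH into the cluster because it has already terminated.

Any pointers would be greatly appreciated.

Here is the AWS CLI export output: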

aws emr create-cluster --auto-scaling-role EMR_AutoScaling_DefaultRole --applications Name=Spark --tags 'Account=xxx' 'Function=xxx' 'Repository=' '[email protected]' 'Slack=xxx' 'Builder=xxx' 'Environment=xxx' 'Service=xxx xxx xxx' 'Team=xxx' 'Name=xxx-xxx-xxx' --ebs-root-volume-size 100 --ec2-attributes '{"KeyName":"xxx","AdditionalSlaveSecurityGroups":[""],"InstanceProfile":"EMR_EC2_DefaultRole","ServiceAccessSecurityGroup":"sg-xxx","SubnetId":"subnet-xxx","EmrManagedSlaveSecurityGroup":"sg-xxx","EmrManagedMasterSecurityGroup":"sg-xxx","AdditionalMasterSecurityGroups":[""]}' --service-role EMR_DefaultRole --release-label emr-5.30.0 --name 'xxx-xxx-xxx' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"MASTER","InstanceType":"m5.2xlarge","Name":""},{"InstanceCount":2,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":40,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"m5.2xlarge","Name":""}]' --configurations '[{"Classification":"hadoop-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3","JAVA_HOME":"/usr/lib/jvm/java-1.8.0"}}]},{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3","JAVA_HOME":"/usr/lib/jvm/java-1.8.0"}}]}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region eu-west-2

amazon-web-services amazon-emr terraform-provider-aws
4 Answers

4 votes

The problem was caused by an incorrect JAVA_HOME setting.

"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"

Solution: check the logs in S3 under provision-node/reports; they should tell you which bootstrap step failed...
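
To make the log lookup above concrete, here is a minimal sketch of a helper that builds the S3 prefix to list. It assumes the cluster was created with a log URI (the question's cluster had none, so nothing would have been written) and that EMR lays node logs out as <log-uri>/<cluster-id>/node/<instance-id>/provision-node/. Treat that layout as an assumption to verify against your own bucket. The function name, bucket, and IDs are hypothetical placeholders.

```shell
# Build the S3 prefix under which a node's provisioning logs are expected.
# $1: cluster log URI, $2: cluster ID, $3: EC2 instance ID of the node.
# ASSUMPTION: the node/<instance-id>/provision-node layout; check your bucket.
provision_node_prefix() {
  log_uri="${1%/}"   # drop one trailing slash, if present
  printf '%s/%s/node/%s/provision-node/\n' "$log_uri" "$2" "$3"
}

# Hypothetical usage (placeholder bucket, cluster ID and instance ID):
# aws s3 ls --recursive \
#   "$(provision_node_prefix s3://my-emr-logs/elasticmapreduce j-XXXXXXXXXXXXX i-0123456789abcdef0)"
```

If the listing is empty, the first thing to check is whether a log URI was set when the cluster was created.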


0 votes

Try changing the instance type and running the cluster in a different Availability Zone to see whether the problem persists.


0 votes

I built a cluster on md5.xlarge with emr-6.2.0, and this is the JAVA_HOME there: /usr/lib/jvm/java-1.8.0-amazon-corretto.x86_64
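
Since the valid JAVA_HOME path differs between EMR releases, it can help to check a candidate path on a running node before putting it into the hadoop-env and spark-env configurations. The following is a small sketch, not an official EMR check; the function name is made up for illustration, and it only tests that the directory contains an executable bin/java.

```shell
# Succeed if the given directory looks like a usable JAVA_HOME,
# i.e. it is non-empty and contains an executable bin/java.
is_valid_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Hypothetical usage on a cluster node, with the two paths from this thread:
# for candidate in /usr/lib/jvm/java-1.8.0 /usr/lib/jvm/java-1.8.0-amazon-corretto.x86_64; do
#   if is_valid_java_home "$candidate"; then echo "ok:      $candidate"
#   else echo "missing: $candidate"; fi
# done
```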


0 votes

I got this error.

Terminated with errors
On the master instance (i-00f80ba5bed7f7aac), application provisioning failed

I followed this guide.

https://repost.aws/knowledge-center/emr-application-provisioning-error

In the application provisioning phase (that is, in puppet's stderr log), I found this error.

2023-11-19 04:15:04,560 INFO BigtopPuppeteer: setConfig reportdir: /var/log/provision-node/reports/0/82133c07-d6c4-4e8a-bbe3-1ec80befd943
2023-11-19 04:15:07,695 INFO BigtopPuppeteer: Set 1 puppet configs
2023-11-19 04:16:28,670 INFO NodeProvisionerWorkflow: Starting processing puppet report for failure message
2023-11-19 04:16:28,676 INFO DefaultPuppetYamlReportProcessor: Found latest puppet report file /var/log/provision-node/reports/0/82133c07-d6c4-4e8a-bbe3-1ec80befd943/ip-172-31-76-184.ec2.internal/202311190416.yaml
2023-11-19 04:16:28,804 INFO NodeProvisionerWorkflow: Finished processing puppet report for failure message
2023-11-19 04:16:28,848 ERROR Program: Encountered a problem while provisioning
com.amazonaws.emr.node.provisioner.puppet.api.PuppetException: Execution of '/bin/yum -d 0 -e 0 -y --releasever=2 list kernel-devel-4.14.327-246.539.amzn2.x86_64' returned 1: Error: No matching Packages to list
    at com.amazonaws.emr.node.provisioner.workflow.NodeProvisionerWorkflow.work(NodeProvisionerWorkflow.java:160) ~[node-provisioner-2.52.0.jar:?]
    at com.amazonaws.emr.node.provisioner.Program.main(Program.java:31) [node-provisioner-2.52.0.jar:?]
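
When the stderr log is long, the one line that matters is the PuppetException message. A small sketch for pulling it out is below; it assumes the message always follows the literal text "PuppetException: " as in the excerpt above, and the function and file names are hypothetical.

```shell
# Read a node-provisioner stderr log on stdin and print only the text
# that follows "PuppetException: " (the command that failed and why).
puppet_failure_message() {
  sed -n 's/^.*PuppetException: //p'
}

# Hypothetical usage on a log file downloaded from S3 (gzip-compressed there):
# zcat stderr.gz | puppet_failure_message
```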

This is what I saw when I looked up the current kernel information.

[hadoop@ip-172-31-70-193 stubs]$ cat /proc/version
Linux version 4.14.327-246.539.amzn2.x86_64 (mockbuild@ip-10-0-61-37) (gcc version 7.3.1 20180712 (Red Hat 7.3.1-17) (GCC)) #1 SMP Sun Oct 22 17:17:17 UTC 2023

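I then listed the kernel-devel package with yum, and this is what I got. The kernel-devel package in yum's refreshed package list (4.14.328-248.540) is newer than the running kernel (4.14.327-246.539). EMR's provisioning phase looks for the kernel-devel package that matches the running kernel, which it expects to exist, and no longer finds it.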

[hadoop@ip-172-31-70-193 stubs]$ sudo yum check-update -y
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
14 packages excluded due to repository priority protections

containerd.x86_64                                                 1.7.2-1.amzn2.0.1                                              amzn2extra-docker
kernel.x86_64                                                     4.14.328-248.540.amzn2                                         amzn2-core       
kernel-devel.x86_64                                               4.14.328-248.540.amzn2                                         amzn2-core       
nspr.x86_64                                                       4.35.0-1.amzn2                                                 amzn2-core       
nss.x86_64                                                        3.90.0-2.amzn2.0.1                                             amzn2-core       
nss-softokn.x86_64                                                3.90.0-6.amzn2                                                 amzn2-core       
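
The comparison above can be scripted. The sketch below reads `yum check-update` output on stdin and reports whether the kernel-devel version on offer differs from the running kernel release. The function names are hypothetical, and it assumes the three-column layout shown above and that `uname -r` is the offered version plus an architecture suffix.

```shell
# Print the kernel and kernel-devel rows (name and version) from
# `yum check-update` output read on stdin.
pending_kernel_packages() {
  awk '$1 ~ /^kernel(-devel)?\./ { print $1, $2 }'
}

# Succeed if stdin lists a kernel-devel version that does not match the
# running kernel release given as $1 (for example the output of `uname -r`).
kernel_devel_mismatch() {
  awk -v running="$1" '
    $1 ~ /^kernel-devel\./ {
      # uname -r appends the architecture, so compare by prefix
      if (index(running, $2 ".") != 1) found = 1
    }
    END { exit(found ? 0 : 1) }'
}

# Hypothetical usage on a cluster node:
# sudo yum check-update | kernel_devel_mismatch "$(uname -r)" \
#   && echo "kernel-devel on offer does not match the running kernel"
```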

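The solution is not to add any other repositories that would refresh the package list and pull in a newer kernel. A kernel update only takes effect after a reboot, and we do not reboot EMR machines.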

#sudo amazon-linux-extras install epel -y  << don't add extras in the list.
#sudo yum update -y  << don't run this command as it actually installs updated packages.
sudo yum clean all
sudo yum check-update