GCS connector on EMR fails with java.lang.ClassNotFoundException

Problem description (0 votes, 1 answer)

I created an EMR cluster following the instructions provided here on how to set up a connection to GCS, and then ran a hadoop distcp command. It keeps failing with the following error:

    2023-07-25 12:00:40,113 INFO mapreduce.Job: Task Id : attempt_1690268608656_0012_m_000002_1, Status : FAILED
    Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2637)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3324)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3356)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3407)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3375)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:486)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:163)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:809)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
    Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2541)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2635)
        ... 17 more
    2023-07-25 12:00:47,139 INFO mapreduce.Job:  map 100% reduce 0%
    2023-07-25 12:00:49,149 INFO mapreduce.Job: Job job_1690268608656_0012 failed with state FAILED due to: Task failed task_1690268608656_0012_m_000001
    Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
    2023-07-25 12:00:49,216 INFO mapreduce.Job: Counters: 12
        Job Counters
            Failed map tasks=11
            Killed map tasks=20
            Launched map tasks=12
            Other local map tasks=12
            Total time spent by all maps in occupied slots (ms)=5936160
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=61835
            Total vcore-milliseconds taken by all map tasks=61835
            Total megabyte-milliseconds taken by all map tasks=189957120
        Map-Reduce Framework
            CPU time spent (ms)=0
            Physical memory (bytes) snapshot=0
            Virtual memory (bytes) snapshot=0
    2023-07-25 12:00:49,218 ERROR tools.DistCp: Exception encountered
    java.io.IOException: DistCp failure: Job job_1690268608656_0012 has failed: Task failed task_1690268608656_0012_m_000001
    Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
        at org.apache.hadoop.tools.DistCp.waitForJobCompletion(DistCp.java:230)
        at org.apache.hadoop.tools.DistCp.execute(DistCp.java:185)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:153)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:441)
    2023-07-25 12:00:49,225 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
    2023-07-25 12:00:49,226 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
    2023-07-25 12:00:49,226 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

Please comment if you need any further details.

I downloaded the latest gcs-connector jar and the GCS service-account JSON key file, then did the following manual setup:

1. Updated core-site.xml with the following properties:

        <property>
          <name>fs.AbstractFileSystem.gs.impl</name>
          <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
        </property>
        <property>
          <name>fs.gs.impl</name>
          <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
          <description>The FileSystem for gs: (GCS) uris.</description>
        </property>
        <property>
          <name>google.cloud.auth.service.account.json.keyfile</name>
          <value>/tmp/service_account.json</value>
        </property>
        <property>
          <name>google.cloud.auth.service.account.enable</name>
          <value>true</value>
        </property>
        <property>
          <name>fs.gs.status.parallel.enable</name>
          <value>true</value>
        </property>

2. Updated HADOOP_CLASSPATH with the gcs-connector location.
3. Added the gcs-connector jar location to mapred-site.xml under the property mapreduce.application.classpath (a sketch of steps 2 and 3 follows this list).
4. Added the following properties for Spark:

        # The AbstractFileSystem for 'gs:' URIs
        spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
        # Optional. Google Cloud Project ID with access to GCS buckets.
        # Required only for list buckets and create bucket operations.
        spark.hadoop.fs.gs.project.id=
        # Whether to use a service account for GCS authorization. Setting this
        # property to `false` will disable use of service accounts for authentication.
        spark.hadoop.google.cloud.auth.service.account.enable=true
        # The JSON keyfile of the service account used for GCS
        # access when google.cloud.auth.service.account.enable is true.
        spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/keyfile
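For reference, a minimal sketch of what steps 2 and 3 can look like on a node. The jar path and file name here are assumptions; point them at wherever the connector jar actually lives on the cluster:

    # Step 2: make the connector visible to the Hadoop client JVM.
    # (Jar path is hypothetical -- adjust to the actual download location.)
    export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:/usr/lib/hadoop/lib/gcs-connector-shaded.jar"

    # Step 3: append the same jar to mapreduce.application.classpath in
    # /etc/hadoop/conf/mapred-site.xml so YARN task containers also load it:
    #
    #   <property>
    #     <name>mapreduce.application.classpath</name>
    #     <value><!-- keep existing value -->,/usr/lib/hadoop/lib/gcs-connector-shaded.jar</value>
    #   </property>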
I also made the gcs-connector jar executable, tried both of the plain (unshaded) jars found on the Google docs site, and tried the latest shaded jar. Furthermore, hadoop fs -ls gs://my_bucket lists files fine, and hadoop fs -cp similarly works. It only fails during the MapReduce job.
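That symptom (client-side fs commands work, map tasks fail) points at the task containers' classpath rather than the client's. Since the stack trace shows DistCp running through ToolRunner, one way to test this is to ship the connector jar with the job itself via the generic -libjars option. This is only a sketch; the jar path and bucket names are placeholders:

    # Ship the shaded connector jar to this job's map task containers only.
    # Paths and buckets below are hypothetical.
    hadoop distcp \
      -libjars /usr/lib/hadoop/lib/gcs-connector-shaded.jar \
      s3://my-source-bucket/data gs://my_bucket/data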

amazon-web-services google-cloud-storage amazon-emr distcp s3distcp
1 Answer (0 votes)
The gcs connector and gcs_service_account.json had only been set up on the name node, not on the worker nodes. Setting everything up on the workers as well fixed it. Lesson learned: on EMR, if extra dependencies are needed, always use a bootstrap script when setting up the cluster.
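Building on that lesson, a minimal sketch of such a bootstrap action is below. EMR runs bootstrap actions on every node (master, core, and task) before applications start, which is exactly what is needed here; the S3 bucket and file names are assumptions:

    #!/bin/bash
    # Hypothetical EMR bootstrap action: copy the GCS connector jar and the
    # service-account key to fixed local paths on EVERY node.
    set -euo pipefail

    sudo mkdir -p /opt/gcs
    aws s3 cp s3://my-setup-bucket/gcs-connector-shaded.jar /opt/gcs/
    aws s3 cp s3://my-setup-bucket/service_account.json /tmp/service_account.json

    # core-site.xml and mapreduce.application.classpath should then reference
    # /opt/gcs/gcs-connector-shaded.jar and /tmp/service_account.json, e.g. via
    # an EMR configuration classification supplied at cluster creation.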
    
