hadoop distcp via Java fails with NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

Problem description

I am trying to use the Hadoop Java libraries to run a distcp command on my Hadoop cluster, moving content from HDFS to a Google Cloud bucket. I am getting the error:

NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

Below is my Java code:

import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class HadoopHelper {

    private static final Logger logger = LoggerFactory.getLogger(HadoopHelper.class);

    private static final String FS_DEFAULT_FS = "fs.defaultFS";

    private final Configuration conf;

    public HadoopHelper(String hadoopUrl) {
        conf = new Configuration();
        conf.set(FS_DEFAULT_FS, "hdfs://" + hadoopUrl);
    }

    public void distCP(JsonArray files, String target) {

        try {
            List<Path> srcPaths = new ArrayList<>();

            for (JsonElement file : files) {
                String srcPath = file.getAsString();
                srcPaths.add(new Path(srcPath));
            }

            DistCpOptions options = new DistCpOptions.Builder(
                    srcPaths,
                    new Path("gs://" + target)
            ).build();

            logger.info("Using distcp to copy {} to gs://{}", files, target);

            this.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            this.conf.set("fs.gs.auth.service.account.email", "[email protected]");
            this.conf.set("fs.gs.auth.service.account.keyfile", "config/my-svc-account-keyfile.p12");
            this.conf.set("fs.gs.project.id", "my-gcp-project");


            DistCp distCp = new DistCp(this.conf, options);
            Job job = distCp.execute();

            job.waitForCompletion(true);

            logger.info("Distcp operation success. Exiting");
        } catch (Exception e) {
            logger.error("Error while trying to execute distcp", e);
            logger.error("Distcp operation failed. Exiting");
            throw new IllegalArgumentException("Distcp failed");
        }
    }

    public void createDirectory() throws IOException {
        FileSystem fileSystem = FileSystem.get(this.conf);
        fileSystem.mkdirs(new Path("/user/newfolder"));
        logger.info("Done");
    }
}

I have added the following dependencies to my pom.xml:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-distcp</artifactId>
        <version>3.3.1</version>
    </dependency>
    <dependency>
        <groupId>com.google.cloud.bigdataoss</groupId>
        <artifactId>gcs-connector</artifactId>
        <version>hadoop3-2.2.4</version>
    </dependency>
    <dependency>
        <groupId>com.google.cloud.bigdataoss</groupId>
        <artifactId>util</artifactId>
        <version>2.2.4</version>
    </dependency>

If I run the distcp command on the cluster itself, like this:

hadoop distcp /user gs://my_bucket_name/

the distcp operation works and the content is copied to the cloud bucket.

Tags: java, hadoop, hdfs

2 Answers

0 votes

Have you added the jar to Hadoop's classpath?

Add the connector jar to Hadoop's classpath: placing the connector jar in the HADOOP_COMMON_LIB_JARS_DIR directory should be enough for Hadoop to load it. Alternatively, to be certain that the jar is loaded, you can add HADOOP_CLASSPATH=$HADOOP_CLASSPATH: to hadoop-env.sh in the Hadoop configuration directory.

This needs to be done on the DistCp conf (this.conf in your code) before this line:

this.conf.set("HADOOP_CLASSPATH","$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar")
DistCp distCp = new DistCp(this.conf, options);
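
As a quick sanity check, here is a minimal standalone sketch (the GcsConnectorCheck class name is just illustrative) that tries to load and initialize the connector class named in the error. It helps tell a jar that is missing from the classpath apart from a class that is present but fails its static initialization:

public class GcsConnectorCheck {

    public static void main(String[] args) {
        try {
            // Loads and initializes the class named in the NoClassDefFoundError.
            Class.forName("com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            System.out.println("GCS connector class loaded and initialized");
        } catch (ClassNotFoundException e) {
            // The connector jar is not on the classpath at all.
            System.out.println("Connector jar not found on the classpath: " + e);
        } catch (LinkageError e) {
            // The class was found but its static initialization failed, which
            // matches the "Could not initialize class" message in the question.
            System.out.println("Connector present but failed to initialize: " + e);
        }
    }
}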

In case it helps, there is a troubleshooting section.


0 votes

I ran into the same issue and fixed it by adding this configuration to the Spark session:

sc.hadoopConfiguration.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

You can read more at this link.
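
For a plain Hadoop Java setup like the one in the question (no Spark session), the equivalent would presumably be to set the same property on the Configuration that is passed to DistCp. A minimal sketch, using only the property names and class names that already appear in this thread (GcsConfExample is a hypothetical helper name):

import org.apache.hadoop.conf.Configuration;

public class GcsConfExample {

    // Sketch: register both the FileSystem (fs.gs.impl) and the
    // AbstractFileSystem (fs.AbstractFileSystem.gs.impl) implementations
    // for the gs:// scheme on the Configuration handed to DistCp.
    public static Configuration withGcsSupport(Configuration conf) {
        conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
        return conf;
    }
}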