How to correctly set up Google Cloud Storage for a Spark application with AWS Data Pipeline

Question (3 votes, 1 answer)

I am setting up a cluster step to run a Spark application with AWS Data Pipeline. The job reads data from S3, processes it, and writes the result to Google Cloud Storage. For Google Cloud Storage I am using a service account with a key file. However, the write step complains that the key file cannot be found. I have tried several approaches, but none of them worked. The application runs fine when I launch it without Data Pipeline.

Here is what I have tried:

google.cloud.auth.service.account.json.keyfile = "/home/hadoop/gs_test.json"

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,/home/hadoop/app.jar,s3://myBucket/app.conf

google.cloud.auth.service.account.json.keyfile = "/home/hadoop/gs_test.json"

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,--files,/home/hadoop/gs_test.json, /home/hadoop/app.jar,s3://myBucket/app.conf

google.cloud.auth.service.account.json.keyfile = "gs_test.json"

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,--files,/home/hadoop/gs_test.json#gs_test.json, /home/hadoop/app.jar,s3://myBucket/app.conf
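For reference, Spark forwards any property prefixed with spark.hadoop. to the underlying Hadoop configuration, so the keyfile setting can also be passed directly on the step itself instead of through a separate configuration file. The step below is only a sketch of that approach, reusing the same paths and property values as the attempts above; it is not a verified fix:

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--conf,spark.hadoop.google.cloud.auth.service.account.enable=true,--conf,spark.hadoop.google.cloud.auth.service.account.json.keyfile=/home/hadoop/gs_test.json,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,/home/hadoop/app.jar,s3://myBucket/app.conf

One thing to check with this layout is that /home/hadoop/gs_test.json is readable on every node that runs tasks, not only on the master; otherwise the file can be shipped with --files as in the third attempt above.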

Here is the error:

java.io.FileNotFoundException: /home/hadoop/gs_test.p12 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at com.google.api.client.googleapis.auth.oauth2.GoogleCredential$Builder.setServiceAccountPrivateKeyFromP12File(GoogleCredential.java:670)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromPrivateKeyServiceAccount(CredentialFactory.java:234)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:90)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1816)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:966)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:113)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:88)
at org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter.<init>(DirectFileOutputCommitter.java:31)
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputCommitter(FileOutputFormat.java:310)
at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:36)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:146)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:246)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Any ideas on how to correctly set up Google Cloud Storage for a Spark application with AWS Data Pipeline? Your help is much appreciated.

apache-spark google-cloud-storage google-cloud-dataproc amazon-data-pipeline spark-submit
1 Answer (0 votes)

If I understand correctly: you want to use GCS (gs:// style URLs) from a Spark job running outside Dataproc.

In that case you will have to install the GCS connector so that the gs:// URL scheme becomes available: https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/README.md

Installation and setup instructions are in the GitHub link above.
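Once the connector jar from that README is on the cluster, the setup mostly comes down to registering the gs:// filesystem implementation and pointing the connector at the service-account key. A minimal spark-submit sketch, assuming the jar was downloaded to /home/hadoop/gcs-connector-hadoop2-latest.jar (the jar path and version are placeholders; use whatever the README recommends for your Hadoop version):

spark-submit \
  --master yarn --deploy-mode client \
  --jars /home/hadoop/appHelper.jar,/home/hadoop/gcs-connector-hadoop2-latest.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/home/hadoop/gs_test.json \
  --num-executors 5 --executor-cores 3 --executor-memory 6G \
  --name MyApp \
  /home/hadoop/app.jar s3://myBucket/app.conf

The same fs.gs.* and google.cloud.auth.* properties can instead be placed in core-site.xml on the cluster if you prefer to configure them once per cluster rather than per job.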
