How do I resolve NoClassDefFoundError: org/apache/spark/sql/types/DataType in an AWS EMR cluster?


When submitting a Spark job on AWS EMR (v5.23.0), I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/types/DataType
    at etl.SparkDataProcessor$.processTransactionData(SparkDataProcessor.scala:51)
    at etl.SparkDataProcessor$.delayedEndpoint$etl$SparkDataProcessor$1(SparkDataProcessor.scala:17)
    at etl.SparkDataProcessor$delayedInit$body.apply(SparkDataProcessor.scala:11)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:383)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at etl.SparkDataProcessor$.main(SparkDataProcessor.scala:11)
    at etl.SparkDataProcessor.main(SparkDataProcessor.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.types.DataType
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    ... 18 more
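
For context, NoClassDefFoundError means the class was on the compile-time classpath but is missing at runtime. SparkDataProcessor.scala is not shown in the question, so purely as a hypothetical sketch, line 51 is presumably the first place a schema type is touched, which forces the JVM to resolve org.apache.spark.sql.types.DataType:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical reconstruction; the real SparkDataProcessor.scala is not shown.
object SparkDataProcessor extends App {
  val spark = SparkSession.builder().appName("SparkDataProcessor").getOrCreate()

  // Constructing a schema is where the classloader must resolve
  // org.apache.spark.sql.types.DataType (the superclass of StringType etc.).
  // If the spark-sql jar is absent at runtime, NoClassDefFoundError surfaces here.
  val schema = StructType(Seq(
    StructField("transactionId", StringType, nullable = false),
    StructField("amount", DoubleType, nullable = true)
  ))

  spark.read.schema(schema).json("s3://bucket/transactions/").show() // placeholder path
}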

I have tried the solutions from other Stack Overflow posts about this same problem, but still no luck. Running the application locally in IntelliJ works fine; I build the fat JAR with sbt assembly. Below is my build.sbt file.

Note: I even added assemblyExcludedJars to see if it would help. Previously it was not there.

name := "blah"
version := "0.1"
scalaVersion := "2.11.0"
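// Note: sparkVersion is not a built-in sbt key; it presumably comes from a plugin such as sbt-spark-package.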
sparkVersion := "2.4.0"

artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
  artifact.name + "_" + sv.binary + "-" + sparkVersion.value + "_" + module.revision + "." + artifact.extension
}

lazy val doobieVersion = "0.8.6"

// Dependencies
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  "org.scalatest" %% "scalatest" % "3.0.8",
  "org.apache.hadoop" % "hadoop-common" % "2.9.2" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "2.9.2" % "provided",
  "com.amazonaws" % "aws-java-sdk-s3" % "1.11.46",
  "com.google.guava" % "guava" % "19.0",
  "com.typesafe.slick" %% "slick" % "3.3.1",
  "com.typesafe.slick" %% "slick-hikaricp" % "3.3.1",
  "mysql" % "mysql-connector-java" % "6.0.6",
  "com.microsoft.sqlserver" % "mssql-jdbc" % "8.2.0.jre8",
  // "com.github.geirolz" %% "advxml" % "2.0.0-RC1",
  "org.scalaj" %% "scalaj-http" % "2.4.2",
  "org.json4s" %% "json4s-native" % "3.6.7",
  "io.jvm.uuid" %% "scala-uuid" % "0.3.1"
)

// JVM Options
javaOptions ++= Seq("-Xms512m", "-Xmx2048M", "-XX:+CMSClassUnloadingEnabled")

// SBT Test Options
fork in Test := true
testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")

assemblyExcludedJars in assembly := {
  // Exclude the Spark jars from the fat JAR (redundant while spark-core and
  // spark-sql are already "provided"; kept as an experiment, see note above)
  val cp = (fullClasspath in assembly).value
  cp.filter { f =>
    f.data.getName.contains("spark-core") ||
    f.data.getName.contains("spark-sql")
  }
}

// SBT Assembly Options
assemblyJarName in assembly := "blah.jar"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "reference.conf"              => MergeStrategy.concat
  case x                             => MergeStrategy.first
}
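
A note on the build above: because spark-core and spark-sql are marked "provided", sbt-assembly already leaves them out of the fat JAR, so something on the cluster has to supply those jars at runtime. If sbt run should also work locally with provided dependencies, a common idiom (the one suggested in the sbt-assembly docs) is a sketch like:

// Put "provided" dependencies back on the classpath for `sbt run` only;
// the assembled JAR is unaffected.
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated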
Tags: apache-spark sbt amazon-emr
1 Answer

I was able to get my program running via the SDK. I needed to make two adjustments:

1) Add an extra command to my env.sh step that updates HADOOP_CLASSPATH:

echo 'export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/lib/spark/jars/*"' | sudo tee -a /etc/hadoop/conf/hadoop-env.sh
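
This works because hadoop-env.sh is sourced before each hadoop jar invocation on the node, so the Spark jars under /usr/lib/spark/jars land on RunJar's classpath, which is exactly the classloader that was failing in the stack trace above.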

2) Update my step by removing certain arguments (commented out below); this presumably leaves spark-submit running in its default client deploy mode:

'HadoopJarStep' => array(
    'Args' => array(
        'spark-submit',
        // '--deploy-mode',
        // 'cluster',
        // 'yarn',
        '--class',
        'project.DataProcessor',
        's3://parentFolder/subFolder/Project.jar'
    ), // A list of command-line arguments passed to the JAR's main function when executed.
    'Jar' => 'command-runner.jar', // A path to a JAR file run during the step.
    // 'MainClass' => 'project.DataProcessor', // already specified in the manifest of the fat JAR
)