Why does a Spark application fail with "ClassNotFoundException: Failed to find data source: kafka" when submitted as an uber-jar built with sbt assembly?

19 votes · 7 answers

I am trying to run a sample like https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredKafkaWordCount.scala. I started from the Spark Structured Streaming Programming Guide at http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.

My code is:

package io.boontadata.spark.job1

import org.apache.spark.sql.SparkSession

object DirectKafkaAggregateEvents {
  val FIELD_MESSAGE_ID = 0
  val FIELD_DEVICE_ID = 1
  val FIELD_TIMESTAMP = 2
  val FIELD_CATEGORY = 3
  val FIELD_MEASURE1 = 4
  val FIELD_MEASURE2 = 5

  def main(args: Array[String]) {
    if (args.length < 3) {
      System.err.println(s"""
        |Usage: DirectKafkaAggregateEvents <brokers> <subscribeType> <topics>
        |  <brokers> is a list of one or more Kafka brokers
        |  <subscribeType> sample value: subscribe
        |  <topics> is a list of one or more kafka topics to consume from
        |
        """.stripMargin)
      System.exit(1)
    }

    val Array(bootstrapServers, subscribeType, topics) = args

    val spark = SparkSession
      .builder
      .appName("boontadata-spark-job1")
      .getOrCreate()

    import spark.implicits._

    // Create DataSet representing the stream of input lines from kafka
    val lines = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option(subscribeType, topics)
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    // Generate running word count
    val wordCounts = lines.flatMap(_.split(" ")).groupBy("value").count()

    // Start running the query that prints the running counts to the console
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }

}

I added the following sbt files:

build.sbt:

name := "boontadata-spark-job1"
version := "0.1"
scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.2" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.0.2" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.2" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql-kafka-0-10_2.11" % "2.0.2"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.0.2"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.1.1"
libraryDependencies += "org.apache.kafka" % "kafka_2.11" % "0.10.1.1"

// META-INF discarding
assemblyMergeStrategy in assembly := { 
   {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
   }
}

I also added project/assembly.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

This creates an uber jar that includes the non-provided jars.
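For reference, the uber jar is built with sbt assembly; by default sbt-assembly writes it under target/scala-2.11/ with the name used in the spark-submit call below:

sbt assembly
# produces target/scala-2.11/boontadata-spark-job1-assembly-0.1.jar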

I submit it with the following command line:

spark-submit boontadata-spark-job1-assembly-0.1.jar ks1:9092,ks2:9092,ks3:9092 subscribe sampletopic

But I get this runtime error:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
        at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:218)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
        at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
        at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
        at io.boontadata.spark.job1.DirectKafkaAggregateEvents$.main(StreamingJob.scala:41)
        at io.boontadata.spark.job1.DirectKafkaAggregateEvents.main(StreamingJob.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
        at scala.util.Try.orElse(Try.scala:84)
        at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:132)
        ... 18 more
16/12/23 13:32:48 INFO spark.SparkContext: Invoking stop() from shutdown hook

Is there a way to know which class is not found, so that I can search the maven.org repo for it?

The lookupDataSource source code seems to be at line 543 of https://github.com/apache/spark/blob/83a6ace0d1be44f70e768348ae6688798c84343e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala, but I couldn't find a direct link to the Kafka data source...

The complete source code is here: https://github.com/boontadata/boontadata-streams/tree/ad0d0134ddb7664d359c8dca40f1d16ddd94053f

scala apache-spark sbt sbt-assembly spark-structured-streaming
7 Answers

19 votes

I tried it like this and it worked for me. Submit it this way and let me know if you still have any issues:

./spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 --class com.inndata.StructuredStreaming.Kafka --master local[*] /Users/apple/.m2/repository/com/inndata/StructuredStreaming/0.0.1SNAPSHOT/StructuredStreaming-0.0.1-SNAPSHOT.jar
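Adapted to the setup in the question (just a sketch; it assumes the cluster runs Spark 2.0.2, matching the build.sbt above, which is why the --packages coordinate uses that version), the same approach would be:

spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2 \
  --class io.boontadata.spark.job1.DirectKafkaAggregateEvents \
  boontadata-spark-job1-assembly-0.1.jar \
  ks1:9092,ks2:9092,ks3:9092 subscribe sampletopic

With --packages, spark-submit resolves the artifact (and its transitive dependencies) from Maven Central and puts it on both the driver and executor classpaths, so its META-INF service registration is visible even if the uber jar's own copy was discarded.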

18 votes

The problem is the following part of build.sbt:

// META-INF discarding
assemblyMergeStrategy in assembly := { 
   {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
   }
}

It says that all META-INF entries should be discarded, including the "code" that makes data source aliases (e.g. kafka) work.

But the META-INF files are very important for kafka (and other streaming data source aliases) to work.

For the kafka alias to work, Spark SQL uses META-INF/services/org.apache.spark.sql.sources.DataSourceRegister with the following entry:

org.apache.spark.sql.kafka010.KafkaSourceProvider

KafkaSourceProvider is responsible for registering the kafka alias with the proper streaming data source, i.e. KafkaSource.

Just to check that the actual code is indeed available but the "code" that registers the alias is not, you can use the kafka data source by its fully-qualified name (not the alias), as follows:

spark.readStream.
  format("org.apache.spark.sql.kafka010.KafkaSourceProvider").
  load

You will see other problems because of missing options like kafka.bootstrap.servers, but... we're digressing.

A solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister entries (that creates an uber jar with all data sources, including the kafka data source):

case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
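Put into the build.sbt from the question, the merge strategy would look roughly like this (a sketch; the important detail is that the concat case comes before the generic META-INF discard case):

// META-INF discarding, but keep the data source registration files
assemblyMergeStrategy in assembly := {
  // Concatenate all DataSourceRegister service files so aliases like "kafka" keep working
  case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
  // Discard the rest of META-INF (manifests, signatures, ...)
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

After re-running sbt assembly you can check (e.g. with jar tf or unzip -l) that the jar still contains META-INF/services/org.apache.spark.sql.sources.DataSourceRegister and that it lists org.apache.spark.sql.kafka010.KafkaSourceProvider.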

3 votes

In my case I also got this error when compiling with sbt, and the cause was that sbt assembly was not including the spark-sql-kafka-0-10_2.11 artifact as part of the fat jar.

(I would very much welcome comments here. The dependency was not given a scope, so it should not be assumed to be "provided".)

So I switched to deploying a normal (slim) jar and passing the dependencies to spark-submit with the --jars parameter.

In order to gather all the dependencies in one place, you can add retrieveManaged := true to your sbt project settings, or you can issue, in the sbt console:

> set retrieveManaged := true
> package

That should bring all the dependencies into the lib_managed folder.

Then you can pass all those files to spark-submit (with a bash command you could, for example, use something like this):

cd /path/to/your/project
JARLIST=$(find lib_managed -name '*.jar' | paste -sd , -)
spark-submit [other-args] --jars "$JARLIST" target/your-app-1.0-SNAPSHOT.jar

1 vote

I am using Spark 2.1 and facing the same problem. My workaround:

1) spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0

2) cd ~/.ivy2/jars — here you go, all the needed jars are in this folder now

3) Copy all the jars in this folder to all the nodes (you can create a specific folder to hold them).

4) Add the folder name to spark.driver.extraClassPath and spark.executor.extraClassPath, e.g. spark.driver.extraClassPath=/opt/jars/*:your_other_jars

5) spark-submit --class ClassNm --Other-Options YourJar.jar now works fine.
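Equivalently (a sketch that assumes the jars were copied to /opt/jars on every node), the same classpath settings can be passed on the command line instead of editing spark-defaults.conf:

spark-submit \
  --conf spark.driver.extraClassPath=/opt/jars/* \
  --conf spark.executor.extraClassPath=/opt/jars/* \
  --class ClassNm --Other-Options YourJar.jar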


1 vote

This builds on Jacek Laskowski's answer.

Those building the project with Maven can try this. Add the line mentioned below to your maven-shade-plugin configuration:

META-INF/services/org.apache.spark.sql.sources.DataSourceRegister

I have put the plugin section of the pom file below as an example, to show where to add the line.


<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.1.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
            <resource>
              META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
            </resource>
          </transformer>
        </transformers>
        <finalName>${project.artifactId}-${project.version}-uber</finalName>
      </configuration>
    </execution>
  </executions>
</plugin>

Please excuse my formatting skills.


0 votes

I solved it by downloading the jar file to the driver system. From there, I supplied the jar to spark-submit via the --jars option.

Also worth noting: I was packaging the entire Spark 2.1 environment in my uber jar (since my cluster is still on 1.6.1), and for some reason it was not picked up when included in the uber jar.

spark-submit --jars /ur/path/spark-sql-kafka-0-10_2.11:2.1.0 --class ClassNm --Other-Options YourJar.jar


0 votes

I use Gradle as the build tool and the shadowJar plugin to create the uber jar. The solution was simply to add the file

src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister

to the project.

In this file you need to put, line by line, the class names of the DataSources you use; in this case it would be org.apache.spark.sql.kafka010.KafkaSourceProvider (you can find the class name, for example, here).

The reason is that Spark uses the Java ServiceLoader in its internal dependency management mechanism.
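A minimal sketch of what that lookup amounts to (an illustration of the ServiceLoader mechanism, not Spark's actual internal code): every META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file on the classpath is scanned, and the requested format name is matched against each provider's shortName().

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Load every data source provider registered via the services files on the classpath
val providers = ServiceLoader.load(classOf[DataSourceRegister]).asScala

// If the services file was left out of the uber jar, "kafka" is simply not found here,
// which surfaces as "Failed to find data source: kafka"
providers.find(_.shortName().equalsIgnoreCase("kafka")) match {
  case Some(p) => println(s"kafka alias resolves to ${p.getClass.getName}")
  case None    => println("kafka alias is not registered on the classpath")
}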

A complete example is here.
