I am trying to run distributed k-means with Spark MLlib's KMeans, but I get the following error:
Caused by: java.lang.ClassNotFoundException: breeze.storage.Zero$DoubleZero$
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
I am using Scala 2.13.0, Spark 3.3.0, and Breeze 2.1.0. Does anyone know how to fix this?
Here is a small example that reproduces the error:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object example {
  def main(args: Array[String]): Unit = {
    val data = List(
      Vectors.dense(Array(-1.2067543462416856, 1.3095550194913217)),
      Vectors.dense(Array(0.07214871343256794, 1.2317180069067792)),
      Vectors.dense(Array(1.2382694463625876, 1.498952083293292)),
      Vectors.dense(Array(1.4227882484992194, 1.1326606729937694)),
      Vectors.dense(Array(0.028564865614650627, 1.1697757168356784)),
      Vectors.dense(Array(1.3008028016732505, 1.3992632244080325)),
      Vectors.dense(Array(-0.4515288119480808, -0.44940482288858774)),
      Vectors.dense(Array(1.3912470190900275, -1.2895692645735999)),
      Vectors.dense(Array(-0.5498887597576244, -0.4937628444210279)),
      Vectors.dense(Array(0.03640545102051686, -1.3540754314126295)),
      Vectors.dense(Array(-1.2520223542111055, 1.2709646562853476))
    )
    Logger.getLogger("org").setLevel(Level.OFF)
    val SS = SparkSession
      .builder()
      .appName("example")
      .config("spark.master", "local[*]")
      .getOrCreate()
    val sc = SS.sparkContext
    val rdd = sc.parallelize(data)
    val kmeans = KMeans.train(rdd, 10, 100)
  }
}
This looks like a dependency problem.
In Breeze 1.3 and earlier, breeze.storage.Zero.DoubleZero is defined as
@SerialVersionUID(1L)
implicit object DoubleZero extends Zero[Double] {
  override def zero = 0.0
}
and breeze.storage.Zero.DoubleZero.getClass yields breeze.storage.Zero$DoubleZero$.
In Breeze 2.0+, however, DoubleZero is defined as
implicit val DoubleZero: Zero[Double] = Zero(0.0)
@SerialVersionUID(1L)
case class Zero[@specialized T](zero: T) extends Serializable
so breeze.storage.Zero.DoubleZero.getClass yields breeze.storage.Zero$mcD$sp (because of @specialized), and Class.forName("breeze.storage.Zero$DoubleZero$") throws ClassNotFoundException.
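To see why the two definitions produce different runtime class names, here is a minimal sketch that does not use Breeze at all. ClassNameDemo, OldStyle and NewStyle are hypothetical stand-ins for the two definitions quoted above, and @specialized is left out, so the second name will not carry the $mcD$sp suffix.

```scala
object ClassNameDemo {
  // Stand-in for the Breeze 1.x style: DoubleZero is a singleton object.
  object OldStyle {
    trait Zero[T] { def zero: T }
    implicit object DoubleZero extends Zero[Double] {
      override def zero: Double = 0.0
    }
  }

  // Stand-in for the Breeze 2.x style: DoubleZero is a val holding a case class instance.
  object NewStyle {
    case class Zero[T](zero: T)
    implicit val DoubleZero: Zero[Double] = Zero(0.0)
  }

  def main(args: Array[String]): Unit = {
    // A singleton object compiles to its own class, whose name ends in "DoubleZero$".
    println(OldStyle.DoubleZero.getClass.getName)
    // A val has no class of its own; getClass reports the class of the value it holds.
    println(NewStyle.DoubleZero.getClass.getName)
  }
}
```

Code compiled against the first style refers to the DoubleZero$ class by name, which is why it fails to link once only the second style is on the classpath.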
You should find out which of your dependencies still uses Breeze 1.3 or earlier.
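One way to check from inside the application is to ask the JVM where a class was loaded from. The following is only a sketch: ClasspathCheck and describe are hypothetical names, not part of Spark or Breeze, and it assumes it runs with the same classpath as the failing job.

```scala
object ClasspathCheck {
  // Hypothetical helper: reports whether a class can be found and, if so,
  // which jar or directory it was loaded from.
  def describe(className: String): String =
    try {
      // initialize = false: only locate the class, do not run its static initializer.
      val cls = Class.forName(className, false, getClass.getClassLoader)
      // getCodeSource is null for classes loaded by the bootstrap class loader.
      val location = Option(cls.getProtectionDomain.getCodeSource)
        .flatMap(cs => Option(cs.getLocation))
        .map(_.toString)
        .getOrElse("<bootstrap or unknown location>")
      s"$className -> $location"
    } catch {
      case _: ClassNotFoundException => s"$className -> not found on the classpath"
      case e: LinkageError           => s"$className -> failed to link (${e.getClass.getSimpleName})"
    }

  def main(args: Array[String]): Unit = {
    // The reported location shows which Breeze jar is actually on the classpath.
    println(describe("breeze.storage.Zero"))
    // Per the definitions above, this class only exists in Breeze 1.3 and earlier.
    println(describe("breeze.storage.Zero$DoubleZero$"))
  }
}
```

If the first line points at a breeze 2.x jar and the second reports "not found", the classpath matches the situation described here.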
Update: thanks for the MCVE.
Debugging shows that the NoClassDefFoundError/ClassNotFoundException is thrown here, in Spark's KryoSerializer:
private lazy val loadableSparkClasses: Seq[Class[_]] = {
  Seq(
    // ...
    "org.apache.spark.ml.linalg.SparseMatrix",
    // ...
  ).flatMap { name =>
    try {
      Some[Class[_]](Utils.classForName(name))
    } catch {
      case NonFatal(_) => None // do nothing
      case _: NoClassDefFoundError if Utils.isTesting => None // See SPARK-23422.
    }
  }
}
A simpler reproduction is
Class.forName("org.apache.spark.ml.linalg.SparseMatrix")
// java.lang.NoClassDefFoundError: breeze/storage/Zero$DoubleZero$ ...
// Caused by: java.lang.ClassNotFoundException: breeze.storage.Zero$DoubleZero$ ...
As I said, one of your dependencies uses Breeze 1.3 or earlier, even though you believe you are using Breeze 2.1.0. Specifically, org.apache.spark.ml.linalg.SparseMatrix comes from spark-mllib-local, and spark-mllib-local 3.3.0 depends on Breeze 1.2:
<dependency>
  <groupId>org.scalanlp</groupId>
  <artifactId>breeze_2.13</artifactId>
  <version>1.2</version>
  <scope>compile</scope>
  <exclusions>
    <exclusion>
      <artifactId>commons-math3</artifactId>
      <groupId>org.apache.commons</groupId>
    </exclusion>
  </exclusions>
</dependency>
So Spark 3.3.0 (and 3.3.2) is incompatible with Breeze 2.0+. Use Breeze 1.3 or earlier:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.3.0",
  "org.apache.spark" %% "spark-mllib" % "3.3.0",
  "org.scalanlp" %% "breeze" % "1.3"
)
With that change your code runs successfully.
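If your own code does not need a particular Breeze version, another option is to drop the explicit Breeze entry and let Spark bring in the version it was built against. This is an untested sketch of a build.sbt fragment; it assumes sbt, and the dependencyOverrides line only matters if some other library requests Breeze 2.x.

```scala
// build.sbt (sketch, not the build from the question)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "3.3.0",
  "org.apache.spark" %% "spark-mllib" % "3.3.0"
  // No explicit breeze entry: spark-mllib-local 3.3.0 pulls in breeze 1.2 transitively.
)

// Assumption: another dependency requests Breeze 2.x and would otherwise win conflict resolution.
// Pin Breeze to the version listed in the spark-mllib-local 3.3.0 POM shown above.
dependencyOverrides += "org.scalanlp" %% "breeze" % "1.2"
```

sbt's evicted task should then show which Breeze version was selected and which requested versions were dropped.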
https://github.com/scalanlp/breeze/issues/710
https://github.com/scalanlp/breeze/issues/690
Breeze should be upgraded to 2.0 in Spark 3.4.0.