I'm using Spark with Scala:
import org.apache.spark.mllib.feature.StandardScaler
val scaler = new StandardScaler(withMean = true, withStd = true).fit(
  labeledPoints.rdd.map(x => x.features)
)
val scaledLabledPoints = labeledPoints.map { x =>
  LabeledPoint(x.label, scaler.transform(x.features))
}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
val numIter = 20
scaledLabledPoints.cache
val linearRegressionModel = LinearRegressionWithSGD.train(scaledLabledPoints, numIter)
The following error occurs on the last line:
<console>:64: error: type mismatch;
found : org.apache.spark.sql.Dataset[org.apache.spark.mllib.regression.LabeledPoint]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
val linearRegressionModel = LinearRegressionWithSGD.train(scaledLabledPoints, numIter)
^
Why does this error occur, and how can I fix it?
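Hey, you are working with DataFrames and Datasets, but you are also using the old RDD-based API of Spark MLlib. As the error message shows, `labeledPoints` is a Dataset, so `labeledPoints.map` returns a `Dataset[LabeledPoint]`, while `LinearRegressionWithSGD.train` only accepts an `RDD[LabeledPoint]`. You should use the ML API instead: the org.apache.spark.ml package (rather than mllib).
If you still want to use the MLlib API, you can convert the Dataset to an RDD with `.rdd`: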
val linearRegressionModel = LinearRegressionWithSGD.train(scaledLabledPoints.rdd, numIter)
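For the ML API route, here is a minimal sketch of what the same scale-then-fit flow could look like with org.apache.spark.ml. The toy data, the object name `MlApiSketch`, the column names, and the local master are assumptions made for illustration; they are not taken from the question. Note also that `org.apache.spark.ml.regression.LinearRegression` does not use SGD, so this is the DataFrame-based counterpart rather than a drop-in equivalent of `LinearRegressionWithSGD`.

```scala
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object MlApiSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("ml-api-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data standing in for the question's labeledPoints.
    // Note the ml.linalg vectors (not mllib.linalg) required by the ML API.
    val df = Seq(
      (1.0, Vectors.dense(1.0, 2.0)),
      (2.0, Vectors.dense(2.0, 1.0)),
      (3.0, Vectors.dense(3.0, 4.0)),
      (4.0, Vectors.dense(4.0, 3.0))
    ).toDF("label", "features")

    // Standardize the features: fit on the DataFrame, then add a scaled column.
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(true)
      .setWithStd(true)
    val scaled = scaler.fit(df).transform(df)

    // Fit a linear regression on the scaled column; no RDD conversion is needed.
    val lr = new LinearRegression()
      .setFeaturesCol("scaledFeatures")
      .setLabelCol("label")
      .setMaxIter(20)
    val model = lr.fit(scaled)

    println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")
    spark.stop()
  }
}
```

If your data is already a Dataset of mllib `LabeledPoint`s, the vector column would first have to be converted to the ml vector type, for example with `MLUtils.convertVectorColumnsToML` from org.apache.spark.mllib.util.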