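I am trying to make a scatter plot of the 2 features produced by PCA in the Spark ml library. More precisely, I am trying to convert the result into something like this: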
_________
id | X | Y
__________
1 |0.1|0.1
2 |0.2|0.2
3 |0.4|0.4
4 |0.3|0.3
...
starting from something like this:
_________
id | pca
__________
1 |[0.1,0.1]
2 |[0.2,0.2]
3 |[0.4,0.4]
4 |[0.3,0.3]
...
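But Spark vectors don't seem to be iterable or anything like that, and I don't understand what is going on. If anyone knows the answer, that would be great.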
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.functions.udf
// Declare Seq rather than Array: Spark hands array columns to Scala UDFs as a Seq
val convertToVector = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})
val convertToDouble = udf((array: Seq[Float]) => {
  array.map(_.toDouble)
})
val ds = model.userFactors.withColumn("features", convertToDouble($"features"))
val userMatrixDs = ds.withColumn("features", convertToVector($"features"))
//val df3 = assembler.transform(df2)
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pca")
  .setK(2)
  .fit(userMatrixDs)
// Project vectors to the linear space spanned by the top 2 principal
// components, keeping the label
val result = pca.transform(userMatrixDs).select("id","pca");
result.show()
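// Attempted extraction; this does not compile, because a Column cannot be
// accessed as result.id or indexed with [0]
result.select(
  result.id,
  result.col("pca")[0].as("eigenVector1"),
  result.col("pca")[1].as("eigenVector2")
)
.show()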
Take a look at this example:
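import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}
import spark.implicits._

// Sample DataFrame with an id and two numeric columns
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, 1.0, 2.0))),
  StructType(
    List(
      StructField("id", IntegerType),
      StructField("one", DoubleType),
      StructField("two", DoubleType)
    )
  ))

// Combine the two numeric columns into a single vector column
val assembler =
  new VectorAssembler()
    .setInputCols(Array("one", "two"))
    .setOutputCol("vector")
val df0 = assembler.transform(df)

// Read each row as (Int, Vector) and unpack the vector into separate fields
df0
  .select("id", "vector")
  .as[(Int, Vector)]
  .map { case (id, vector) =>
    val arr = vector.toArray
    (id, arr(0), arr(1))
  }
  .select($"_1".as("id"), $"_2".as("pca_x"), $"_3".as("pca_y"))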
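First, I use VectorAssembler to create a vector column, then convert the result to a Dataset[(Int, Vector)] so the values can be extracted. With map you can easily manipulate each row.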
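If you are on Spark 3.0 or later (an assumption, the question does not state a version), the same split can probably be done without going through a typed Dataset, using vector_to_array from org.apache.spark.ml.functions. The snippet below is only a sketch: the result DataFrame is a hypothetical stand-in for the PCA output in the question, and the column names X and Y are taken from the desired table.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the existing session in spark-shell, otherwise starts a local one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("vector-to-columns")
  .getOrCreate()

// Hypothetical stand-in for the PCA output: an id plus a 2-element ml Vector
val result = spark.createDataFrame(Seq(
  (1, Vectors.dense(0.1, 0.1)),
  (2, Vectors.dense(0.2, 0.2))
)).toDF("id", "pca")

// Convert the Vector column to array<double>, then pick elements by position
val xy = result
  .withColumn("arr", vector_to_array(col("pca")))
  .select(
    col("id"),
    col("arr").getItem(0).as("X"),
    col("arr").getItem(1).as("Y")
  )

xy.show()
```

This stays in the DataFrame API, so no tuple encoder is needed. On older Spark versions the map-based approach above is the one to use.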