Cannot "explode" a Spark vector produced by PCA

Question

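I am trying to draw a scatter plot of the 2 features produced by PCA from the Spark ML library. More precisely, I am trying to transform the result so that it looks like this: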

_________
id | X | Y
__________
1  |0.1|0.1
2  |0.2|0.2
3  |0.4|0.4
4  |0.3|0.3
...

starting from something like this:

_________
id | pca
__________
1  |[0.1,0.1]
2  |[0.2,0.2]
3  |[0.4,0.4]
4  |[0.3,0.3]
...

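But Spark vectors do not seem to be iterable, or something along those lines. I don't understand what is going on. If anyone knows the answer, that would be great.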

import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.udf


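// Seq is the safe parameter type for array columns in Scala UDFs
// (Array can fail with a ClassCastException on older Spark versions).
val convertToVector = udf((values: Seq[Double]) => {
  Vectors.dense(values.toArray)
})

val convertToDouble = udf((values: Seq[Float]) => {
  values.map(_.toDouble)
})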

val ds = model.userFactors.withColumn("features", convertToDouble($"features"))
val userMatrixDs = ds.withColumn("features", convertToVector($"features"))

//val df3 = assembler.transform(df2)

val pca = new PCA()
    .setInputCol("features")
    .setOutputCol("pca")
    .setK(2)
    .fit(userMatrixDs)
// Project vectors to the linear space spanned by the top 2 principal
// components, keeping the label
val result = pca.transform(userMatrixDs).select("id","pca");

result.show()

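// The attempt that fails: result.id and the [0] indexing are
// PySpark-style syntax and do not compile in Scala.
result.select(
    result.id,
    result.col("pca")[0].as("eigenVector1"),
    result.col("pca")[1].as("eigenVector2")
  )
  .show()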

Tags: scala, apache-spark, vector, pca

1 Answer
Take a look at this example:

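import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}
import spark.implicits._

// A one-row example DataFrame with an id and two numeric columns.
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, 1.0, 2.0))),
  StructType(
    List(
      StructField("id", IntegerType),
      StructField("one", DoubleType),
      StructField("two", DoubleType)
    )
  )
)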

val assembler =
  new VectorAssembler()
    .setInputCols(Array("one", "two"))
    .setOutputCol("vector")

val df0 = assembler.transform(df)

df0
  .select("id", "vector")
  .as[(Int, Vector)]
  .map { case (id, vector) =>
    val arr = vector.toArray
    (id, arr(0), arr(1))
  }
  .select($"_1".as("id"), $"_2".as("pca_x"), $"_3".as("pca_y"))

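First, I use VectorAssembler to create a vector column, then I extract its values by converting the DataFrame to a Dataset[(Int, Vector)]. With map, you can easily manipulate the rows.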
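
For what it's worth, here is a minimal sketch of the same idea applied to a DataFrame shaped like the one in the question (columns "id" and "pca"). The `result` value below is a hypothetical stand-in built by hand, not the asker's real PCA output, and the local SparkSession is only there to make the snippet self-contained. The second variant uses `vector_to_array` from `org.apache.spark.ml.functions`, which assumes Spark 3.0 or later.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pca-split")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the question's `result` (columns "id" and "pca").
val result = Seq(
  (1, Vectors.dense(0.1, 0.1)),
  (2, Vectors.dense(0.2, 0.2))
).toDF("id", "pca")

// Option 1: the typed map from the answer, applied to the "pca" column.
// Tuple columns are matched by position, so "id" -> _1 and "pca" -> _2.
val viaMap = result
  .as[(Int, Vector)]
  .map { case (id, v) => (id, v(0), v(1)) }
  .toDF("id", "X", "Y")

// Option 2 (assumes Spark 3.0+): turn the vector into an array column,
// then index into the array with plain Column expressions.
val viaArray = result
  .withColumn("arr", vector_to_array(col("pca")))
  .select(col("id"), col("arr")(0).as("X"), col("arr")(1).as("Y"))

viaMap.show()
viaArray.show()
```

Option 1 goes through the typed Dataset API and works on older Spark versions too; Option 2 stays in the untyped DataFrame API, which may be more convenient if the rest of the pipeline is column-based.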
