Convert a DataFrame into a transformed DataFrame in Spark Scala

Question (votes: -2, answers: 2)

So, I have a DataFrame in Spark that looks like this:

[name,target]   <- this is the header
[ABCD,1]
[XYZA,1]
[GFFD,1]
[NAAS,1]
[ABCD,2]
[XYZA,2]
[NAAS,2]
[VDDE,2]

And I want to transform it into a DataFrame like this:

[name, count(target=1), count(target=2)]
[ABCD, 1,1]
[XYZA, 1,1]
[GFFD, 1,0]
and so on...

Is there a way to do this?

pandas scala apache-spark
2 Answers

1 vote

Here are two possible solutions.

Sample input data:

import spark.implicits._
val df = Seq(
  ("ABCD",1),
  ("XYZA",1),
  ("GFFD",1),
  ("NAAS",1),
  ("ABCD",2),
  ("XYZA",2),
  ("NAAS",2),
  ("VDDE",2),
  ("EXAMPLE", 20)
).toDF("name", "target")

df.show()

+-------+------+
|   name|target|
+-------+------+
|   ABCD|     1|
|   XYZA|     1|
|   GFFD|     1|
|   NAAS|     1|
|   ABCD|     2|
|   XYZA|     2|
|   NAAS|     2|
|   VDDE|     2|
|EXAMPLE|    20|
+-------+------+

1 - Using a map, which returns only the target values that actually occur (non-zero counts).

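// Typed view of a row, so the DataFrame can be handled as a Dataset[DataItem]
case class DataItem(name: String, target: Int)

df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups { case (nameKey, items) =>
    // Count how often each target value occurs for this name.
    // map(...) builds a strict Map; mapValues returns a lazy view on Scala 2.13.
    val occMap = items.map(_.target).toList
      .groupBy(identity)
      .map { case (target, hits) => target -> hits.size }
    (nameKey, occMap)
  }
  .toDF("name", "target_count")
  .show()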


+-------+----------------+
|   name|    target_count|
+-------+----------------+
|   VDDE|        [2 -> 1]|
|   NAAS|[2 -> 1, 1 -> 1]|
|EXAMPLE|       [20 -> 1]|
|   GFFD|        [1 -> 1]|
|   XYZA|[2 -> 1, 1 -> 1]|
|   ABCD|[2 -> 1, 1 -> 1]|
+-------+----------------+
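The per-name counting step inside mapGroups can also be tried without a Spark session. Below is a minimal sketch in plain Scala; the object name `OccurrenceCountSketch`, the helper `countTargets`, and the sample group are hypothetical and only illustrate the groupBy-then-size idea.

```scala
// Plain Scala, no Spark required: the counting step applied to one group's targets.
object OccurrenceCountSketch {
  // Group equal target values together and keep the size of each group.
  def countTargets(targets: Seq[Int]): Map[Int, Int] =
    targets.groupBy(identity).map { case (target, hits) => target -> hits.size }

  def main(args: Array[String]): Unit = {
    // Hypothetical group in which target 2 appears twice
    val counts = countTargets(Seq(1, 2, 2))
    println(counts)
  }
}
```

Target values that never occur for a name are simply absent from the map, which is why solution 1 shows only non-zero counts.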

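2 - Using a list that shows the occurrence counts (including zeros), where the i-th entry (counting from 1) is the count for target value i.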

// Same DataItem case class as in solution 1
case class DataItem(name: String, target: Int)

df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups { case (nameKey, items) =>
    val occMap = items.map(_.target).toList
      .groupBy(identity)
      .map { case (target, hits) => target -> hits.size }
    // Largest target value seen for this name (the key, not the count)
    val maxTarget = occMap.keys.max
    // The i-th entry holds the count for target i; missing targets count as 0
    val occList = (1 to maxTarget).map(i => occMap.getOrElse(i, 0))
    (nameKey, occList)
  }
  .toDF("name", "target_count")
  .show(20, false)


+-------+------------------------------------------------------------+
|name   |target_count                                                |
+-------+------------------------------------------------------------+
|VDDE   |[0, 1]                                                      |
|NAAS   |[1, 1]                                                      |
|EXAMPLE|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]|
|GFFD   |[1]                                                         |
|XYZA   |[1, 1]                                                      |
|ABCD   |[1, 1]                                                      |
+-------+------------------------------------------------------------+

0 votes

The DataFrame can be transformed with a pivot:

df
  .groupBy("name")
  .pivot("target")
  .count()
  // replace nulls with 0
  .na.fill(0)

Using the data provided by Cesar A. Mostacero, the result is:

+-------+---+---+---+
|name   |1  |2  |20 |
+-------+---+---+---+
|EXAMPLE|0  |0  |1  |
|XYZA   |1  |1  |0  |
|GFFD   |1  |0  |0  |
|VDDE   |0  |1  |0  |
|ABCD   |1  |1  |0  |
|NAAS   |1  |1  |0  |
+-------+---+---+---+
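The question asks only for the counts of target = 1 and target = 2. As a hedged sketch of how the pivot could be narrowed to that, `pivot` also accepts an explicit list of values, so other targets are ignored. The object name `PivotSketch`, the helper `pivotCounts`, and the column names `count_target_1` and `count_target_2` are assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PivotSketch {
  // Count rows per name for target values 1 and 2 only; absent combinations become 0.
  def pivotCounts(df: DataFrame): DataFrame =
    df.groupBy("name")
      .pivot("target", Seq(1, 2)) // explicit values: only these become columns
      .count()
      .na.fill(0)                 // a missing (name, target) pair yields null
      .withColumnRenamed("1", "count_target_1")
      .withColumnRenamed("2", "count_target_2")

  def main(args: Array[String]): Unit = {
    // Local session for trying the sketch outside spark-shell
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-sketch")
      .getOrCreate()
    import spark.implicits._

    // The data from the question
    val df = Seq(
      ("ABCD", 1), ("XYZA", 1), ("GFFD", 1), ("NAAS", 1),
      ("ABCD", 2), ("XYZA", 2), ("NAAS", 2), ("VDDE", 2)
    ).toDF("name", "target")

    pivotCounts(df).show()
    spark.stop()
  }
}
```

Listing the pivot values up front also spares Spark the extra pass it otherwise needs to discover the distinct target values.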