Turn keys into columns and values into rows (Map)


I have a dataframe with a Map column and an id column:

id   map_value
1    key1 -> value1, key2 -> value2
2    key1 -> value3, key2 -> value4

So I would like a dataframe like this:

id   key1     key2
1    value1   value2
2    value3   value4

Thanks for your help.

Tags: scala, dataframe
1 Answer

I assume you are talking about a Spark DataFrame. In that case, you can use the DataFrame's map method to extract the values you want. Here is an example using spark-shell (which automatically imports a number of implicits).

Note that toDF is used twice: once to load the sequence from built-in data structures, and once to rename the columns in the new DataFrame obtained from the original DataFrame's map method.

The show method is called to display the "before" and "after".

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val m = Map(1-> Map("key1" -> "v1", "key2" -> "v2"), 2 -> Map("key1" -> "v3", "key2" -> "v4"))
m: scala.collection.immutable.Map[Int,scala.collection.immutable.Map[String,String]] = Map(1 -> Map(key1 -> v1, key2 -> v2), 2 -> Map(key1 -> v3, key2 -> v4))

scala> val df = m.toSeq.toDF("id", "map_value")
df: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]

scala> df.show()
+---+--------------------+
| id|           map_value|
+---+--------------------+
|  1|[key1 -> v1, key2...|
|  2|[key1 -> v3, key2...|
+---+--------------------+ 

scala> val get_map:Function1[Row, Map[String,String]] = r => r.getAs[Map[String, String]]("map_value")
get_map: org.apache.spark.sql.Row => Map[String,String] = <function1>

scala> df.map(r => (r.getAs[Int]("id"), get_map(r).get("key1"), get_map(r).get("key2"))).toDF("id", "val1", "val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
|  1|  v1|  v2|
|  2|  v3|  v4|
+---+----+----+
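
As a side note (not part of the original answer): for a fixed set of keys, the same reshaping can be done purely with column expressions, so no Row-typed map function is needed. A minimal sketch, assuming the same df and the spark-shell implicits from the transcript above:

// Column-based alternative: Column.getItem looks up a map value by key.
df.select(
  $"id",
  $"map_value".getItem("key1").as("key1"),
  $"map_value".getItem("key2").as("key2")
).show()

This also produces the key1/key2 column names from the question directly, with no need to rename via toDF.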

Edit:

This addresses how to handle a variable number of columns. Here N is the number of columns plus one (so for 7 columns, N is 8), because Range(1, N) excludes its upper bound. Likewise, the 3 in Range(1, 3) is the number of rows plus one (there are 2 rows here).

In this case it is more convenient to use the DataFrame's select method, which avoids having to construct tuples dynamically.

scala> val N = 8
N: Int = 8

scala> val map_value:Function1[Int,Map[String,String]] = (i: Int) => Map((for (n <- Range(1, N)) yield (s"k${n}", s"v${n*i}")).toList:_*)
map_value: Int => Map[String,String] = <function1>

scala> val m = Map((for (i <- Range(1, 3)) yield (i, map_value(i))).toList:_*)
m: scala.collection.immutable.Map[Int,Map[String,String]] = Map(1 -> Map(k2 -> v2, k5 -> v5, k6 -> v6, k7 -> v7, k1 -> v1, k4 -> v4, k3 -> v3), 2 -> Map(k2 -> v4, k5 -> v10, k6 -> v12, k7 -> v14, k1 -> v2, k4 -> v8, k3 -> v6))

scala> val df0 = m.toSeq.toDF("id", "map_value")
df0: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]

scala> val column_names:List[String] = (for (n <- Range(1, N)) yield (s"map_value.k${n}")).toList
column_names: List[String] = List(map_value.k1, map_value.k2, map_value.k3, map_value.k4, map_value.k5, map_value.k6, map_value.k7)

scala> df0.select("id", column_names:_*).show()
+---+---+---+---+---+---+---+---+
| id| k1| k2| k3| k4| k5| k6| k7|
+---+---+---+---+---+---+---+---+
|  1| v1| v2| v3| v4| v5| v6| v7|
|  2| v2| v4| v6| v8|v10|v12|v14|
+---+---+---+---+---+---+---+---+
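
If the key names are not known up front, they can be discovered from the data first. A sketch assuming df0 from the transcript above and Spark 2.3+ (which provides map_keys); collecting the distinct keys to the driver is acceptable here because the key set is small:

import org.apache.spark.sql.functions.{explode, map_keys}

// Collect the distinct map keys, then build the same "map_value.kN" column names.
val keys = df0.select(explode(map_keys($"map_value")))
  .distinct()
  .collect()
  .map(_.getString(0))
  .sorted
val dynamic_columns = keys.map(k => s"map_value.$k").toList

df0.select("id", dynamic_columns: _*).show()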