How do I convert a DataFrame map column to a struct column?

Votes: -1 · Answers: 3

Suppose we have a DataFrame with a column of map type. What is the most straightforward way to convert it to a struct (or, equivalently, to define a new column with the same keys and values, but as a struct type)? See the spark-shell (2.4.5) session below for a wildly inefficient workaround:

val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")

val jsonStr = df.withColumn("jsonized", to_json($"mapColumn")).select("jsonized").collect()(0)(0).asInstanceOf[String]

spark.read.json(Seq(jsonStr).toDS()).show()
+---+---+
|bar|foo|
+---+---+
|  2|  1|
+---+---+

Now, obviously collect() is inefficient, and this is generally a terrible way to do things in Spark. Still, what is the preferred way to accomplish this conversion? Both named_struct and struct take a sequence of parameter values to build the result, but I can't find any way to "unwrap" the map's keys/values so they can be passed to those functions.
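For context, here is a minimal sketch of what the struct/named_struct route looks like when the keys are written out by hand (the output column name structColumn and the hard-coded keys foo/bar are purely illustrative); this is exactly the step that does not generalize to arbitrary map keys:

import org.apache.spark.sql.functions.expr

// named_struct needs every field name and value spelled out explicitly,
// so it only works when the map keys are known ahead of time:
df.select(
  expr("named_struct('foo', mapColumn['foo'], 'bar', mapColumn['bar'])").as("structColumn")
)
// => structColumn: struct<foo:int, bar:int>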

apache-spark apache-spark-sql
3 Answers
1 vote

I would use the struct function together with explode; see the commented sketch below the table. For the example DataFrame, df.show gives:

+--------------------+
|           mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
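A commented sketch of the explode + struct combination named above, written out in two steps (the intermediate val names are illustrative); it collapses to the one-liner quoted in the next answer, and the column names key and value are the ones explode generates for a map column:

import org.apache.spark.sql.functions.{explode, struct}

// explode turns each map entry into its own row with two columns, `key` and `value`
val entries = df.select(explode($"mapColumn"))

// struct over all columns packs the key/value pair back into a single struct column,
// giving one row per map entry: [foo, 1] and [bar, 2]
val asStruct = entries.select(struct($"*").as("struct"))
asStruct.show()

Note that this yields one row per map entry rather than a single row.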

1 vote

I see @chlebek's answer, but if the result should be kept in a single row, you can use a UDF (the case-class approach in the next answer does exactly that). For reference, the explode-based version written as a one-liner:

df.select(explode('mapColumn)).select(struct('*).as("struct"))

This yields:

+--------+
|  struct|
+--------+
|[foo, 1]|
|[bar, 2]|
+--------+

root
 |-- struct: struct (nullable = false)
 |    |-- key: string (nullable = false)
 |    |-- value: integer (nullable = false)

0
votes

Define a case class and a UDF that turns the map into an array of those structs:
scala> val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>]

scala> df.show
+--------------------+
|           mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+

scala> case class KeyValue(key: String, value: String)
defined class KeyValue

scala> val toArrayOfStructs = udf((value: Map[String, String]) => value.map {
     |   case (k, v) => KeyValue(k, v)
     | }.toArray )
toArrayOfStructs: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(key,StringType,true), StructField(value,StringType,true)),true),Some(List(MapType(StringType,StringType,true))))

scala> df.withColumn("alfa", toArrayOfStructs(col("mapColumn")))
res4: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>, alfa: array<struct<key:string,value:string>>]

scala> res4.show
+--------------------+--------------------+
|           mapColumn|                alfa|
+--------------------+--------------------+
|[foo -> 1, bar -> 2]|[[foo, 1], [bar, 2]]|
+--------------------+--------------------+


scala> res4.printSchema
root
 |-- mapColumn: map (nullable = false)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = false)
 |-- alfa: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)
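As a final, minimal sketch (assuming Spark 2.4+, and not part of the session above): the built-in map_entries function returns an array of key/value structs directly, giving the same shape as the UDF without defining a case class; the column name entries below is illustrative:

import org.apache.spark.sql.functions.map_entries

// map_entries(map) yields array<struct<key, value>> in a single row (Spark 2.4+)
// here: entries: array<struct<key:string, value:int>>
df.withColumn("entries", map_entries($"mapColumn")).printSchema()

Like the UDF approach, this keeps the whole map in one row.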