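I have a DataFrame like the one below. The suffixes *1, *2, and so on mark the level of each field in the target JSON, and "->" separates a parent node from its child node.
dataframe.show
id*1 | name*1 | ppu*1 | type*1 | toppings1->id2 | toppings1->type2 | batters1->batter2->id*3 | batter2->type3 |
---|---|---|---|---|---|---|---|
0001 | Cake | 0.55 | donut | 5001 | None | 1001 | Regular |
0001 | Cake | 0.55 | donut | 5002 | Glazed | 1002 | Chocolate |
I need the output as nested JSON, like this: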
{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters":
    {
      "batter":
        [
          { "id": "1001", "type": "Regular" },
          { "id": "1002", "type": "Chocolate" },
          { "id": "1003", "type": "Blueberry" },
          { "id": "1004", "type": "Devil's Food" }
        ]
    },
  "topping":
    [
      { "id": "5001", "type": "None" },
      { "id": "5002", "type": "Glazed" },
      { "id": "5005", "type": "Sugar" },
      { "id": "5007", "type": "Powdered Sugar" },
      { "id": "5006", "type": "Chocolate with Sprinkles" },
      { "id": "5003", "type": "Chocolate" },
      { "id": "5004", "type": "Maple" }
    ]
}
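I tried converting the DataFrame with
dataframe.toJson
but that gave me the wrong output. How can I iterate over the DataFrame and build nested JSON like the example above?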
Step 1: Combine each pair of id and type columns into a struct:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val df = ...

// Pack each id/type pair into one struct column so it can be collected as a single array element later.
val df1 = df
  .withColumn("topping", struct(col("toppings1->id2").as("id"), col("toppings1->type2").as("type")))
  .withColumn("batters", struct(col("batters1->batter2->id*3").as("id"), col("batter2->type3").as("type")))
Result:
root
|-- id*1: string (nullable = true)
|-- name*1: string (nullable = true)
|-- ppu*1: string (nullable = true)
|-- type*1: string (nullable = true)
|-- toppings1->id2: string (nullable = true)
|-- toppings1->type2: string (nullable = true)
|-- batters1->batter2->id*3: string (nullable = true)
|-- batter2->type3: string (nullable = true)
|-- topping: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- type: string (nullable = true)
|-- batters: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- type: string (nullable = true)
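To try Step 1 end to end, the elided df can be replaced with the two sample rows from the question. This is only a sketch: the local SparkSession, the app name, and the choice to treat every column as a string are my assumptions, not something the question states.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// Local session for experimentation only (assumption); a real job would reuse its own session.
val spark = SparkSession.builder().master("local[1]").appName("nested-json-sketch").getOrCreate()
import spark.implicits._

// The two sample rows from the question, all typed as strings (assumption).
val df = Seq(
  ("0001", "Cake", "0.55", "donut", "5001", "None", "1001", "Regular"),
  ("0001", "Cake", "0.55", "donut", "5002", "Glazed", "1002", "Chocolate")
).toDF("id*1", "name*1", "ppu*1", "type*1",
       "toppings1->id2", "toppings1->type2",
       "batters1->batter2->id*3", "batter2->type3")

// Same transformation as Step 1: one struct per id/type pair.
val df1 = df
  .withColumn("topping", struct(col("toppings1->id2").as("id"), col("toppings1->type2").as("type")))
  .withColumn("batters", struct(col("batters1->batter2->id*3").as("id"), col("batter2->type3").as("type")))

df1.printSchema()
```

Pasted into spark-shell or a Scala script with Spark on the classpath, this should print a schema of the same shape as the one above.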
Step 2: Group by id*1:
val df2 = df1.groupBy("id*1")
  .agg(first("name*1").as("name"),
    first("ppu*1").as("ppu"),
    first("type*1").as("type"),
    collect_list("topping").as("toppings"),
    collect_list("batters").as("batters"))
  .withColumnRenamed("id*1", "id")
Result:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- ppu: string (nullable = true)
|-- type: string (nullable = true)
|-- toppings: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)
|-- batters: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)
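One thing to watch with collect_list: it keeps every row's value, so if the same topping shows up on several rows (for example because it is paired with several batters), it will appear several times in the array. A possible variation, assuming duplicates are unwanted and element order does not matter, is collect_set. The input rows and column names below are hypothetical and only illustrate the difference.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, first, struct}

// Local session for experimentation only (assumption).
val spark = SparkSession.builder().master("local[1]").appName("dedupe-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: topping 5001 is repeated because it is paired with two batters.
val rows = Seq(
  ("0001", "Cake", "5001", "None", "1001", "Regular"),
  ("0001", "Cake", "5001", "None", "1002", "Chocolate")
).toDF("id", "name", "toppingId", "toppingType", "batterId", "batterType")

val grouped = rows
  .withColumn("topping", struct(col("toppingId").as("id"), col("toppingType").as("type")))
  .withColumn("batter", struct(col("batterId").as("id"), col("batterType").as("type")))
  .groupBy("id")
  .agg(
    first("name").as("name"),
    collect_set("topping").as("toppings"), // the repeated topping is kept once
    collect_set("batter").as("batters")    // both distinct batters are kept
  )

grouped.show(truncate = false)
```

Neither collect_list nor collect_set guarantees element order after a shuffle, so if the order of the arrays matters it would need to be imposed separately.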
Step 3: Convert to JSON:
df2.select(to_json(struct("id", "name", "ppu", "type", "toppings", "batters"))).show(truncate=false)
Result:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|to_json(struct(id, name, ppu, type, toppings, batters)) |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"id":"0001","name":"Cake","ppu":"0.55","type":"donut","toppings":[{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}],"batters":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
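The output above is close to the requested JSON but not identical: the question nests the batter array under "batters" -> "batter", names the topping array "topping", and has ppu as a number. A rough sketch of one way to reshape the grouped frame follows. The short column names (tId, bType, and so on), the local session, and the cast of ppu to double are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, first, struct}

// Local session for experimentation only (assumption).
val spark = SparkSession.builder().master("local[1]").appName("shape-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the grouped frame from Step 2, rebuilt from the sample rows with simplified column names.
val flat = Seq(
  ("0001", "Cake", "0.55", "donut", "5001", "None", "1001", "Regular"),
  ("0001", "Cake", "0.55", "donut", "5002", "Glazed", "1002", "Chocolate")
).toDF("id", "name", "ppu", "type", "tId", "tType", "bId", "bType")

val df2 = flat.groupBy("id").agg(
  first("name").as("name"),
  first("ppu").as("ppu"),
  first("type").as("type"),
  collect_list(struct(col("tId").as("id"), col("tType").as("type"))).as("toppings"),
  collect_list(struct(col("bId").as("id"), col("bType").as("type"))).as("batters"))

// Reorder and rename to follow the layout in the question.
val shaped = df2.select(
  col("id"),
  col("type"),
  col("name"),
  col("ppu").cast("double").as("ppu"),               // emit ppu as a JSON number
  struct(col("batters").as("batter")).as("batters"), // wrap the array: "batters": {"batter": [...]}
  col("toppings").as("topping"))

// toJSON serializes each row as one JSON string.
val json = shaped.toJSON.collect()
json.foreach(println)
```

Wrapping the array in a one-field struct is what produces the extra "batter" level; to_json(struct(...)) as in Step 3 would work on shaped just as well as toJSON.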