How do I fill in a default value for a missing key in JSON in a Scala DataFrame?


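I am reading JSON data into a Scala (Spark) DataFrame. I am trying to extract the customer ID, location, city, state, and status (status should be inside the address key). Because status is an optional key, it may not be present in every record. When it is missing and I try to reference status, the query fails with a schema error. How can I read a default value for it even when the key is not in the JSON?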

The status key, when present, is inside "address":
    [{
            "customerid": 123,
            "location": "NA",
            "address": {
                "city": "seattle",
                "state": "washington"
            }
        },
        {
            "customerid": 124,
            "location": "NA",
            "address": {
                "city": "seattle",
                "state": "washington"
            }
        }
    ]

Expected output columns:
customerid,location,city,state,status
scala apache-spark
1 Answer

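Check the code below. It parses the JSON into maps rather than a fixed struct, so looking up a key that is absent, such as status, returns NULL instead of failing with a schema error.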

scala> df.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|input                                                                                                                                                                                                                                                                                                                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{\n            "customerid": 123,\n            "location": "NA",\n            "address": {\n                "city": "seattle",\n                "state": "washington"\n            }\n        },\n        {\n            "customerid": 124,\n            "location": "NA",\n            "address": {\n                "city": "seattle",\n                "state": "washington"\n            }\n        }\n    ]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
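scala> import org.apache.spark.sql.functions.expr

// Output columns. address is a map, so a key that is absent (status) comes back as NULL.
val columns = Seq(
   "customerid",
   "location",
   "address['city'] AS city",
   "address['state'] AS state",
   "address['status'] AS status"
)
// Rebuild each record as a struct; the nested address JSON string is parsed into a map.
val colExprs = """named_struct(
    'customerid',
    in['customerid'],
    'location',
    in['location'],
    'address',
    from_json(in['address'], 'map<string, string>')
)"""
// Parse the top-level JSON array; every element becomes a map<string,string>.
val jsonExprs = "from_json(input, 'array<map<string,string>>')"
df
  .withColumn("input", expr(s"""transform(${jsonExprs}, in -> ${colExprs})"""))
  .selectExpr("inline(input)") // one row per array element, one column per struct field
  .selectExpr(columns: _*)
  .show(false)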
+----------+--------+-------+----------+------+
|customerid|location|city   |state     |status|
+----------+--------+-------+----------+------+
|123       |NA      |seattle|washington|NULL  |
|124       |NA      |seattle|washington|NULL  |
+----------+--------+-------+----------+------+
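The output above leaves status as NULL when the key is absent. If an actual default value is wanted, one option is to wrap the lookup in coalesce. The following is a minimal sketch of that idea, not the answer's own method: it swaps the map-based parsing for an explicit schema in which status is declared as an ordinary nullable field. The default "unknown", the "active" status on the second record, the variable names, and the choice of LongType for customerid are all assumptions made for illustration. It is written script-style, to be pasted into spark-shell or run with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, explode, from_json, lit}
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructType}

// In spark-shell, getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder().master("local[1]").appName("status-default").getOrCreate()
import spark.implicits._

// Two sample records; the "status" on the second one is hypothetical,
// added only to show a record where the optional key is present.
val raw =
  """[{"customerid":123,"location":"NA","address":{"city":"seattle","state":"washington"}},""" +
  """{"customerid":124,"location":"NA","address":{"city":"seattle","state":"washington","status":"active"}}]"""
val df = Seq(raw).toDF("input")

// Explicit schema: status is listed even though it is optional in the data.
// Fields missing from a record are parsed as NULL rather than causing an error.
val addressSchema = new StructType()
  .add("city", StringType)
  .add("state", StringType)
  .add("status", StringType)
val recordSchema = ArrayType(
  new StructType()
    .add("customerid", LongType)
    .add("location", StringType)
    .add("address", addressSchema)
)

val result = df
  .select(explode(from_json($"input", recordSchema)).as("rec"))
  .select(
    $"rec.customerid",
    $"rec.location",
    $"rec.address.city".as("city"),
    $"rec.address.state".as("state"),
    // Replace a NULL status with the assumed default "unknown".
    coalesce($"rec.address.status", lit("unknown")).as("status")
  )

result.show(false)
```

With this sample input, customer 123 should come out with status "unknown" and customer 124 with "active". The same coalesce wrapper could also be applied to the map-based version above, for example as "coalesce(address['status'], 'unknown') AS status" in the columns list.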