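I am reading JSON data in my Scala code. I am trying to extract the customer ID, location, city, state, and status (status is supposed to sit inside the address key). Because status is an optional key, it may not appear in every record of the JSON data. In that case, when I try to reference status, the read fails with a schema error. How can I read it with some default value even when it is missing from the JSON?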
Sample input (the optional status key, when present, is supposed to be inside "address"):
[{
"customerid": 123,
"location": "NA",
"address": {
"city": "seattle",
"state": "washington"
}
},
{
"customerid": 124,
"location": "NA",
"address": {
"city": "seattle",
"state": "washington"
}
}
]
Expected output columns:
customerid,location,city,state,status
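Check the code below. It parses each record into a map instead of a fixed struct, so a missing status key comes back as NULL rather than raising a schema error.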
scala> df.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|input |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{\n "customerid": 123,\n "location": "NA",\n "address": {\n "city": "seattle",\n "state": "washington"\n }\n },\n {\n "customerid": 124,\n "location": "NA",\n "address": {\n "city": "seattle",\n "state": "washington"\n }\n }\n ]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
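scala> import org.apache.spark.sql.functions.expr
scala> val columns = Seq(
  "customerid",
  "location",
  "address['city'] AS city",
  "address['state'] AS state",
  "address['status'] AS status" // a key missing from the map yields NULL
)
// Rebuild each record as a struct; the nested address arrives as a JSON string
// and is parsed into its own map here.
val colExprs = """named_struct(
  'customerid',
  in['customerid'],
  'location',
  in['location'],
  'address',
  from_json(in['address'], 'map<string, string>')
)"""
// Parse the top-level array into an array of string-to-string maps.
val jsonExprs = "from_json(input, 'array<map<string,string>>')"
df
  .withColumn("input", expr(s"""transform(${jsonExprs}, in -> ${colExprs})"""))
  .selectExpr("inline(input)") // one row per array element, one column per struct field
  .selectExpr(columns:_*)
  .show(false)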
+----------+--------+-------+----------+------+
|customerid|location|city |state |status|
+----------+--------+-------+----------+------+
|123 |NA |seattle|washington|NULL |
|124 |NA |seattle|washington|NULL |
+----------+--------+-------+----------+------+
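The question also asks for a default value instead of NULL. One possible way to get that, sketched here as an alternative rather than as part of the answer above, is to declare an explicit schema in which status is a nullable field and wrap the column in coalesce. The schema below, the names addressSchema, recordSchema, and result, and the default "unknown" are all assumptions made for illustration; the input is the sample from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, explode, from_json, lit}
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructType}

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder().master("local[*]").appName("optional-status").getOrCreate()
import spark.implicits._

// Sample input from the question: neither record carries a status key.
val raw =
  """[{"customerid": 123, "location": "NA", "address": {"city": "seattle", "state": "washington"}},
    | {"customerid": 124, "location": "NA", "address": {"city": "seattle", "state": "washington"}}]""".stripMargin
val df = Seq(raw).toDF("input")

// Assumed schema: status is declared (fields added this way are nullable),
// so records without it parse to NULL instead of failing.
val addressSchema = new StructType()
  .add("city", StringType)
  .add("state", StringType)
  .add("status", StringType)
val recordSchema = ArrayType(
  new StructType()
    .add("customerid", LongType)
    .add("location", StringType)
    .add("address", addressSchema))

val result = df
  .select(explode(from_json(col("input"), recordSchema)).as("rec"))
  .select(
    col("rec.customerid"),
    col("rec.location"),
    col("rec.address.city").as("city"),
    col("rec.address.state").as("state"),
    // "unknown" is a hypothetical default; substitute whatever fits the data.
    coalesce(col("rec.address.status"), lit("unknown")).as("status"))

result.show(false)
```

With the map-based approach from the answer, the same default could presumably be applied by changing the last entry of columns to "coalesce(address['status'], 'unknown') AS status".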