I have the following JSON string as a column in a PySpark DataFrame.
{
  "result": {
    "version": "1.2",
    "timeStamp": "2023-08-14 14:00:12",
    "description": "",
    "data": {
      "DateTime_Received": "2023-08-14T14:01:10.4516457+01:00",
      "DateTime_Actual": "2023-08-14T14:00:12",
      "OtherInfo": null,
      "main": [
        {
          "Status": 0,
          "ID": 111,
          "details": null
        }
      ]
    },
    "tn": "aaa"
  }
}
I want to break the above out into multiple columns without hard-coding the schema.
I tried using schema_of_json to generate the schema from the JSON string.
from pyspark.sql import functions as F
# Replace null values with an empty JSON object
df_decoded = df_decoded.withColumn("json_column", F.when(F.col("value").isNotNull(), F.col("value")).otherwise("{}"))
# Infer the schema using schema_of_json
json_schema = df_decoded.select(F.schema_of_json(F.col("json_column"))).collect()[0][0]
df_decoded is my DataFrame and value is the name of my JSON string column.
However, it gives me the following error:
AnalysisException: cannot resolve 'schema_of_json(json_column)' due to data type mismatch: The input json should be a foldable string expression and not null; however, got json_column.;
Does this get you started?
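import json
import pandas as pd

j = '''{
  "result": {
    "version": "1.2",
    "timeStamp": "2023-08-14 14:00:12",
    "description": "",
    "data": {
      "DateTime_Received": "2023-08-14T14:01:10.4516457+01:00",
      "DateTime_Actual": "2023-08-14T14:00:12",
      "OtherInfo": null,
      "main": [
        {
          "Status": 0,
          "ID": 111,
          "details": null
        }
      ]
    },
    "tn": "aaa"
  }
}'''

# Parse the JSON text into nested Python dicts and lists
text_json = json.loads(j)
# Fall back to an empty dict so the .get() call below still works if "result" is missing
result = text_json.get("result", {})
print(result.get("version", ""))
# Collect the top-level fields; "data" stays a nested dict in a single cell
results = [result["version"], result["timeStamp"], result["description"], result["data"], result["tn"]]
# transpose() pivots the values so that each field ends up in its own column
df = pd.DataFrame(results).transpose()
print(df)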
I don't have a real application to test this against. The .transpose() call is what pivots the values into columns.
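The snippet above still names each field by hand. If the goal is to avoid hard-coding keys entirely, a rough sketch using only the standard library could walk the parsed JSON recursively and build dotted column names. The flatten helper, its path format (dots for keys, [i] for list positions), and the shortened sample string are assumptions for illustration, not part of any library.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            # Join nested keys with a dot, e.g. "result.data.OtherInfo"
            flat.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            # Mark list positions with an index, e.g. "main[0]"
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        # Scalars (including None) become leaf values
        flat[prefix] = value
    return flat


# Shortened version of the JSON from the question
raw = '{"result": {"version": "1.2", "data": {"OtherInfo": null, "main": [{"Status": 0, "ID": 111}]}, "tn": "aaa"}}'
row = flatten(json.loads(raw))
print(row)
```

Each key of row can then serve as a column name, for example via pd.DataFrame([row]). Empty dicts and empty lists produce no keys in this sketch, so they would need separate handling if they matter.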