Safely accessing a non-existent nested JSON attribute in PySpark

Question

After reading from a JSON file, I am trying to create a column in PySpark with the code below:

from pyspark.sql import functions as F
from pyspark.sql.functions import col, when

(
    observation_df
    .withColumn("contained_observations", F.explode(col("contained")))
    .withColumn("code", col("contained_observations.code"))
    .withColumn("code_text", col("code.text"))
    # This is the line that raises the AnalysisException
    .withColumn("coding", when(col("code").isNotNull(), col("code").getField("coding")).otherwise(None))
    .select(
        col("code"),
        col("code_text"),
        # col("coding"),
    )
    .printSchema()
)

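The "coding" field does not exist inside the "code" struct. However, even with the check wrapped around getField(), it still gives the following error:

AnalysisException: No such struct field coding in text

How can I include the column in the DataFrame with no value even when it is absent from the input?

I also tried the following two versions: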

.withColumn("coding", when(col("code").isNotNull(), col("code").getField("coding")).otherwise(None))
.withColumn("coding",  col("code").getField("coding").isNotNull())

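I do not provide a schema when reading the JSON, because the schema is not fixed and is not known in advance, so Spark infers it.

The schema is: root -> code (struct) -> text (string)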

Answer 1

The solution is to explicitly check the struct's fields attribute, as shown below:

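from pyspark.sql.functions import col, lit

# Column references are resolved against the schema at analysis time,
# so test for the field in plain Python before building the expression.
if "coding" in [x.name for x in observation_df.schema["code"].dataType.fields]:
    observation_df = observation_df.withColumn("coding", col("code").getField("coding"))
else:
    observation_df = observation_df.withColumn("coding", lit(None))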
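
The same schema check can be generalized to arbitrary dotted paths. The following is only a sketch, not part of the original answer: the helper names has_nested_field and col_or_null are hypothetical, the walk runs over the dictionary returned by df.schema.jsonValue(), and it assumes that field names contain no dots. The "string" default for the typed null is also an assumption, so pass whatever type the field is expected to have.

```python
def has_nested_field(schema_json, dotted_path):
    """Return True if dotted_path exists in a schema dict
    shaped like the output of StructType.jsonValue()."""
    node = schema_json
    for name in dotted_path.split("."):
        # Step through array wrappers so a path can cross array<struct> columns
        while isinstance(node, dict) and node.get("type") == "array":
            node = node["elementType"]
        # Only structs have named children; primitive types are plain strings
        if not isinstance(node, dict) or node.get("type") != "struct":
            return False
        matches = [f for f in node["fields"] if f["name"] == name]
        if not matches:
            return False
        node = matches[0]["type"]
    return True


def col_or_null(df, dotted_path, data_type="string"):
    """Return the nested column if the inferred schema has it,
    otherwise a null literal cast to data_type (assumed type)."""
    # Imported lazily so the schema walk above works without pyspark installed
    from pyspark.sql import functions as F

    if has_nested_field(df.schema.jsonValue(), dotted_path):
        return F.col(dotted_path)
    return F.lit(None).cast(data_type)
```

A call could look like observation_df.withColumn("coding", col_or_null(observation_df, "code.coding")). The DataFrame passed in has to be the one that already contains the "code" column, because the check reads that DataFrame's schema. Casting the null gives the column a concrete type instead of the untyped null that lit(None) produces on its own.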
