Extracting JSON data from a Spark SQL StringType column

Problem description

There is a Hive table with a single string-typed column.

hive> desc logical_control.test1;
OK
test_field_1          string                  test field 1

val df2 = spark.sql("select * from logical_control.test1")

df2.printSchema()
root
 |-- test_field_1: string (nullable = true)

df2.show(false)
+------------------------+
|test_field_1            |
+------------------------+
|[[str0], [str1], [str2]]|
+------------------------+

How can it be converted into a structured column like the one below?

root
 |-- A: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- S: string (nullable = true)

I tried to recover it using the initial schema, the one the data was structured with before being written to HDFS, but json_data comes out null.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema = StructType(
  Seq(
    StructField("A", ArrayType(
      StructType(
        Seq(StructField("S", StringType, nullable = true))
      )
    ), nullable = true)
  )
)

val df3 = df2.withColumn("json_data", from_json(col("test_field_1"), schema))

df3.printSchema()
root
 |-- test_field_1: string (nullable = true)
 |-- json_data: struct (nullable = true)
 |    |-- A: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- S: string (nullable = true)

df3.show(false)
+------------------------+---------+
|test_field_1            |json_data|
+------------------------+---------+
|[[str0], [str1], [str2]]|null     |
+------------------------+---------+
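A note on the null result: with its default options, from_json returns null whenever the input string cannot be parsed as JSON matching the given schema, and [[str0], [str1], [str2]] is not valid JSON at all. As a minimal sketch, assuming the data had instead been written as real JSON of that shape (the literal below is a hypothetical example), the same schema and from_json call would work:

import spark.implicits._
import org.apache.spark.sql.functions.{col, from_json}

// Hypothetical example record: the same structure, serialized as valid JSON.
val jsonDf = Seq("""{"A":[{"S":"str0"},{"S":"str1"},{"S":"str2"}]}""").toDF("test_field_1")

// Reusing the schema defined above, json_data is now populated instead of null.
jsonDf.withColumn("json_data", from_json(col("test_field_1"), schema)).show(false)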
Tags: json, scala, apache-spark, etl
1 Answer

If the structure of test_field_1 is fixed and you don't mind "parsing" the field yourself, you can do the conversion with a udf:

import org.apache.spark.sql.functions.{col, udf}

case class S(S: String)

// Strip the square brackets, split on the commas, and wrap each trimmed token in S.
def toArray: String => Array[S] = _.replaceAll("[\\[\\]]", "").split(",").map(s => S(s.trim))
val toArrayUdf = udf(toArray)

val df3 = df2.withColumn("json_data", toArrayUdf(col("test_field_1")))
df3.printSchema()
df3.show(false)

This prints:

root
 |-- test_field_1: string (nullable = true)
 |-- json_data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- S: string (nullable = true)

+------------------------+------------------------+
|test_field_1            |json_data               |
+------------------------+------------------------+
|[[str0], [str1], [str2]]|[[str0], [str1], [str2]]|
+------------------------+------------------------+

The tricky part is creating the second level of the structure (element: struct). I used the case class S to create that structure.
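To make that mapping explicit, here is a small verification sketch, assuming a spark-shell session where the case class S above is in scope: Spark derives a struct with a single string field named S from the case class, and the UDF's Array[S] return type therefore becomes array<struct<S: string>>.

import org.apache.spark.sql.Encoders

// The schema Spark derives from the case class: a single string field named S.
println(Encoders.product[S].schema.treeString)
// root
//  |-- S: string (nullable = true)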
