There is a Hive table with a single string-typed column.
hive> desc logical_control.test1;
OK
test_field_1            string          test field 1
val df2 = spark.sql("select * from logical_control.test1")
df2.printSchema()
root
 |-- test_field_1: string (nullable = true)
df2.show(false)
+------------------------+
|test_field_1            |
+------------------------+
|[[str0], [str1], [str2]]|
+------------------------+
How can I convert it into a structured column like the one below?
root
 |-- A: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- S: string (nullable = true)
I tried to recover it using the original schema that the data was structured with before it was written to HDFS, but json_data comes back null.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema = StructType(
  Seq(
    StructField("A", ArrayType(
      StructType(
        Seq(
          StructField("S", StringType, nullable = true)
        )
      )
    ), nullable = true)
  )
)
val df3 = df2.withColumn("json_data", from_json(col("test_field_1"), schema))
df3.printSchema()
root
 |-- test_field_1: string (nullable = true)
 |-- json_data: struct (nullable = true)
 |    |-- A: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- S: string (nullable = true)
df3.show(false)
+------------------------+---------+
|test_field_1            |json_data|
+------------------------+---------+
|[[str0], [str1], [str2]]|null     |
+------------------------+---------+
from_json returns null here because the stored string [[str0], [str1], [str2]] is not valid JSON (the tokens are unquoted), so the parse fails. If the structure of test_field_1 is fixed and you don't mind "parsing" the field yourself, you can convert it with a udf:
import org.apache.spark.sql.functions.{col, udf}

case class S(S: String)

def toArray: String => Array[S] =
  _.replaceAll("[\\[\\]]", "").split(",").map(s => S(s.trim))

val toArrayUdf = udf(toArray)
val df3 = df2.withColumn("json_data", toArrayUdf(col("test_field_1")))
df3.printSchema()
df3.show(false)
prints
root
 |-- test_field_1: string (nullable = true)
 |-- json_data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- S: string (nullable = true)
+------------------------+------------------------+
|test_field_1 |json_data |
+------------------------+------------------------+
|[[str0], [str1], [str2]]|[[str0], [str1], [str2]]|
+------------------------+------------------------+
The tricky part is creating the second level of the structure (element: struct). I created this level with the case class S.
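Since the udf wraps a plain Scala function, the parsing logic can be checked without a SparkSession. A minimal sketch, applying the same function to the sample value from the question:

```scala
case class S(S: String)

// Strip the square brackets, split on commas, trim each token,
// and wrap every token in the case class S.
def toArray: String => Array[S] =
  _.replaceAll("[\\[\\]]", "").split(",").map(s => S(s.trim))

val parsed = toArray("[[str0], [str1], [str2]]")
// parsed contains S(str0), S(str1), S(str2)
```

Note that this simple split would break if the string values themselves contained commas or square brackets; it relies on the layout of test_field_1 being fixed, as stated above.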