I'm having trouble creating nested objects with Apache Spark and Python.
I have the following DataFrame:
dataWithGps
root
|-- vehicle_id: string (nullable = true)
|-- organization_id: string (nullable = true)
|-- start_time: string (nullable = true)
|-- timestamp_gps: timestamp (nullable = true)
|-- diagnostic_trouble_code: string (nullable = true)
|-- gps: struct (nullable = false)
| |-- lat: double (nullable = true)
| |-- long: double (nullable = true)
I want to transform it so that it has the following schema:
nestedData
root
|-- organization_id: string (nullable = true)
|-- start_time: string (nullable = true)
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- timestamp_gps: string (nullable = true)
| | |-- vehicles: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- vehicle_id: string (nullable = true)
| | | | |-- gps: struct (nullable = true)
I have the following code to do this, but it doesn't work as I expected:
nestedFullDataRow = dataWithGps.groupBy("organization_id", "vehicle_id", "timestamp_gps", "start_time").agg(F.collect_list(F.struct(F.col("timestamp_gps"))).alias("data")) \
    .groupBy("vehicle_id").agg(F.collect_list(F.struct(F.col("vehicle_id"))).alias("vehicles"))
nestedFullDataRow.printSchema()
Unexpectedly, I get the following schema:
root
|-- vehicle_id: string (nullable = true)
|-- vehicles: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- vehicle_id: string (nullable = true)
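Use a transformation like this:
import pyspark.sql.functions as F

df = df.withColumn(
    "data",
    F.array(
        F.struct(
            # the timestamp column is named timestamp_gps in the schema above
            F.col("timestamp_gps"),
            F.array(
                F.struct(F.col("vehicle_id"), F.col("gps"))
            ).alias("vehicles"),
        )
    ),
)
and drop the other columns you consider unnecessary.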