How to create nested objects with Apache Spark and Python


I am having trouble creating nested objects with Apache Spark and Python.

I have the following DataFrame:

dataWithGps
root
 |-- vehicle_id: string (nullable = true)
 |-- organization_id: string (nullable = true)
 |-- start_time: string (nullable = true)
 |-- timestamp_gps: timestamp (nullable = true)
 |-- diagnostic_trouble_code: string (nullable = true)
 |-- gps: struct (nullable = false)
 |    |-- lat: double (nullable = true)
 |    |-- long: double (nullable = true)

I want to transform it so that it has the following schema:

nestedData
root
 |-- organization_id: string (nullable = true)
 |-- start_time: string (nullable = true)
 |-- data: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- timestamp_gps: string (nullable = true)
 |    |    |-- vehicles: array (nullable = false)
 |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |-- vehicle_id: string (nullable = true)
 |    |    |    |    |-- gps: struct (nullable = true)

I have the following code to do this, but it does not work as I expected:

    nestedFullDataRow = dataWithGps.groupby("organization_id", "vehicle_id", "timestamp_gps", "start_time").agg(F.collect_list(F.struct(F.col("timestamp_gps"))).alias("data")) \
        .groupBy("vehicle_id").agg(F.collect_list(F.struct(F.col("vehicle_id"))).alias("vehicles"))
    nestedFullDataRow.printSchema()

Instead, I unexpectedly get the following schema:

root
 |-- vehicle_id: string (nullable = true)
 |-- vehicles: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- vehicle_id: string (nullable = true)

1 Answer

Use a transformation like the following:

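import pyspark.sql.functions as F

# df is assumed to be the dataWithGps DataFrame from the question
df = df.withColumn(
    "data",
    F.array(
        F.struct(
            # the question's column is named timestamp_gps, not timestamp
            F.col("timestamp_gps"),
            F.array(
                F.struct(F.col("vehicle_id"), F.col("gps"))
            ).alias("vehicles"),
        )
    ),
)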

Then drop any other columns you do not need.
