Flattening a Parquet file with nested arrays and StructType in Spark Scala


I want to dynamically flatten a Parquet file in Spark using Scala, and I am looking for an efficient way to do it.

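The Parquet file contains multiple arrays and struct types nested at several levels of depth. The schema may change in the future, so I cannot hard-code any attribute names. The desired end result is a flattened, delimited file.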

What would a working solution using flatMap and a recursive explode look like?

Example schema:

 |-- exCar: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- exCarOne: string (nullable = true)
 |    |    |-- exCarTwo: string (nullable = true)
 |    |    |-- exCarThree: string (nullable = true)
 |-- exProduct: string (nullable = true)
 |-- exName: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- exNameOne: string (nullable = true)
 |    |    |-- exNameTwo: string (nullable = true)
 |    |    |-- exNameThree: string (nullable = true)
 |    |    |-- exNameFour: string (nullable = true)
 |    |    |-- exNameCode: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- exNameCodeOne: string (nullable = true)
 |    |    |    |    |-- exNameCodeTwo: string (nullable = true)
 |    |    |-- exColor: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- exColorOne: string (nullable = true)
 |    |    |    |    |-- exColorTwo: string (nullable = true)
 |    |    |    |    |-- exWheelColor: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- exWheelColorOne: string (nullable = true)
 |    |    |    |    |    |    |-- exWheelColorTwo: string (nullable = true)
 |    |    |    |    |    |    |-- exWheelColorThree: string (nullable = true)
 |    |    |-- exGlass: string (nullable = true)
 |-- exDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- exBill: string (nullable = true)
 |    |    |-- exAccount: string (nullable = true)
 |    |    |-- exLoan: string (nullable = true)
 |    |    |-- exRate: string (nullable = true)

Desired output schema:

 exCar.exCarOne
 exCar.exCarTwo
 exCar.exCarThree
 exProduct
 exName.exNameOne
 exName.exNameTwo
 exName.exNameThree
 exName.exNameFour
 exName.exNameCode.exNameCodeOne
 exName.exNameCode.exNameCodeTwo
 exName.exColor.exColorOne
 exName.exColor.exColorTwo
 exName.exColor.exWheelColor.exWheelColorOne
 exName.exColor.exWheelColor.exWheelColorTwo
 exName.exColor.exWheelColor.exWheelColorThree
 exName.exGlass
 exDetails.exBill
 exDetails.exAccount
 exDetails.exLoan
 exDetails.exRate
scala apache-spark apache-spark-sql parquet flatten
1 Answer

There are two things you need to do:

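1) Explode the array columns, working from the outermost array inward: explode exName (which gives you one row per element, each holding a struct that contains exColor), then explode exColor, which in turn gives you access to exWheelColor, and so on.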

2) Project the fields of the nested structs into separate columns.
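
The two steps above can be combined into one schema-driven loop, so no field name has to be hard-coded. What follows is only a minimal sketch, not a tested production solution. The object name `FlattenSketch` and the helpers `flatten` and `quoted` are made up for illustration. It uses `explode_outer` (available since Spark 2.2) instead of plain `explode` so that rows with null or empty arrays are kept; that is my substitution, not something the question asked for. It assumes column names contain no backticks, and it leaves map columns untouched.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical helper object; the names are illustrative only.
object FlattenSketch {

  // Backtick-quote a name so that dots inside it are treated as part of
  // the column name rather than as struct field access.
  private def quoted(name: String): String = s"`$name`"

  // Repeats two steps until the schema has no top-level array or struct left:
  //   1) explode ONE array column (Spark allows one generator per select),
  //   2) otherwise promote every struct field to a column named "parent.child".
  def flatten(df: DataFrame): DataFrame = {
    val fields = df.schema.fields.toSeq

    fields.find(_.dataType.isInstanceOf[ArrayType]) match {
      case Some(arrayField) =>
        // Step 1: one output row per array element; the column keeps its name.
        val cols: Seq[Column] = fields.map { f =>
          if (f.name == arrayField.name) explode_outer(col(quoted(f.name))).as(f.name)
          else col(quoted(f.name))
        }
        flatten(df.select(cols: _*))

      case None if fields.exists(_.dataType.isInstanceOf[StructType]) =>
        // Step 2: project the nested fields into separate columns.
        val cols: Seq[Column] = fields.flatMap { f =>
          f.dataType match {
            case s: StructType =>
              s.fieldNames.toSeq.map { child =>
                col(s"${quoted(f.name)}.${quoted(child)}").as(s"${f.name}.$child")
              }
            case _ => Seq(col(quoted(f.name)))
          }
        }
        flatten(df.select(cols: _*))

      case None => df // nothing left to flatten
    }
  }
}

// Possible usage (both paths and the separator are placeholders):
//   val flat = FlattenSketch.flatten(spark.read.parquet("/path/to/input"))
//   flat.write.option("header", "true").option("sep", "|").csv("/path/to/output")
```

One thing to keep in mind: exCar, exName and exDetails are sibling arrays, so exploding all of them yields the cross product of their elements for each input row. If that row multiplication is not what you want, flatten each top-level array into its own DataFrame instead of calling this on the whole file.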
