I am processing XML files with Spark in Java. The spark-xml package from Databricks is used to read the XML files into a DataFrame.
A sample XML file is:
<RowTag>
<id>1</id>
<name>john</name>
<expenses>
<travel>
<details>
<date>20191203</date>
<amount>400</amount>
</details>
</travel>
</expenses>
</RowTag>
<RowTag>
<id>2</id>
<name>joe</name>
<expenses>
<food>
<details>
<date>20191204</date>
<amount>500</amount>
</details>
</food>
</expenses>
</RowTag>
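For reference, files like this can be loaded with spark-xml roughly as follows (a sketch; the path and SparkSession setup are assumed):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("xml-demo").getOrCreate();

// "rowTag" tells spark-xml which XML element marks one record/row.
Dataset<Row> df = spark.read()
    .format("com.databricks.spark.xml")
    .option("rowTag", "RowTag")
    .load("path/to/*.xml");
```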
The resulting Spark Dataset<Row> df is shown below; each row represents one XML file.
+--+------+----------------+
|id| name |expenses        |
+--+------+----------------+
|1 | john |[[20191203,400]]|
|2 | joe  |[[20191204,500]]|
+--+------+----------------+
df.printSchema();
shows the following:
root
|-- id: int (nullable = true)
|-- name: string(nullable = true)
|-- expenses: struct (nullable = true)
| |-- travel: struct (nullable = true)
| | |-- details: struct (nullable = true)
| | | |-- date: string (nullable = true)
| | | |-- amount: int (nullable = true)
| |-- food: struct (nullable = true)
| | |-- details: struct (nullable = true)
| | | |-- date: string (nullable = true)
| | | |-- amount: int (nullable = true)
The desired output DataFrame is:
+--+------+-------------+
|id| name |expenses_date|
+--+------+-------------+
|1 | john |20191203     |
|2 | joe  |20191204     |
+--+------+-------------+
What I have tried:
spark.udf().register("getDate", (UDF1<Row, String>) (Row row) -> {
    return row.getStruct(0).getStruct(0).getAs("date").toString();
}, DataTypes.StringType);
df.select(callUDF("getDate",df.col("expenses")).as("expenses_date")).show();
But it does not work, because row.getStruct(0) routes to <travel>, and for the row joe there is no <travel> tag under <expenses>, so it throws a java.lang.NullPointerException. What I want is a generic solution: for each row, it should automatically pick whichever child tag is present, e.g. row.getStruct(0) routes to <travel> for the john row and to <food> for joe.

So my question is: how should I rewrite the UDF to achieve this?

Thanks!! :)
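One way to make the UDF generic is to scan the fields of the expenses struct and take the first non-null one (a sketch, assuming each category struct wraps a single details struct, as in the schema above):

```java
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

spark.udf().register("getDate", (UDF1<Row, String>) (Row expenses) -> {
    // expenses has one field per category (travel, food, ...);
    // only the category actually present in the XML is non-null.
    for (int i = 0; i < expenses.size(); i++) {
        if (!expenses.isNullAt(i)) {
            Row details = expenses.getStruct(i).getStruct(0);
            return details.getAs("date");
        }
    }
    return null; // no category present at all
}, DataTypes.StringType);
```

This avoids the NullPointerException because it only descends into structs that are non-null, and it keeps working if more category tags are added later.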
df.selectExpr("id", "name", "COALESCE(`expenses`.`food`.`details`.`date`, `expenses`.`travel`.`details`.`date`) AS expenses_date" ).show()
Output:
+---+----+-------------+
| id|name|expenses_date|
+---+----+-------------+
|  1|john|     20191203|
|  2| joe|     20191204|
+---+----+-------------+
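The same logic can be written with the typed Column API; COALESCE returns its first non-null argument, so it picks travel's date for john and food's for joe (a sketch; add further category paths as extra arguments if new tags appear):

```java
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;

df.select(
        col("id"),
        col("name"),
        coalesce(
            col("expenses.food.details.date"),
            col("expenses.travel.details.date")
        ).as("expenses_date"))
  .show();
```

Note that this approach requires listing every possible category path explicitly, whereas a struct-scanning UDF handles unknown categories automatically.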