Trying to parse an XML column stored with Spark SQL 2.3.0. The XML string looks like:
<foo>
<bar>
<sum>123</sum>
<periods>
<start>1</start>
<end>2</end>
</periods>
</bar>
<bar>
<sum>456</sum>
<periods></periods>
</bar>
<bar>
<sum>789</sum>
<periods>
<start>3</start>
<end>4</end>
</periods>
</bar>
</foo>
I want to parse and transpose the "sum", "start", and "end" values by creating Spark arrays, getting each position via posexplode, and then fetching the array elements by position. Spark code:
val Df1 = spark.sql("""
select
xpath(clob_data, '//foo/bar/sum/text()') as sum,
xpath(clob_data, '//foo/bar/periods/start/text()') as start,
xpath(clob_data, '//foo/bar/periods/end/text()') as end
from SourceDf
""")
Df1.createOrReplaceTempView("Df1")
val Df2 = spark.sql("""
select
pos,
sum[pos] as sum,
start[pos] as start,
end[pos] as end
from Df1
lateral view posexplode(sum)
exploded_id as pos, value
""").show
Expected output:
pos | sum | start | end
-------------------------------------
0 | 123 | 1 | 2
1 | 456 | null | null
2 | 789 | 3 | 4
Actual output:
pos | sum | start | end
-------------------------------------
0 | 123 | 1 | 2
1 | 456 | 3 | 4
2 | 789 | null | null
What's wrong?
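The shift happens because each xpath() call collects matching text nodes across the whole document independently: the empty `<periods></periods>` contributes no `start`/`end` text node, so those arrays have only two elements and slide left relative to `sum`. A minimal sketch with Python's stdlib ElementTree (not the Spark xpath() UDF, but the same XPath collection behavior) illustrates the length mismatch:

```python
# Illustration with stdlib ElementTree (not Spark): collecting text values
# skips the empty <periods>, so the arrays end up with different lengths
# and positions no longer line up with `sum`.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<foo>
  <bar><sum>123</sum><periods><start>1</start><end>2</end></periods></bar>
  <bar><sum>456</sum><periods></periods></bar>
  <bar><sum>789</sum><periods><start>3</start><end>4</end></periods></bar>
</foo>
""")

sums   = [e.text for e in doc.findall('./bar/sum')]            # 3 values
starts = [e.text for e in doc.findall('./bar/periods/start')]  # only 2 values
ends   = [e.text for e in doc.findall('./bar/periods/end')]    # only 2 values

print(sums)    # ['123', '456', '789']
print(starts)  # ['1', '3'] -- starts[1] is '3', which wrongly pairs with sum 456
```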
You can use spark-xml to achieve the same result, as shown below. Parsing each `bar` struct as a whole keeps `sum`, `start`, and `end` aligned per row, so empty `periods` elements simply become nulls.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, posexplode

# spark-xml package provides the "xml" data source
spark = (SparkSession.builder.master("local[*]")
         .config('spark.jars.packages', 'com.databricks:spark-xml_2.12:0.18.0')
         .appName("spark-db-xml").getOrCreate())

xmlFile = "pos.xml"
books = (spark.read
         .format("xml")
         .option("rowTag", "foo")
         .load(xmlFile))

# Explode the array of <bar> structs, then pull out the nested fields;
# a missing start/end inside an empty <periods> comes back as null.
books = (books.select(posexplode(col("bar"))).withColumnRenamed("col", "bar")
         .select("pos", "bar.sum", "bar.periods.start", "bar.periods.end"))
books.show(truncate=False)