How to parse XML in Spark SQL?


I am trying to parse an XML column stored with Spark SQL 2.3.0. The XML string looks like:

     <foo>
     <bar>
       <sum>123</sum>      
       <periods>
         <start>1</start>
         <end>2</end>
       </periods>
     </bar>
     <bar>
       <sum>456</sum>
       <periods></periods>
     </bar>
     <bar>
       <sum>789</sum>
       <periods>
         <start>3</start>
         <end>4</end>
       </periods>
     </bar>
     </foo>

I want to parse and transpose the "sum", "start" and "end" values by building Spark arrays, getting the position with posexplode, and then looking up the array elements by position. The Spark code:

val Df1 = spark.sql("""
    select
        xpath(clob_data, '//foo/bar/sum/text()') as sum,
        xpath(clob_data, '//foo/bar/periods/start/text()') as start,
        xpath(clob_data, '//foo/bar/periods/end/text()') as end
    from SourceDf
""")
Df1.createOrReplaceTempView("Df1")

val Df2 = spark.sql("""
    select
        pos,
        sum[pos] as sum,
        start[pos] as start,
        end[pos] as end
    from Df1
    lateral view posexplode(sum)
        exploded_id as pos, value
""").show

Expected output:

    pos  |  sum  |  start  |  end
-------------------------------------
    0    |  123  |   1     |   2
    1    |  456  |  null   |  null
    2    |  789  |   3     |   4

Actual output:

    pos  |  sum  |  start  |  end
-------------------------------------
    0    |  123  |   1     |   2
    1    |  456  |   3     |   4
    2    |  789  |  null   |  null

What is going wrong?

xml scala apache-spark apache-spark-sql
1 Answer

You can use spark-xml to achieve the same result, as shown below.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, posexplode

spark = (SparkSession.builder.master("local[*]")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0")
         .appName("spark-db-xml")
         .getOrCreate())

xmlFile = "pos.xml"

# Each <foo> element becomes one row; "bar" is read as an array of structs.
books = (spark.read
         .format("xml")
         .option("rowTag", "foo")
         .load(xmlFile))

# posexplode keeps the array position, so the empty <periods> element
# produces null start/end values instead of shifting later rows.
books = (books.select(posexplode(col("bar"))).withColumnRenamed("col", "bar")
         .select("pos", "bar.sum", "bar.periods.start", "bar.periods.end"))

books.show(truncate=False)
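
Since the question is written in Scala, here is a minimal Scala sketch of the same spark-xml approach, assuming the same package version as above and a placeholder input path "pos.xml":

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, posexplode}

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0")
  .appName("spark-db-xml")
  .getOrCreate()

// Each <foo> element becomes one row; "bar" is parsed as an array of structs.
val books = spark.read
  .format("xml")
  .option("rowTag", "foo")
  .load("pos.xml")  // placeholder path

// Explode with position, then pull the nested fields out of each bar struct.
val result = books
  .select(posexplode(col("bar")))
  .withColumnRenamed("col", "bar")
  .select("pos", "bar.sum", "bar.periods.start", "bar.periods.end")

result.show(false)

Because this explodes the parsed array of bar structs rather than three independent xpath arrays, the empty <periods> element simply yields null start/end at position 1 instead of shifting the later values, which is what caused the misaligned output in the question.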