Dataset filter is not working as expected


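Scenario: I have read two XML files, specifying a schema at load time.

In the schema, one of the tags is mandatory. One of the XML files is missing that mandatory tag.

Now, when I do the following, I expect the XML that lacks the mandatory tag to be filtered out.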

dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());

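In the code, when I count the rows of the dataset I get 2 (the 2 input XMLs), but when I try to print the dataset with the show() method I get an NPE.

When I debug the line above and evaluate the following, the count is 0.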

dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();

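Questions:

Can anyone answer the questions below or confirm my understanding?

  1. Why does the Spark dataset not filter out the rows that lack the mandatory column?
  2. Why does count throw no exception while show does?

For 2, I believe count simply counts the rows without looking at their contents. To display them, the iterator actually walks the struct fields to print their values, and it fails when it cannot find the mandatory column.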
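To make the guess about question 2 concrete, here is a minimal plain-Java sketch of the idea. It is only an analogy and does not use or reproduce Spark's internals: the class `CountVersusShow`, its nested `Row` type, and the methods `count` and `show` are hypothetical names invented for illustration. It assumes that counting never reads a field, while rendering dereferences a field that is declared mandatory and therefore is not null-checked.

```java
import java.util.ArrayList;
import java.util.List;

class CountVersusShow {
    // Hypothetical stand-in for a row; a null field models the missing mandatory tag.
    static final class Row {
        final String mandatory;

        Row(String mandatory) {
            this.mandatory = mandatory;
        }
    }

    // Counting only needs to know how many rows exist; it never reads a field.
    static long count(List<Row> rows) {
        return rows.size();
    }

    // Rendering reads every field. The field is treated as mandatory, so there is
    // no null check, and the row that lacks the value triggers a NullPointerException.
    static String show(List<Row> rows) {
        StringBuilder out = new StringBuilder();
        for (Row row : rows) {
            out.append(row.mandatory.trim()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("present"));
        rows.add(new Row(null)); // models the XML without the mandatory tag

        System.out.println("count = " + count(rows));
        try {
            System.out.print(show(rows));
        } catch (NullPointerException e) {
            System.out.println("show failed with NullPointerException");
        }
    }
}
```

In this toy model, counting succeeds on both rows while rendering fails on the second one, which matches the behaviour I am seeing. Whether Spark behaves this way for the same reason is exactly what I would like confirmed.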

P.S. If I make the mandatory column optional, everything works fine.

Edit:

Read options, as requested

To load the data I am doing the following

Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
                .option("header", "true")
                .option("inferSchema", "false")
                .option("rowTag", rowTag)//rowTag is "body" tag in the XML
                .option("failFast", "true")
                .option("mode", "FAILFAST")
                .schema(schema)
                .load(XMLfilePath);

Sample, as requested

Schema:

root
 |-- old: struct (nullable = true)
 |    |-- _beyond: string (nullable = true)
 |    |-- lot: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _chose: string (nullable = true)
 |    |-- real: struct (nullable = true)
 |    |    |-- _eat: string (nullable = true)
 |    |    |-- kill: struct (nullable = true)
 |    |    |    |-- _VALUE: double (nullable = true)
 |    |    |    |-- _top: string (nullable = true)
 |    |    |-- tool: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _affect: string (nullable = true)
 |-- porch: struct (nullable = true)
 |    |-- _account: string (nullable = true)
 |    |-- cast: string (nullable = true)
 |    |-- vegetable: struct (nullable = true)
 |    |    |-- leg: struct (nullable = true)
 |    |    |    |-- _VALUE: double (nullable = true)
 |    |    |    |-- _nose: string (nullable = true)
 |    |    |-- now: struct (nullable = true)
 |    |    |    |-- _gravity: string (nullable = true)
 |    |    |    |-- chief: struct (nullable = true)
 |    |    |    |    |-- _VALUE: long (nullable = true)
 |    |    |    |    |-- _further: string (nullable = true)
 |    |    |    |-- field: string (nullable = true)

Sample XML:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
    <body>
    <porch account="something">
        <vegetable>
            <now gravity="wide">
                <field>box</field>
                <chief further="satisfied">-1889614487</chief>
            </now>
            <leg nose="angle">912658017.229279</leg>
        </vegetable>
        <cast>clear</cast>
    </porch>
    <old beyond="continent">
        <real eat="term">
            <kill top="plates">-1623084908.8669372</kill>
            <tool affect="pond">today</tool>
        </real>
        <lot chose="swung">promised</lot>
    </old>
    </body>
</root>

Schema in JSON format:

{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}

My scenario can be reproduced by setting the element "old" to nullable = false and removing that tag from the XML.

java xml apache-spark apache-spark-dataset apache-spark-xml
1 Answer

Actually, this is caused by the line that assigns the filtered DF back to the same original DF, dataset.

I was able to reproduce it with:

val df = df.filter(col("old").isNotNull)

java.lang.NullPointerException ... 49 elided

When working with DataFrames in Spark, you should avoid nested transformations. Change that line to:

// Assign the filtered result to a new variable instead of reusing dataset.
Dataset<Row> newDataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());
newDataset.show();

That should work.
