Dataset filter behaving unexpectedly

Question

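Scenario: I have read two XML files by specifying a schema at load time.

In the schema, one of the tags is mandatory. One of the XML files is missing that mandatory tag.

Now, when I do the following, I expect the XML that lacks the mandatory tag to be filtered out.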

dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());

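In the code, when I count the rows of the dataset I get a count of 2 (the 2 input XMLs), but when I try to print the dataset via the show() method, I get an NPE.

When I debug the line above and evaluate the following, the count is 0.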

dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();

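Questions:

Can anyone answer the questions below or confirm my understanding?

1. Why does the Spark dataset not filter out the rows that lack the mandatory column?
2. Why does count throw no exception while show does?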

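For 2, I believe count just counts the rows without looking at their contents. For show, the iterator actually walks the StructFields to print their values, and it fails when it cannot find the mandatory column.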
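The hypothesis above can be illustrated with a toy model in plain Java. This is only an analogy, not Spark's implementation: the class ToyRowDemo and its count and show methods are hypothetical names invented for this sketch, and each row is modeled as a String array whose first element stands in for the mandatory column.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: plain Java, not Spark code.
final class ToyRowDemo {

    // Counting only needs the number of rows; no field is ever read.
    static long count(List<String[]> rows) {
        return rows.size();
    }

    // Rendering reads every row's first field. That field is treated as
    // "mandatory", so it is dereferenced without a null check.
    static String show(List<String[]> rows) {
        StringBuilder out = new StringBuilder();
        for (String[] row : rows) {
            out.append(row[0].trim()).append('\n'); // NPE if row[0] is null
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"present"},
                new String[] {null}); // second "file" lacks the mandatory value
        System.out.println("count = " + count(rows));
        try {
            System.out.print(show(rows));
        } catch (NullPointerException e) {
            System.out.println("show failed with NullPointerException");
        }
    }
}
```

In this sketch count succeeds because it never touches a field, while show fails on the row whose mandatory value is missing. Whether Spark behaves this way for the same reason is exactly what I would like confirmed.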

P.S. If I make the mandatory column optional, everything works fine.

Edit:

Read options, as requested

To load the data, I am doing the following:

Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
                .option("header", "true")
                .option("inferSchema", "false")
                .option("rowTag", rowTag)//rowTag is "body" tag in the XML
                .option("failFast", "true")
                .option("mode", "FAILFAST")
                .schema(schema)
                .load(XMLfilePath);

Sample, as requested

Schema:

root
 |-- old: struct (nullable = true)
 |    |-- _beyond: string (nullable = true)
 |    |-- lot: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _chose: string (nullable = true)
 |    |-- real: struct (nullable = true)
 |    |    |-- _eat: string (nullable = true)
 |    |    |-- kill: struct (nullable = true)
 |    |    |    |-- _VALUE: double (nullable = true)
 |    |    |    |-- _top: string (nullable = true)
 |    |    |-- tool: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _affect: string (nullable = true)
 |-- porch: struct (nullable = true)
 |    |-- _account: string (nullable = true)
 |    |-- cast: string (nullable = true)
 |    |-- vegetable: struct (nullable = true)
 |    |    |-- leg: struct (nullable = true)
 |    |    |    |-- _VALUE: double (nullable = true)
 |    |    |    |-- _nose: string (nullable = true)
 |    |    |-- now: struct (nullable = true)
 |    |    |    |-- _gravity: string (nullable = true)
 |    |    |    |-- chief: struct (nullable = true)
 |    |    |    |    |-- _VALUE: long (nullable = true)
 |    |    |    |    |-- _further: string (nullable = true)
 |    |    |    |-- field: string (nullable = true)

Sample XML:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
    <body>
    <porch account="something">
        <vegetable>
            <now gravity="wide">
                <field>box</field>
                <chief further="satisfied">-1889614487</chief>
            </now>
            <leg nose="angle">912658017.229279</leg>
        </vegetable>
        <cast>clear</cast>
    </porch>
    <old beyond="continent">
        <real eat="term">
            <kill top="plates">-1623084908.8669372</kill>
            <tool affect="pond">today</tool>
        </real>
        <lot chose="swung">promised</lot>
    </old>
    </body>
</root>

Schema in JSON format:

{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}

My scenario can be reproduced by setting the element "old" to nullable = false and removing that tag from the XML.

Answer 1

Actually, this is caused by the line that assigns the filtered DF back to the same original DF, dataset.

I was able to reproduce it with:

val df = df.filter(col("old").isNotNull)
java.lang.NullPointerException
  ... 49 elided

When working with DataFrames in Spark, you should avoid nested transformations. Change that line to:

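// Assign the filtered result to a new variable instead of reusing dataset
Dataset<Row> new_dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());
new_dataset.show();

That should work.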
