场景:我已经通过在加载时指定架构读取了两个XML文件。
在架构中,标记之一是必需的。一种XML缺少该必需标记。
现在,当我执行以下操作时,我期望具有必需标记的XML被过滤掉。
dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());
在代码中,当我尝试计算数据集的行数时,我得到的计数为2(2个输入XMLS),但是当我尝试通过show()方法打印数据集时,我得到了NPE。
调试上面的行并执行以下操作时,计数为0。
dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();
问题:
任何人都可以在下面回答问题/确认我的理解
对于2,我相信该计数只是在不查看内容的情况下对行数进行计数。为了进行显示,迭代器实际上遍历了“结构体字段”以打印其值,并且当找不到强制列时,它就会出错。
P.S。如果我将强制性列设为可选,则一切正常。
编辑:
根据要求提供阅读选项
为了加载数据,我正在执行以下操作
Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
.option("header", "true")
.option("inferSchema", "false")
.option("rowTag", rowTag)//rowTag is "body" tag in the XML
.option("failFast", "true")
.option("mode", "FAILFAST")
.schema(schema)
.load(XMLfilePath);
根据要求提供样品
模式:
root
|-- old: struct (nullable = true)
| |-- _beyond: string (nullable = true)
| |-- lot: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _chose: string (nullable = true)
| |-- real: struct (nullable = true)
| | |-- _eat: string (nullable = true)
| | |-- kill: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _top: string (nullable = true)
| | |-- tool: struct (nullable = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _affect: string (nullable = true)
|-- porch: struct (nullable = true)
| |-- _account: string (nullable = true)
| |-- cast: string (nullable = true)
| |-- vegetable: struct (nullable = true)
| | |-- leg: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _nose: string (nullable = true)
| | |-- now: struct (nullable = true)
| | | |-- _gravity: string (nullable = true)
| | | |-- chief: struct (nullable = true)
| | | | |-- _VALUE: long (nullable = true)
| | | | |-- _further: string (nullable = true)
| | | |-- field: string (nullable = true)
示例XML:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<body>
<porch account="something">
<vegetable>
<now gravity="wide">
<field>box</field>
<chief further="satisfied">-1889614487</chief>
</now>
<leg nose="angle">912658017.229279</leg>
</vegetable>
<cast>clear</cast>
</porch>
<old beyond="continent">
<real eat="term">
<kill top="plates">-1623084908.8669372</kill>
<tool affect="pond">today</tool>
</real>
<lot chose="swung">promised</lot>
</old>
</body>
</root>
JSON格式的架构:
{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}
我的场景可以通过将元素“ old”设置为nullable = false并从XML中删除标记来进行复制
实际上,这是由将过滤后的DF分配给相同的原始DF dataset
的行引起的。
我能够用:复制它]
val df = df.filter(col("old").isNotNull)
java.lang.NullPointerException... 49消失
在Spark中处理DataFrame时,应避免嵌套转换。将该行更改为:
new_dataset = dataset.filter(col("mandatoryColumnNameInSchema").isNotNull()) new_dataset.show()
应该工作