Drop rows in Spark that do not follow the schema

Question (votes: 1, answers: 1)

Currently, the schema of my table is:

root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- aisle_id: string (nullable = true)
 |-- department_id: string (nullable = true)

I want to apply the following schema to the table above and drop every row that does not conform to it:

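import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Target schema: aisle_id and department_id become integers.
val productsSchema = StructType(Seq(
  StructField("product_id", IntegerType, nullable = true),
  StructField("product_name", StringType, nullable = true),
  StructField("aisle_id", IntegerType, nullable = true),
  StructField("department_id", IntegerType, nullable = true)
))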
scala apache-spark filter rows drop
1 Answer (0 votes)

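When a record does not match the schema, Spark fills its columns with null. It is therefore enough to filter out the rows in which every column is null.

The filter in the session below keeps a row if at least one of its columns is not null.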

scala> "cat /tmp/sample.json".! // Contents of the JSON file; one record does not match the schema.
{"product_id":1,"product_name":"sampleA","aisle_id":"AA","department_id":"AAD"}
{"product_id":2,"product_name":"sampleBB","aisle_id":"AAB","department_id":"AADB"}
{"product_id":3,"product_name":"sampleCC","aisle_id":"CC","department_id":"CCC"}
{"product_id":3,"product_name":"sampledd","aisle_id":"dd","departmentId":"ddd"}
{"name","srinivas","age":29}
res100: Int = 0

scala> schema.printTreeString
root
 |-- aisle_id: string (nullable = true)
 |-- department_id: string (nullable = true)
 |-- product_id: long (nullable = true)
 |-- product_name: string (nullable = true)


scala> val df = spark.read.schema(schema).option("badRecordsPath", "/tmp/badRecordsPath").format("json").load("/tmp/sample.json") // Load the JSON data; a record that does not match the schema comes back with null in every column.
df: org.apache.spark.sql.DataFrame = [aisle_id: string, department_id: string ... 2 more fields]

scala> df.show(false)
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
|null    |null         |null      |null        |
+--------+-------------+----------+------------+


scala> df.filter(df.columns.map(c => s"${c} is not null").mkString(" or ")).show(false) // Keep rows with at least one non-null column, i.e. drop the all-null rows.
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
+--------+-------------+----------+------------+

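The filter in the session above joins the per-column checks with `or`, which is why the `dd` row survives even though its `department_id` is null. The plain-Scala sketch below only builds the two predicate strings so the difference is visible. The column names are copied from the transcript, and the object name `NullPredicates` is made up for illustration.

```scala
object NullPredicates {
  // Column names copied from the DataFrame in the transcript.
  val columns: Seq[String] =
    Seq("aisle_id", "department_id", "product_id", "product_name")

  // Keeps a row if at least one column is non-null,
  // so only rows that are null in every column are dropped.
  val anyNonNull: String =
    columns.map(c => s"$c is not null").mkString(" or ")

  // Stricter variant: keeps a row only if every column is non-null.
  val allNonNull: String =
    columns.map(c => s"$c is not null").mkString(" and ")

  def main(args: Array[String]): Unit = {
    println(anyNonNull)
    println(allNonNull)
  }
}
```

Passing `NullPredicates.allNonNull` to `df.filter` would also drop the `dd` row. As far as I know, `df.na.drop("all")` and `df.na.drop("any")` express the same two conditions without building a string.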

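The session above reads `aisle_id` and `department_id` as strings, whereas the schema in the question declares them as `IntegerType`. One possible way to enforce the question's schema on an already loaded DataFrame is to cast each column and drop the rows where a non-null value fails the cast. This is a minimal sketch and not the answerer's method: the helper `conformTo`, the sample rows and the local `SparkSession` settings are assumptions for illustration. It relies on a failed cast yielding null, so ANSI mode is switched off explicitly.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DropNonConformingRows {
  // Target schema from the question.
  val productsSchema: StructType = StructType(Seq(
    StructField("product_id", IntegerType, nullable = true),
    StructField("product_name", StringType, nullable = true),
    StructField("aisle_id", IntegerType, nullable = true),
    StructField("department_id", IntegerType, nullable = true)
  ))

  // Hypothetical helper: keeps a row only if every non-null value
  // survives the cast to the target type, then applies the casts.
  def conformTo(df: DataFrame, schema: StructType): DataFrame = {
    val castable = schema.fields
      .map(f => col(f.name).isNull || col(f.name).cast(f.dataType).isNotNull)
      .reduce(_ && _)
    df.filter(castable)
      .select(schema.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("drop-non-conforming-rows")
      // Assumption: with ANSI mode off, a failed cast returns null instead of raising.
      .config("spark.sql.ansi.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows shaped like the question's table.
    val products = Seq(
      (1, "sampleA", "10", "20"),                        // conforms
      (2, "sampleBB", "AAB", "30"),                      // aisle_id is not an integer
      (3, "sampleCC", "12", null.asInstanceOf[String])   // null is allowed (nullable = true)
    ).toDF("product_id", "product_name", "aisle_id", "department_id")

    conformTo(products, productsSchema).show(false)
    spark.stop()
  }
}
```

With these sample rows, the second row is dropped because `"AAB"` cannot be cast to an integer. The third row is kept because the target columns are nullable.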