如何从spark数据帧中过滤掉null值

问题描述 投票:40回答:9

我使用以下模式在spark中创建了一个数据框:

root
 |-- user_id: long (nullable = false)
 |-- event_id: long (nullable = false)
 |-- invited: integer (nullable = false)
 |-- day_diff: long (nullable = true)
 |-- interested: integer (nullable = false)
 |-- event_owner: long (nullable = false)
 |-- friend_id: long (nullable = false)

数据如下所示:

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|     null|
|  78065188| 498404626|      0|       0|         0| 2904922087|     null|
| 282487230|2520855981|      0|      28|         0| 3749735525|     null|
| 335269852|1641491432|      0|       2|         0| 1490350911|     null|
| 437050836|1238456614|      0|       2|         0|  991277599|     null|
| 447244169|2095085551|      0|      -1|         0| 1579858878|     null|
| 516353916|1076364848|      0|       3|         1| 3597645735|     null|
| 528218683|1151525474|      0|       1|         0| 3433080956|     null|
| 531967718|3632072502|      0|       1|         0| 3863085861|     null|
| 627948360|2823119321|      0|       0|         0| 4092665803|     null|
| 811791433|3513954032|      0|       2|         0|  415464198|     null|
| 830686203|  99027353|      0|       0|         0| 3549822604|     null|
|1008893291|1115453150|      0|       2|         0| 2245155244|     null|
|1239364869|2824096896|      0|       2|         1| 2579294650|     null|
|1287950172|1076364848|      0|       0|         0| 3597645735|     null|
|1345896548|2658555390|      0|       1|         0| 2025118823|     null|
|1354205322|2564682277|      0|       3|         0| 2563033185|     null|
|1408344828|1255629030|      0|      -1|         1|  804901063|     null|
|1452633375|1334001859|      0|       4|         0| 1488588320|     null|
|1625052108|3297535757|      0|       3|         0| 1972598895|     null|
+----------+----------+-------+--------+----------+-----------+---------+

我想过滤掉“friend_id”字段中的行具有空值。

scala> val aaa = test.filter("friend_id is null")

scala> aaa.count

我得到了:res52:Long = 0这显然不对。获得它的正确方法是什么?

还有一个问题,我想替换friend_id字段中的值。我想用0和1替换null,除了null之外的任何其他值。我能弄清楚的代码是:

val aaa = train_friend_join.select($"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner", ($"friend_id" != null)?1:0)

此代码也不起作用。任何人都可以告诉我如何解决它?谢谢

scala apache-spark apache-spark-sql spark-dataframe
9个回答
55
投票

假设你有这个数据设置(这样结果可以重现):

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data
val e1 = Employee("n1", null, "[email protected]", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "[email protected]", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "[email protected]", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "[email protected]", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "[email protected]", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "[email protected]", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "[email protected]", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "[email protected]", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF

数据是:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|[email protected]|[c1,1,d1]|
|  n2|   2|[email protected]|[c1,1,d1]|
|  n3|   3|[email protected]|[c1,1,d1]|
|  n4|   4|[email protected]|[c2,2,d2]|
|  n5|null|[email protected]|[c2,2,d2]|
|  n6|   6|[email protected]|[c2,2,d2]|
|  n7|   7|[email protected]|[c3,3,d3]|
|  n8|   8|[email protected]|[c3,3,d3]|
+----+----+---------+---------+

现在用null id过滤员工,你会做 -

df.filter("id is null").show

这将正确显示以下内容:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|[email protected]|[c1,1,d1]|
|  n5|null|[email protected]|[c2,2,d2]|
+----+----+---------+---------+

来到你问题的第二部分,你可以用0替换null id和其他值为1 -

df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show

这导致:

+----+---+---------+---------+
|name| id|    email|  company|
+----+---+---------+---------+
|  n1|  0|[email protected]|[c1,1,d1]|
|  n2|  1|[email protected]|[c1,1,d1]|
|  n3|  1|[email protected]|[c1,1,d1]|
|  n4|  1|[email protected]|[c2,2,d2]|
|  n5|  0|[email protected]|[c2,2,d2]|
|  n6|  1|[email protected]|[c2,2,d2]|
|  n7|  1|[email protected]|[c3,3,d3]|
|  n8|  1|[email protected]|[c3,3,d3]|
+----+---+---------+---------+

35
投票

或者像df.filter($"friend_id".isNotNull)


15
投票
df.where(df.col("friend_id").isNull)

15
投票

对我来说一个很好的解决方案是删除任何空值的行:

Dataset<Row> filtered = df.filter(row => !row.anyNull());

如果一个人对另一个案件感兴趣,请拨打row.anyNull()。 (使用Java API的Spark 2.1.0)


10
投票

有两种方法可以做到这一点:创建过滤条件1)手动2)动态。

示例DataFrame:

val df = spark.createDataFrame(Seq(
  (0, "a1", "b1", "c1", "d1"),
  (1, "a2", "b2", "c2", "d2"),
  (2, "a3", "b3", null, "d3"),
  (3, "a4", null, "c4", "d4"),
  (4, null, "b5", "c5", "d5")
)).toDF("id", "col1", "col2", "col3", "col4")

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  0|  a1|  b1|  c1|  d1|
|  1|  a2|  b2|  c2|  d2|
|  2|  a3|  b3|null|  d3|
|  3|  a4|null|  c4|  d4|
|  4|null|  b5|  c5|  d5|
+---+----+----+----+----+

1)手动创建过滤条件,即使用DataFrame wherefilter函数

df.filter(col("col1").isNotNull && col("col2").isNotNull).show

要么

df.where("col1 is not null and col2 is not null").show

结果:

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  0|  a1|  b1|  c1|  d1|
|  1|  a2|  b2|  c2|  d2|
|  2|  a3|  b3|null|  d3|
+---+----+----+----+----+

2)动态创建过滤条件:当我们不希望任何列具有空值并且列数很多(大多数情况下是这种情况)时,这很有用。

在这些情况下手动创建过滤条件会浪费大量时间。在下面的代码中,我们在DataFrame列上使用mapreduce函数动态包含所有列:

val filterCond = df.columns.map(x=>col(x).isNotNull).reduce(_ && _)

filterCond的样子:

filterCond: org.apache.spark.sql.Column = (((((id IS NOT NULL) AND (col1 IS NOT NULL)) AND (col2 IS NOT NULL)) AND (col3 IS NOT NULL)) AND (col4 IS NOT NULL))

过滤:

val filteredDf = df.filter(filterCond)

结果:

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  0|  a1|  b1|  c1|  d1|
|  1|  a2|  b2|  c2|  d2|
+---+----+----+----+----+

1
投票

从Michael Kopaniov的暗示,下面的作品

df.where(df("id").isNotNull).show

1
投票

这是Java中spark的解决方案。选择包含空值的数据行。拥有数据集数据时,您可以:

Dataset<Row> containingNulls =  data.where(data.col("COLUMN_NAME").isNull())

要过滤掉没有空值的数据,您可以:

Dataset<Row> withoutNulls = data.where(data.col("COLUMN_NAME").isNotNull())

通常,数据框包含String类型的列,而不是null,我们有空字符串,如“”。为了过滤掉这些数据,我们做了:

Dataset<Row> withoutNullsAndEmpty = data.where(data.col("COLUMN_NAME").isNotNull().and(data.col("COLUMN_NAME").notEqual("")))

1
投票

对于第一个问题,你正在过滤掉空值是正确的,因此计数为零。

第二次替换:使用如下:

val options = Map("path" -> "...\\ex.csv", "header" -> "true")
val dfNull = spark.sqlContext.load("com.databricks.spark.csv", options)

scala> dfNull.show

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|     null|
|  78065188| 498404626|      0|       0|         0| 2904922087|     null|
| 282487230|2520855981|      0|      28|         0| 3749735525|     null|
| 335269852|1641491432|      0|       2|         0| 1490350911|     null|
| 437050836|1238456614|      0|       2|         0|  991277599|     null|
| 447244169|2095085551|      0|      -1|         0| 1579858878|        a|
| 516353916|1076364848|      0|       3|         1| 3597645735|        b|
| 528218683|1151525474|      0|       1|         0| 3433080956|        c|
| 531967718|3632072502|      0|       1|         0| 3863085861|     null|
| 627948360|2823119321|      0|       0|         0| 4092665803|     null|
| 811791433|3513954032|      0|       2|         0|  415464198|     null|
| 830686203|  99027353|      0|       0|         0| 3549822604|     null|
|1008893291|1115453150|      0|       2|         0| 2245155244|     null|
|1239364869|2824096896|      0|       2|         1| 2579294650|        d|
|1287950172|1076364848|      0|       0|         0| 3597645735|     null|
|1345896548|2658555390|      0|       1|         0| 2025118823|     null|
|1354205322|2564682277|      0|       3|         0| 2563033185|     null|
|1408344828|1255629030|      0|      -1|         1|  804901063|     null|
|1452633375|1334001859|      0|       4|         0| 1488588320|     null|
|1625052108|3297535757|      0|       3|         0| 1972598895|     null|
+----------+----------+-------+--------+----------+-----------+---------+

dfNull.withColumn("friend_idTmp", when($"friend_id".isNull, "1").otherwise("0")).drop($"friend_id").withColumnRenamed("friend_idTmp", "friend_id").show

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|        1|
|  78065188| 498404626|      0|       0|         0| 2904922087|        1|
| 282487230|2520855981|      0|      28|         0| 3749735525|        1|
| 335269852|1641491432|      0|       2|         0| 1490350911|        1|
| 437050836|1238456614|      0|       2|         0|  991277599|        1|
| 447244169|2095085551|      0|      -1|         0| 1579858878|        0|
| 516353916|1076364848|      0|       3|         1| 3597645735|        0|
| 528218683|1151525474|      0|       1|         0| 3433080956|        0|
| 531967718|3632072502|      0|       1|         0| 3863085861|        1|
| 627948360|2823119321|      0|       0|         0| 4092665803|        1|
| 811791433|3513954032|      0|       2|         0|  415464198|        1|
| 830686203|  99027353|      0|       0|         0| 3549822604|        1|
|1008893291|1115453150|      0|       2|         0| 2245155244|        1|
|1239364869|2824096896|      0|       2|         1| 2579294650|        0|
|1287950172|1076364848|      0|       0|         0| 3597645735|        1|
|1345896548|2658555390|      0|       1|         0| 2025118823|        1|
|1354205322|2564682277|      0|       3|         0| 2563033185|        1|
|1408344828|1255629030|      0|      -1|         1|  804901063|        1|
|1452633375|1334001859|      0|       4|         0| 1488588320|        1|
|1625052108|3297535757|      0|       3|         0| 1972598895|        1|
+----------+----------+-------+--------+----------+-----------+---------+

0
投票

我使用以下代码来解决我的问题。有用。但众所周知,我在一个国家的工作中解决它。那么,有没有捷径?谢谢

def filter_null(field : Any) : Int = field match {
    case null => 0
    case _    => 1
}

val test = train_event_join.join(
    user_friends_pair,
    train_event_join("user_id") === user_friends_pair("user_id") &&
    train_event_join("event_owner") === user_friends_pair("friend_id"),
    "left"
).select(
    train_event_join("user_id"),
    train_event_join("event_id"),
    train_event_join("invited"),
    train_event_join("day_diff"),
    train_event_join("interested"),
    train_event_join("event_owner"),
    user_friends_pair("friend_id")
).rdd.map{
    line => (
        line(0).toString.toLong,
        line(1).toString.toLong,
        line(2).toString.toLong,
        line(3).toString.toLong,
        line(4).toString.toLong,
        line(5).toString.toLong,
        filter_null(line(6))
        )
    }.toDF("user_id", "event_id", "invited", "day_diff", "interested", "event_owner", "creator_is_friend")
© www.soinside.com 2019 - 2024. All rights reserved.