Spark DataFrame cannot compare null values

Question  Votes: 0  Answers: 2

Hi all, I have two DataFrames. I compare the values in both and, based on the result, assign a value to a column in a new DataFrame. Every scenario works except comparing null fields: if a value is null in both DataFrames, the result should be "Varified", but I get "Not Varified". My DataFrames, the code I'm using, and the final DataFrame are shown below.

    scala> df1.show()
    +---+-----+---+--------+------+-------+
    | id| name|age|lastname|  city|country|
    +---+-----+---+--------+------+-------+
    |  1|rohan| 26|  sharma|mumbai|  india|
    |  2|rohan| 26|  sharma|  null|  india|
    |  3|rohan| 26|    null|mumbai|  india|
    |  4|rohan| 26|  sharma|mumbai|  india|
    +---+-----+---+--------+------+-------+
    scala> df2.show()
    +----+------+-----+----------+------+---------+
    |o_id|o_name|o_age|o_lastname|o_city|o_country|
    +----+------+-----+----------+------+---------+
    |   1| rohan|   26|    sharma|mumbai|    india|
    |   2| rohan|   26|    sharma|  null|    india|
    |   3| rohan|   26|    sharma|mumbai|    india|
    |   4| rohan|   26|      null|mumbai|    india|
    +----+------+-----+----------+------+---------+

    val df3 = df1.join(df2, df1("id") === df2("o_id"))
    .withColumn("result", when(df1("name") === df2("o_name") && 
    df1("age") === df2("o_age") && 
    df1("lastname") === df2("o_lastname") && 
    df1("city") === df2("o_city")  &&
    df1("country") === df2("o_country"), "Varified")
    .otherwise("Not Varified")).show()

    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    | id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    |  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
    |  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|Not Varified|
    |  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
    |  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+

I want id 2 to show as 'Varified' as well, but because city is null in both DataFrames it shows as 'Not Varified'. Can someone please tell me how to modify my df3 query so that it also handles nulls and id 2 shows as 'Varified' in the result column?

scala apache-spark apache-spark-sql pyspark-dataframes
2 Answers
1 vote

Use <=> instead of ===

    val df3 = df1.join(df2, df1("id") === df2("o_id"))
      .withColumn("result", when(df1("name") <=> df2("o_name") &&
        df1("age") <=> df2("o_age") &&
        df1("lastname") <=> df2("o_lastname") &&
        df1("city") <=> df2("o_city") &&
        df1("country") <=> df2("o_country"), "Varified")
      .otherwise("Not Varified"))
    df3.show()

    // <=> is null-safe equality: NULL <=> NULL evaluates to true
    spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show

Result

    +-----------------+
    |(city1 <=> city2)|
    +-----------------+
    |             true|
    +-----------------+
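
When many column pairs need the same null-safe comparison, the condition can be built from a list of column names instead of being written out by hand. The following is a minimal sketch, assuming the `o_` prefix convention from the question and a local SparkSession; `allMatch` and `NullSafeCompare` are hypothetical names, not part of Spark.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.when

object NullSafeCompare {
  // Hypothetical helper: ANDs a null-safe comparison (<=>) over every column pair.
  // Assumes each right-hand column is named prefix + left-hand column name.
  def allMatch(left: DataFrame, right: DataFrame, cols: Seq[String], prefix: String): Column =
    cols.map(c => left(c) <=> right(prefix + c)).reduce(_ && _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("null-safe-compare").getOrCreate()
    import spark.implicits._

    // Sample data copied from the question; None becomes a SQL null.
    val df1 = Seq[(Int, String, Int, Option[String], Option[String], String)](
      (1, "rohan", 26, Some("sharma"), Some("mumbai"), "india"),
      (2, "rohan", 26, Some("sharma"), None, "india"),
      (3, "rohan", 26, None, Some("mumbai"), "india"),
      (4, "rohan", 26, Some("sharma"), Some("mumbai"), "india")
    ).toDF("id", "name", "age", "lastname", "city", "country")

    val df2 = Seq[(Int, String, Int, Option[String], Option[String], String)](
      (1, "rohan", 26, Some("sharma"), Some("mumbai"), "india"),
      (2, "rohan", 26, Some("sharma"), None, "india"),
      (3, "rohan", 26, Some("sharma"), Some("mumbai"), "india"),
      (4, "rohan", 26, None, Some("mumbai"), "india")
    ).toDF("o_id", "o_name", "o_age", "o_lastname", "o_city", "o_country")

    val cols = Seq("name", "age", "lastname", "city", "country")
    df1.join(df2, df1("id") === df2("o_id"))
      .withColumn("result", when(allMatch(df1, df2, cols, "o_"), "Varified").otherwise("Not Varified"))
      .select("id", "result")
      .orderBy("id")
      .show()

    spark.stop()
  }
}
```

With the sample data from the question, ids 1 and 2 should come out as Varified and ids 3 and 4 as Not Varified.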

0 votes

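In your when + otherwise expression, use the <=> operator, or add an || clause that checks .isNull, for the lastname and city columns.

The rows fail to match because null = null returns null rather than true, and when treats a null condition as not satisfied.

    spark.sql("select null=null").show()
    //+-------------+
    //|(NULL = NULL)|
    //+-------------+
    //|         null|
    //+-------------+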
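
Why the original query falls through to otherwise can be modelled in plain Scala, without Spark. This is only an illustrative sketch: `Option` stands in for a nullable SQL value, and `sqlEq`, `nullSafeEq`, and `whenOtherwise` are hypothetical helpers that imitate the behaviour, not Spark APIs.

```scala
object ThreeValuedLogic {
  // Models SQL `=` (Spark's ===): the result is unknown (None) if either side is null.
  def sqlEq[A](a: Option[A], b: Option[A]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Models SQL `<=>`: never unknown; two nulls compare as equal.
  def nullSafeEq[A](a: Option[A], b: Option[A]): Boolean = a == b

  // Models when(cond, t).otherwise(f): only a definite true selects t.
  def whenOtherwise[T](cond: Option[Boolean], t: T, f: T): T =
    if (cond.contains(true)) t else f

  def main(args: Array[String]): Unit = {
    val city1: Option[String] = None
    val city2: Option[String] = None
    // Unknown condition, so the otherwise branch wins: prints "Not Varified"
    println(whenOtherwise(sqlEq(city1, city2), "Varified", "Not Varified"))
    // Null-safe comparison is a definite true: prints "Varified"
    println(whenOtherwise(Some(nullSafeEq(city1, city2)), "Varified", "Not Varified"))
  }
}
```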

Using <=> and isnull():

    spark.sql("select null<=>null, isnull(null) = isnull(null)").show()
    //+---------------+---------------------------------+
    //|(NULL <=> NULL)|((NULL IS NULL) = (NULL IS NULL))|
    //+---------------+---------------------------------+
    //|           true|                             true|
    //+---------------+---------------------------------+
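
One caveat about the second expression: `isnull(a) = isnull(b)` only checks that both sides have the same nullness, so it is also true for two different non-null values and is not, on its own, a substitute for <=>. A quick check, assuming a local SparkSession (the object name and column aliases are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object IsNullPitfall {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("isnull-pitfall").getOrCreate()

    // 'a' and 'b' differ, so <=> is false, but neither is null, so the isnull flags agree.
    spark.sql(
      "SELECT 'a' <=> 'b' AS null_safe_eq, isnull('a') = isnull('b') AS same_nullness"
    ).show()

    spark.stop()
  }
}
```

Here null_safe_eq is false while same_nullness is true, which is why the isNull check is best combined as "both sides are null" rather than "both sides have equal nullness".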

Example:

    // Treat a pair as matching when the values are equal or when both sides are null.
    df1.join(df2, df1("id") === df2("o_id")).
      withColumn("result", when(
        (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) &&
        (df1("lastname") === df2("o_lastname") || (df1("lastname").isNull && df2("o_lastname").isNull)) &&
        (df1("city") === df2("o_city") || (df1("city").isNull && df2("o_city").isNull)) &&
        (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
      show()

    // or using <=>
    df1.join(df2, df1("id") === df2("o_id")).
      withColumn("result", when(
        (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) &&
        (df1("lastname") <=> df2("o_lastname")) && (df1("city") <=> df2("o_city")) &&
        (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
      show()

    //+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    //| id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
    //+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    //|  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
    //|  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|    Varified|
    //|  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
    //|  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
    //+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+