PySpark: join should match when the second DataFrame has NULL values

Question (votes: 0, answers: 1)

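I have a requirement: when joining df1 to df2, a row should still match if the corresponding value in df2 is null. By default, Spark does not match rows where a join key is null.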

df1-

ID  Name    City    EMAIL
1   John    City A  [email protected]
2   Mist    City B  [email protected]
3   Danny   City C  [email protected]

df2-

ID  Name    City    EMAIL
1   John    City A  [email protected]
2   null    City B  [email protected]
3   Danny   City C  [email protected]

df3 = df1.join(df2, on=["ID", "NAME", "CITY"])
display(df3)

Spark output -

ID  Name    City    EMAIL   EMAIL
1   John    City A  [email protected] [email protected]
3   Danny   City C  [email protected] [email protected]

Expected output -

ID  Name    City    EMAIL           EMAIL
1   John    City A  [email protected] [email protected]
2   Mist    City B  [email protected] [email protected]
3   Danny   City C  [email protected] [email protected]

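As shown above, since ID and CITY match and NAME is null in df2, the join should match and produce the expected result.

I also cannot drop NAME from the join columns. The join should still compare NAME; it is only when NAME is null in df2 that the row should match anyway.

Please help.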

apache-spark join pyspark databricks
1 answer

0 votes

Try this:

df3 = df1.join(
    df2, 
    on=[
        df1.ID == df2.ID, 
        df1.City == df2.City, 
        # Compare Name as usual, but accept the row when df2.Name is NULL
        (df1.Name == df2.Name) | df2.Name.isNull()
    ]
).select(df1.ID, df1.Name, df1.City, df1.EMAIL, df2.EMAIL)
df3.show()

Output:

+---+-----+------+-----------+-----------+
| ID| Name|  City|      EMAIL|      EMAIL|
+---+-----+------+-----------+-----------+
|  1| John|City A|[email protected]|[email protected]|
|  2| Mist|City B|[email protected]|[email protected]|
|  3|Danny|City C|[email protected]|[email protected]|
+---+-----+------+-----------+-----------+