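I have a requirement that a join should still match when the corresponding row in df2 has a null value. By default, Spark does not match rows that contain nulls in the join columns.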
df1-
ID Name City EMAIL
1 John City A [email protected]
2 Mist City B [email protected]
3 Danny City C [email protected]
df2-
ID Name City EMAIL
1 John City A [email protected]
2 null City B [email protected]
3 Danny City C [email protected]
df3 = df1.join(df2, on=["ID","NAME","CITY"])
display(df3)
Spark output -
ID Name City EMAIL EMAIL
1 John City A [email protected] [email protected]
3 Danny City C [email protected] [email protected]
Expected output -
ID Name City EMAIL EMAIL
1 John City A [email protected] [email protected]
2 Mist City B [email protected] [email protected]
3 Danny City C [email protected] [email protected]
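As shown above, since ID and CITY match and NAME is null in df2, the join should match and give the expected result.
Also, I cannot drop NAME from the join columns. The join should still compare NAME; it is only when NAME is null that the row should match anyway.
Please help.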
Try this:
df3 = df1.join(
    df2,
    on=[
        df1.ID == df2.ID,
        df1.City == df2.City,
        # Names must be equal, unless df2.Name is null, in which case the row is accepted
        (df1.Name == df2.Name) | df2.Name.isNull()
    ]
).select(df1.ID, df1.Name, df1.City, df1.EMAIL, df2.EMAIL)
df3.show()
Output:
+---+-----+------+-----------+-----------+
| ID| Name| City| EMAIL| EMAIL|
+---+-----+------+-----------+-----------+
| 1| John|City A|[email protected]|[email protected]|
| 2| Mist|City B|[email protected]|[email protected]|
| 3|Danny|City C|[email protected]|[email protected]|
+---+-----+------+-----------+-----------+
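If it helps to see why the plain join drops row 2 and the extra isNull() check brings it back, here is a rough plain-Python model of SQL's three-valued logic. It is only an illustration and not Spark code: None stands in for NULL, the EMAIL column is left out, and the helper names (sql_eq, sql_or, sql_and, inner_join_ids) are made up for this sketch. An inner join keeps a pair of rows only when the condition is TRUE, and comparing anything with NULL gives NULL rather than TRUE.

```python
# Plain-Python model of SQL three-valued logic; None stands in for NULL.
# Illustration only: these helpers are hypothetical, not part of Spark.

def sql_eq(a, b):
    """SQL '=': the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def sql_or(a, b):
    """SQL 'OR': TRUE if either side is TRUE, else NULL if either is NULL."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def sql_and(*terms):
    """SQL 'AND': FALSE if any term is FALSE, else NULL if any is NULL."""
    if any(t is False for t in terms):
        return False
    if any(t is None for t in terms):
        return None
    return True

# (ID, Name, City) rows from the question
df1_rows = [(1, "John", "City A"), (2, "Mist", "City B"), (3, "Danny", "City C")]
df2_rows = [(1, "John", "City A"), (2, None, "City B"), (3, "Danny", "City C")]

def strict(l, r):
    # Models on=["ID", "NAME", "CITY"]: all three columns must be equal
    return sql_and(sql_eq(l[0], r[0]), sql_eq(l[1], r[1]), sql_eq(l[2], r[2]))

def null_tolerant(l, r):
    # Models (df1.Name == df2.Name) | df2.Name.isNull()
    name_ok = sql_or(sql_eq(l[1], r[1]), r[1] is None)
    return sql_and(sql_eq(l[0], r[0]), sql_eq(l[2], r[2]), name_ok)

def inner_join_ids(cond):
    # An inner join keeps a pair only when the condition is exactly TRUE
    return [l[0] for l in df1_rows for r in df2_rows if cond(l, r) is True]

strict_ids = inner_join_ids(strict)
tolerant_ids = inner_join_ids(null_tolerant)
print(strict_ids)
print(tolerant_ids)
```

For row 2 the strict condition evaluates to NULL (because "Mist" = NULL is NULL), so the pair is dropped. With the OR branch, NULL OR TRUE is TRUE, so the pair is kept while ID and City are still compared normally.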