PySpark: How to append new columns from another PySpark dataframe based on multiple conditions?


I have a PySpark dataframe df1:

| id | name  | email             | age | college |
|----|-------|-------------------|-----|---------|
| 12 | Sta   | [email protected]   | 25  | clg1    |
| 21 | Danny | [email protected]  | 23  | clg2    |
| 37 | Elle  | [email protected]  | 27  | clg3    |
| 40 | Mark  | [email protected] | 40  | clg4    |
| 36 | John  | [email protected]   | 32  | clg5    |

And a PySpark dataframe df2:

| id | name   | age |
|----|--------|-----|
| 36 | Sta    | 30  |
| 12 | raj    | 25  |
| 29 | jack   | 33  |
| 87 | Mark   | 67  |
| 75 | Alle   | 23  |
| 89 | Jalley | 32  |
| 55 | kale   | 99  |

Now I want to join df2 with df1 so that the email and college columns are appended to df2, under the following conditions:

Match on df1 id equals df2 id; otherwise on df1 name equals df2 name; otherwise on df1 age equals df2 age; if nothing matches, fill NULL.

In other words, if the first condition matches, the row should not be matched against the other conditions; if the first condition does not match, the subsequent conditions should be tried in order; and if none of them match, the new columns should be filled with NULL.

For example, df2 should become:

| id | name   | age | email              | college |
|----|--------|-----|--------------------|---------|
| 36 | Sta    | 30  | [email protected]    | clg5    |
| 12 | raj    | 25  | [email protected]    | clg1    |
| 29 | jack   | 33  | NULL               | NULL    |
| 87 | Mark   | 67  | [email protected]  | clg4    |
| 75 | Alle   | 23  | [email protected]   | clg2    |
| 89 | Jalley | 32  | [email protected]    | clg5    |
| 55 | kale   | 99  | NULL               | NULL    |

I have tried many of the built-in join variants but could not get this behavior; I also tried writing a UDF, but it was very inefficient.

Also, the data is too large to apply any UDF to. This runs on a Spark 3.x cluster.

dataframe dictionary apache-spark pyspark rdd