Full example here.
I am seeing two different outputs for a full outer join on two dataframes in PySpark, depending on which of these two methods I use:
users1_df. \
join(users2_df, users1_df.email == users2_df.email, 'full_outer'). \
show()
This gives:
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
| email|first_name|last_name|gender| ip_address| email|first_name| last_name|gender| ip_address|
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
| [email protected]| Aundrea| Lovett| Male| 62.72.1.143| null| null| null| null| null|
|[email protected]| Bettine| Jowling|Female|26.250.197.47|[email protected]| Putnam|Alfonsetti|Female| 167.97.48.246|
| null| null| null| null| null| [email protected]| Lilas| Butland|Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
Note that the email column is duplicated, and that emails which are not present in both dataframes come back with null values.
Now, using the following code:
users1_df. \
join(users2_df, 'email', 'full_outer'). \
show()
I get this instead:
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
| email|first_name|last_name|gender| ip_address|first_name| last_name|gender| ip_address|
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
| [email protected]| Aundrea| Lovett| Male| 62.72.1.143| null| null| null| null|
|[email protected]| Bettine| Jowling|Female|26.250.197.47| Putnam|Alfonsetti|Female| 167.97.48.246|
| [email protected]| null| null| null| null| Lilas| Butland|Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
Note that here the email column is not duplicated and contains no null values.
Am I missing something? Where is this behavior mentioned in the documentation for PySpark's join?
I had to do some digging to understand the gap between the official documentation and the code. Here are my findings:
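For reference, here is a minimal sketch that rebuilds two dataframes with the same overlap pattern as yours; the real email values are redacted in the question, so the addresses below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ["email", "first_name", "last_name", "gender", "ip_address"]
# Row 1 exists only in users1_df, row 2 exists in both (with different
# non-key values), row 3 exists only in users2_df. Placeholder emails.
users1_df = spark.createDataFrame([
    ("aundrea@example.com", "Aundrea", "Lovett", "Male", "62.72.1.143"),
    ("bettine@example.com", "Bettine", "Jowling", "Female", "26.250.197.47"),
], cols)
users2_df = spark.createDataFrame([
    ("bettine@example.com", "Putnam", "Alfonsetti", "Female", "167.97.48.246"),
    ("lilas@example.com", "Lilas", "Butland", "Female", "109.110.131.151"),
], cols)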
You have already shown the two different ways of calling the join method:

1. Explicit join condition:
users1_df. \
join(users2_df, users1_df.email == users2_df.email, 'full_outer'). \
show()
Here the backend code carries the following information in the function doc:
When the join condition is stated explicitly, as df.name == df2.name, this produces all records where the names match, as well as the records that don't match (since it is an outer join). If there are names in df2 that are not present in df, they appear with NULL in the name column of df, and vice versa.

This means that when you show the result of joining df and df2 this way, we expect to see all of the columns, including the duplicated join column email, which yields the following output:
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
|email |first_name|last_name|gender|ip_address |email |first_name|last_name |gender|ip_address |
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
|[email protected] |Aundrea |Lovett |Male |62.72.1.143 |null |null |null |null |null |
|[email protected]|Bettine |Jowling |Female|26.250.197.47|[email protected]|Putnam |Alfonsetti|Female|167.97.48.246 |
|null |null |null |null |null |[email protected] |Lilas |Butland |Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
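A side effect of the duplicated column is worth flagging: after the explicit join you can no longer refer to email by bare name, since the reference is ambiguous (a small sketch; the exact exception message varies by Spark version):

joined = users1_df.join(users2_df, users1_df.email == users2_df.email, 'full_outer')
# joined.select('email')  # raises AnalysisException: reference 'email' is ambiguous
joined.select(users1_df.email).show()  # disambiguate via the parent dataframe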
2. Implicit syntax of the join method:

users1_df. \
join(users2_df, 'email', 'full_outer'). \
show()

The corresponding GitHub function doc says that when you provide the column name directly as the join condition, Spark treats both name columns as one and does not generate separate columns for df.name and df2.name. This avoids having duplicate columns in the output.
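One way to peek at what Spark decides internally here is to look at the query plan: for the implicit form, the plan typically contains a Project that coalesces the two key columns into one (a sketch; the exact plan text depends on the Spark version):

# The extended plan for the implicit join usually shows something like
# Project [coalesce(email#.., email#..) AS email#..], i.e. Spark builds
# the single key column itself.
users1_df.join(users2_df, 'email', 'full_outer').explain(True)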
This means the result of the explicit join condition is equal to the following implicit call. Here I invoke the implicit syntax, but put aliases (u1 and u2) on the two dataframes users1_df and users2_df respectively, and then select all of the columns explicitly; the output is similar to the one produced by the explicit join condition:
(users1_df.alias("u1")
.join(users2_df.alias("u2"), 'email', 'full_outer')
.selectExpr("u1.email", "u1.first_name", "u1.last_name", "u1.gender", "u1.ip_address",
"u2.email", "u2.first_name", "u2.last_name", "u2.gender", "u2.ip_address")
.show(1000, False))
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
|email |first_name|last_name|gender|ip_address |email |first_name|last_name |gender|ip_address |
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
|[email protected] |Aundrea |Lovett |Male |62.72.1.143 |null |null |null |null |null |
|[email protected]|Bettine |Jowling |Female|26.250.197.47|[email protected]|Putnam |Alfonsetti|Female|167.97.48.246 |
|null |null |null |null |null |[email protected] |Lilas |Butland |Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
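The aliases also stay usable for filtering, which makes it easy to pull out the rows that matched on only one side (a sketch built on the same aliased join):

(users1_df.alias("u1")
 .join(users2_df.alias("u2"), 'email', 'full_outer')
 .where("u1.email IS NULL OR u2.email IS NULL")  # rows unmatched on either side
 .show(1000, False))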
Interestingly, if we apply coalesce to the join column email, we get an output with only one email column:
(users1_df.alias("u1")
.join(users2_df.alias("u2"), 'email', 'full_outer')
.selectExpr("coalesce(u1.email, u2.email) as email", "u1.first_name", "u1.last_name", "u1.gender",
"u1.ip_address", "u2.first_name", "u2.last_name", "u2.gender", "u2.ip_address")
.show(1000, False))
This produces the following output:
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
|email |first_name|last_name|gender|ip_address |first_name|last_name |gender|ip_address |
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
|[email protected] |Aundrea |Lovett |Male |62.72.1.143 |null |null |null |null |
|[email protected]|Bettine |Jowling |Female|26.250.197.47|Putnam |Alfonsetti|Female|167.97.48.246 |
|[email protected] |null |null |null |null |Lilas |Butland |Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
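As an aside, the same projection can be written with the DataFrame API instead of selectExpr, using pyspark.sql.functions.coalesce (an equivalent sketch):

from pyspark.sql import functions as F

(users1_df.alias("u1")
 .join(users2_df.alias("u2"), 'email', 'full_outer')
 .select(F.coalesce(F.col("u1.email"), F.col("u2.email")).alias("email"),
         F.col("u1.first_name"), F.col("u1.last_name"), F.col("u1.gender"), F.col("u1.ip_address"),
         F.col("u2.first_name"), F.col("u2.last_name"), F.col("u2.gender"), F.col("u2.ip_address"))
 .show(1000, False))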
The coalesce call produces a single email column, in line with the documentation in GitHub, and simply avoids the duplicate column in the output. This was an experiment of mine to explain how PySpark internally decides how to produce the output during an implicit join; if I get hold of the method that actually does this, I will update this answer accordingly.