Full example here.
I am seeing two different outputs for a full outer join on two dataframes in PySpark, depending on which of these two methods I use:
users1_df. \
join(users2_df, users1_df.email == users2_df.email, 'full_outer'). \
show()
This gives:
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
| email|first_name|last_name|gender| ip_address| email|first_name| last_name|gender| ip_address|
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
| [email protected]| Aundrea| Lovett| Male| 62.72.1.143| null| null| null| null| null|
|[email protected]| Bettine| Jowling|Female|26.250.197.47|[email protected]| Putnam|Alfonsetti|Female| 167.97.48.246|
| null| null| null| null| null| [email protected]| Lilas| Butland|Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
Note that the email column is duplicated, and that emails which are not present in both dataframes come back with null values.
Now, using the following code:
users1_df. \
join(users2_df, 'email', 'full_outer'). \
show()
I get this instead:
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
| email|first_name|last_name|gender| ip_address|first_name| last_name|gender| ip_address|
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
| [email protected]| Aundrea| Lovett| Male| 62.72.1.143| null| null| null| null|
|[email protected]| Bettine| Jowling|Female|26.250.197.47| Putnam|Alfonsetti|Female| 167.97.48.246|
| [email protected]| null| null| null| null| Lilas| Butland|Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
Note that here the email column is not duplicated and contains no null values.
Am I missing something? Where is this behavior mentioned in the documentation for PySpark's join?
I had to do some digging to understand the gap between the official documentation and the code. Here are my findings:
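For reference, here is a minimal sketch that rebuilds two dataframes with the same overlap pattern as yours; the real email values are redacted in the question, so the addresses below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ["email", "first_name", "last_name", "gender", "ip_address"]
# Row 1 exists only in users1_df, row 2 exists in both (with different
# non-key values), row 3 exists only in users2_df. Placeholder emails.
users1_df = spark.createDataFrame([
    ("aundrea@example.com", "Aundrea", "Lovett", "Male", "62.72.1.143"),
    ("bettine@example.com", "Bettine", "Jowling", "Female", "26.250.197.47"),
], cols)
users2_df = spark.createDataFrame([
    ("bettine@example.com", "Putnam", "Alfonsetti", "Female", "167.97.48.246"),
    ("lilas@example.com", "Lilas", "Butland", "Female", "109.110.131.151"),
], cols)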
You have already shown the two different ways of calling the join method:

1. Explicit join condition:
users1_df. \
join(users2_df, users1_df.email == users2_df.email, 'full_outer'). \
show()
Here the backend code carries the following information in the function doc:
When the join condition is stated explicitly, as df.name == df2.name, this produces all records where the names match, as well as the records that don't match (since it is an outer join). If there are names in df2 that are not present in df, they appear with NULL in the name column of df, and vice versa.

This means that when you show the result of joining df and df2 this way, we expect to see all of the columns, including the duplicated join column email, which yields the following output:
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
|email |first_name|last_name|gender|ip_address |email |first_name|last_name |gender|ip_address |
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
|[email protected] |Aundrea |Lovett |Male |62.72.1.143 |null |null |null |null |null |
|[email protected]|Bettine |Jowling |Female|26.250.197.47|[email protected]|Putnam |Alfonsetti|Female|167.97.48.246 |
|null |null |null |null |null |[email protected] |Lilas |Butland |Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
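A side effect of the duplicated column is worth flagging: after the explicit join you can no longer refer to email by bare name, since the reference is ambiguous (a small sketch; the exact exception message varies by Spark version):

joined = users1_df.join(users2_df, users1_df.email == users2_df.email, 'full_outer')
# joined.select('email')  # raises AnalysisException: reference 'email' is ambiguous
joined.select(users1_df.email).show()  # disambiguate via the parent dataframe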
2. Implicit syntax of the join method:

users1_df. \
join(users2_df, 'email', 'full_outer'). \
show()

The corresponding GitHub function doc says that when you provide the column name directly as the join condition, Spark treats both name columns as one and does not generate separate columns for df.name and df2.name. This avoids having duplicate columns in the output.
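One way to peek at what Spark decides internally here is to look at the query plan: for the implicit form, the plan typically contains a Project that coalesces the two key columns into one (a sketch; the exact plan text depends on the Spark version):

# The extended plan for the implicit join usually shows something like
# Project [coalesce(email#.., email#..) AS email#..], i.e. Spark builds
# the single key column itself.
users1_df.join(users2_df, 'email', 'full_outer').explain(True)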
This means the result of the explicit join condition is equal to the following implicit call. Here I invoke the implicit syntax, but put aliases (u1 and u2) on the two dataframes users1_df and users2_df respectively, and then select all of the columns explicitly; the output is similar to the one produced by the explicit join condition:
(users1_df.alias("u1")
.join(users2_df.alias("u2"), 'email', 'full_outer')
.selectExpr("u1.email", "u1.first_name", "u1.last_name", "u1.gender", "u1.ip_address",
"u2.email", "u2.first_name", "u2.last_name", "u2.gender", "u2.ip_address")
.show(1000, False))
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
|email |first_name|last_name|gender|ip_address |email |first_name|last_name |gender|ip_address |
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
|[email protected] |Aundrea |Lovett |Male |62.72.1.143 |null |null |null |null |null |
|[email protected]|Bettine |Jowling |Female|26.250.197.47|[email protected]|Putnam |Alfonsetti|Female|167.97.48.246 |
|null |null |null |null |null |[email protected] |Lilas |Butland |Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+--------------------+----------+----------+------+---------------+
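The aliases also stay usable for filtering, which makes it easy to pull out the rows that matched on only one side (a sketch built on the same aliased join):

(users1_df.alias("u1")
 .join(users2_df.alias("u2"), 'email', 'full_outer')
 .where("u1.email IS NULL OR u2.email IS NULL")  # rows unmatched on either side
 .show(1000, False))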
Interestingly, if we apply coalesce to the join column email, we get an output with only one email column:
(users1_df.alias("u1")
.join(users2_df.alias("u2"), 'email', 'full_outer')
.selectExpr("coalesce(u1.email, u2.email) as email", "u1.first_name", "u1.last_name", "u1.gender",
"u1.ip_address", "u2.first_name", "u2.last_name", "u2.gender", "u2.ip_address")
.show(1000, False))
This produces the following output:
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
|email |first_name|last_name|gender|ip_address |first_name|last_name |gender|ip_address |
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
|[email protected] |Aundrea |Lovett |Male |62.72.1.143 |null |null |null |null |
|[email protected]|Bettine |Jowling |Female|26.250.197.47|Putnam |Alfonsetti|Female|167.97.48.246 |
|[email protected] |null |null |null |null |Lilas |Butland |Female|109.110.131.151|
+--------------------+----------+---------+------+-------------+----------+----------+------+---------------+
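As an aside, the same projection can be written with the DataFrame API instead of selectExpr, using pyspark.sql.functions.coalesce (an equivalent sketch):

from pyspark.sql import functions as F

(users1_df.alias("u1")
 .join(users2_df.alias("u2"), 'email', 'full_outer')
 .select(F.coalesce(F.col("u1.email"), F.col("u2.email")).alias("email"),
         F.col("u1.first_name"), F.col("u1.last_name"), F.col("u1.gender"), F.col("u1.ip_address"),
         F.col("u2.first_name"), F.col("u2.last_name"), F.col("u2.gender"), F.col("u2.ip_address"))
 .show(1000, False))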
The coalesce call produces a single email column, in line with the documentation in GitHub, and simply avoids the duplicate column in the output. This was an experiment of mine to explain how PySpark internally decides how to produce the output during an implicit join; if I get hold of the method that actually does this, I will update this answer accordingly.