I am trying to get the CROSS JOIN of two DataFrames. I am using Spark 2.0. How can I implement a CROSS JOIN with two DataFrames?
Edit:
val df=df.join(df_t1, df("Col1")===df_t1("col")).join(df2,joinType=="cross join").where(df("col2")===df2("col2"))
Upgrade to spark-sql_2.11 version 2.1.0 and use the Dataset's .crossJoin function.
If you do not need to specify a join condition, use crossJoin.
Here is an excerpt of working code:
people.crossJoin(area).show()
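For readers who want to run the excerpt above on its own, here is a minimal sketch. It assumes Spark 2.1.0 or later (where `Dataset.crossJoin` is available) and a local master; the object name, the app name and the reduced three-row sample are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object CrossJoinSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .appName("cross-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative inputs; the mail column is left out here.
    val people = Seq((1, "Jack", 1), (2, "Valery", 1), (3, "Karl", 2))
      .toDF("id", "name", "idArea")
    val area = Seq((1, "Amministration"), (2, "Public"), (3, "Store"))
      .toDF("idArea", "areaName")

    // Explicit cartesian product: every row of people paired with every row of area.
    val cross = people.crossJoin(area)
    cross.show()

    // 3 people x 3 areas = 9 rows.
    println(cross.count())

    spark.stop()
  }
}
```

Using `crossJoin` makes the cartesian product explicit in the code, which is helpful because the result grows as the product of the two row counts.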
Call join on the other DataFrame without specifying a join condition.
Take a look at the following example. Given a first DataFrame of people:
+---+------+-------+------+
| id| name| mail|idArea|
+---+------+-------+------+
| 1| Jack|[email protected]| 1|
| 2|Valery|[email protected]| 1|
| 3| Karl|[email protected]| 2|
| 4| Nick|[email protected]| 2|
| 5| Luke|[email protected]| 3|
| 6| Marek|[email protected]| 3|
+---+------+-------+------+
And a second DataFrame of areas:
+------+--------------+
|idArea| areaName|
+------+--------------+
| 1|Amministration|
| 2| Public|
| 3| Store|
+------+--------------+
The cross join is obtained simply with:
val cross = people.join(area)
+---+------+-------+------+------+--------------+
| id| name| mail|idArea|idArea| areaName|
+---+------+-------+------+------+--------------+
| 1| Jack|[email protected]| 1| 1|Amministration|
| 1| Jack|[email protected]| 1| 3| Store|
| 1| Jack|[email protected]| 1| 2| Public|
| 2|Valery|[email protected]| 1| 1|Amministration|
| 2|Valery|[email protected]| 1| 3| Store|
| 2|Valery|[email protected]| 1| 2| Public|
| 3| Karl|[email protected]| 2| 1|Amministration|
| 3| Karl|[email protected]| 2| 2| Public|
| 3| Karl|[email protected]| 2| 3| Store|
| 4| Nick|[email protected]| 2| 3| Store|
| 4| Nick|[email protected]| 2| 2| Public|
| 4| Nick|[email protected]| 2| 1|Amministration|
| 5| Luke|[email protected]| 3| 2| Public|
| 5| Luke|[email protected]| 3| 3| Store|
| 5| Luke|[email protected]| 3| 1|Amministration|
| 6| Marek|[email protected]| 3| 1|Amministration|
| 6| Marek|[email protected]| 3| 2| Public|
| 6| Marek|[email protected]| 3| 3| Store|
+---+------+-------+------+------+--------------+
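The question's edit also filters the cross-joined result with a `where` clause. A rough sketch of that pattern with the condition-less `join` follows. It assumes a local session and sets `spark.sql.crossJoin.enabled`, because some Spark 2.x releases reject implicit cartesian products unless that flag is on; whether you need it depends on your version. The object name and the reduced sample data are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object ConditionlessJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("conditionless-join-sketch")
      .master("local[*]")
      // Assumption: needed on some 2.x releases for a join without a condition.
      .config("spark.sql.crossJoin.enabled", "true")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative inputs; the mail column is left out here.
    val people = Seq((1, "Jack", 1), (3, "Karl", 2), (5, "Luke", 3))
      .toDF("id", "name", "idArea")
    val area = Seq((1, "Amministration"), (2, "Public"), (3, "Store"))
      .toDF("idArea", "areaName")

    // Join without a condition: the full cartesian product.
    val cross = people.join(area)

    // Filter afterwards, referring to each side's column through its own DataFrame
    // because both sides have a column named idArea.
    val matched = cross.where(people("idArea") === area("idArea"))
    matched.show()

    spark.stop()
  }
}
```

When the filter is an equality between the two sides like this, the result is the same as an ordinary inner join on that column, so a plain `join` with a condition is usually the clearer choice; the cross join is mainly useful when every pairing is genuinely wanted.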