Problem overview:
Example:
Dataset 1:
id trans_id
1 a
1 b
1 c
2 c
2 d
2 e
2 f
Dataset 2:
id trans_id score
1 a 0.3
1 b 0.4
1 c 0.5
1 d 0.1
1 e 0.2
1 f 0.5
2 a 0.1
2 b 0.5
2 c 0.6
2 d 0.8
2 e 0.9
2 f 0.2
Final dataset:
id trans_id score
1 d 0.1
1 e 0.2
1 f 0.5
2 a 0.1
2 b 0.5
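I'm trying to do this in Scala (Python is my language of choice) and I'm a bit lost. If there were only one ID, I could use the isin function, but I'm not sure how to do this across all IDs.
Any help would be appreciated.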
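You can use a left_anti join, which keeps only the rows of df2 that have no match in df1 on (id, trans_id):
// toDF needs spark.implicits._ in scope (spark-shell imports it automatically)
val df1 = Seq(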
(1, "a"), (1, "b"), (1, "c"),
(2, "c"), (2, "d"), (2, "e"), (2, "f")
).toDF("id", "trans_id")
val df2 = Seq(
(1, "a", 0.3), (1, "b", 0.4), (1, "c", 0.5), (1, "d", 0.1), (1, "e", 0.2), (1, "f", 0.5),
(2, "a", 0.1), (2, "b", 0.5), (2, "c", 0.6), (2, "d", 0.8), (2, "e", 0.9), (2, "f", 0.2)
).toDF("id", "trans_id", "score")
df2.join(df1, Seq("id", "trans_id"), "left_anti").show
// +---+--------+-----+
// | id|trans_id|score|
// +---+--------+-----+
// |  1|       d|  0.1|
// |  1|       e|  0.2|
// |  1|       f|  0.5|
// |  2|       a|  0.1|
// |  2|       b|  0.5|
// +---+--------+-----+
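To make the semantics of the anti join concrete, here is a minimal sketch of the same filtering on plain Scala collections, without Spark. It is only an illustration of what left_anti does on the composite key (id, trans_id), not how Spark executes the join. The names AntiJoinSketch, Scored, excluded and kept are made up for this example.

```scala
// Plain-Scala illustration of left_anti semantics (no Spark required).
// All names here are hypothetical and chosen only for this sketch.
object AntiJoinSketch {
  final case class Scored(id: Int, transId: String, score: Double)

  // Dataset 1: the (id, trans_id) pairs to exclude.
  val ds1: Seq[(Int, String)] = Seq(
    (1, "a"), (1, "b"), (1, "c"),
    (2, "c"), (2, "d"), (2, "e"), (2, "f")
  )

  // Dataset 2: the scored rows to filter.
  val ds2: Seq[Scored] = Seq(
    Scored(1, "a", 0.3), Scored(1, "b", 0.4), Scored(1, "c", 0.5),
    Scored(1, "d", 0.1), Scored(1, "e", 0.2), Scored(1, "f", 0.5),
    Scored(2, "a", 0.1), Scored(2, "b", 0.5), Scored(2, "c", 0.6),
    Scored(2, "d", 0.8), Scored(2, "e", 0.9), Scored(2, "f", 0.2)
  )

  // Composite keys present in dataset 1.
  val excluded: Set[(Int, String)] = ds1.toSet

  // Keep only the dataset 2 rows whose (id, transId) pair is absent from dataset 1.
  val kept: Seq[Scored] =
    ds2.filterNot(row => excluded.contains((row.id, row.transId)))

  def main(args: Array[String]): Unit =
    kept.foreach(println)
}
```

The key point is that the match is on the pair, not on trans_id alone: "c" is dropped for both ids, while "d" is dropped only for id 2. That per-id behaviour is what a single isin filter on trans_id cannot express.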