Suppose I have a PySpark DataFrame:
X Y
1 a
1 b
1 c
2 b
2 a
2 c
3 a
3 c
3 b
4 p
I need to select any possible set of X and Y pairs such that no X value and no Y value is repeated in the result.
Possible output 1
X Y
1 b
2 c
3 a
4 p
Possible output 2
X Y
2 a
1 b
3 c
4 p
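To make the requirement concrete, here is a minimal plain-Python sketch that checks whether a candidate result satisfies it. It assumes the rule means that every X value and every Y value appears at most once; the helper name `is_valid_selection` is hypothetical and only for illustration.

```python
def is_valid_selection(pairs):
    """Return True if no X value and no Y value occurs more than once."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    # A value repeats exactly when the set is smaller than the list.
    return len(xs) == len(set(xs)) and len(ys) == len(set(ys))


# The two candidate outputs listed above.
output_1 = [(1, "b"), (2, "c"), (3, "a"), (4, "p")]
output_2 = [(2, "a"), (1, "b"), (3, "c"), (4, "p")]

print(is_valid_selection(output_1))
print(is_valid_selection(output_2))
```

Both outputs pass this check, while a selection such as (1, a), (2, a) would fail because Y = a repeats.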
How can I achieve this efficiently? The given DataFrame may be very large.
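Check the code below. It collects the distinct Y values into an array and keeps a row only when its Y sits at position X - 1 of that array. The sample data is extended with X = 5 and X = 6; X = 6 has no row matching that rule, so it is absent from the output.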
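-- Sample data is inlined with VALUES; replace it with the real table or view.
WITH in_cte AS (
  SELECT
    X,
    Y,
    -- Collect every Y over one unpartitioned window, de-duplicate, then keep
    -- the element equal to this row's Y only if its 0-based INDEX + 1 equals X.
    FILTER(
      ARRAY_DISTINCT(
        COLLECT_LIST(Y) OVER (ORDER BY 1)
      ),
      (ELEM, INDEX) -> ELEM == Y AND INDEX + 1 == X
    )[0] AS new_y
  FROM VALUES
    (1, "a"), (1, "b"), (1, "c"), (2, "b"), (2, "a"), (2, "c"),
    (3, "a"), (3, "c"), (3, "b"), (4, "p"), (5, "e"), (5, "f"),
    (6, "a"), (6, "b") AS (X, Y)
)
-- Keep only the rows for which the filter found a match.
SELECT X, Y
FROM in_cte
WHERE new_y IS NOT NULL;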
+---+---+
|X |Y |
+---+---+
|1 |a |
|2 |b |
|3 |c |
|4 |p |
|5 |e |
+---+---+
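The selection rule in the query can be traced without Spark. The following is a rough plain-Python sketch of the same logic, not a scalable replacement. It assumes the collected list keeps the input row order (which Spark does not guarantee in general) and that X is a 1-based integer; the function name `select_pairs` is hypothetical.

```python
# Same sample rows as the VALUES clause in the query.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "b"), (2, "a"), (2, "c"),
        (3, "a"), (3, "c"), (3, "b"), (4, "p"), (5, "e"), (5, "f"),
        (6, "a"), (6, "b")]


def select_pairs(rows):
    # Counterpart of ARRAY_DISTINCT(COLLECT_LIST(Y)): distinct Y values in
    # first-seen order (dict keys preserve insertion order).
    distinct_y = list(dict.fromkeys(y for _, y in rows))
    selected = []
    for x, y in rows:
        # Counterpart of the FILTER lambda: keep the row when its Y sits at
        # 0-based position X - 1 of the distinct list.
        idx = x - 1
        if 0 <= idx < len(distinct_y) and distinct_y[idx] == y:
            selected.append((x, y))
    return selected


result = select_pairs(rows)
print(result)
```

Under these assumptions the sketch selects the same five pairs as the table above. It also shows why the result depends on how X values line up with positions in the distinct Y list: an X whose position holds a Y it is never paired with, such as X = 6 here, is dropped.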