Select unique pairs from a PySpark DataFrame


Suppose I have a PySpark DataFrame:

X Y
1 a
1 b
1 c
2 b
2 a
2 c
3 a
3 c
3 b
4 p

I need to select X/Y pairs such that no X value and no Y value appears more than once in the result. Any valid selection is acceptable.

Possible output 1

X Y
1 b
2 c
3 a
4 p

Possible output 2

X Y
2 a
1 b
3 c
4 p

How can I do this efficiently? The given DataFrame may be very large.

python apache-spark pyspark apache-spark-sql data-analysis
1 Answer

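Try the Spark SQL query below. It pairs each X with the Y found at position X in the deduplicated list of all Y values, so it assumes X runs over consecutive integers starting at 1, and it drops any X whose positional Y does not occur with it (X = 6 in the sample). The inline data extends the question's table with rows for X = 5 and X = 6.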

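WITH in_cte AS (
    SELECT
        X,
        Y,
        -- Collect every Y into one array, deduplicate it, then keep the
        -- element only if it equals this row's Y and sits at index X - 1.
        FILTER(
            ARRAY_DISTINCT(
                COLLECT_LIST(Y) OVER (ORDER BY 1)
            ),
            (ELEM, INDEX) -> ELEM == Y AND INDEX + 1 == X
        )[0] AS new_y
    FROM VALUES (1,"a"), (1,"b"), (1,"c"), (2,"b"), (2,"a"), (2,"c"), (3,"a"), (3,"c"), (3,"b"), (4,"p"), (5,"e"), (5,"f"), (6,"a"), (6,"b") AS (X, Y)
)
-- new_y is NULL for rows that did not match, so drop them.
SELECT X, Y FROM in_cte
WHERE new_y IS NOT NULL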

+---+---+
|X  |Y  |
+---+---+
|1  |a  |
|2  |b  |
|3  |c  |
|4  |p  |
|5  |e  |
+---+---+
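
To make the selection rule easier to follow, here is a small plain-Python sketch that runs on an in-memory list of rows rather than on Spark. `pick_by_position` is a hypothetical helper that mirrors what the query appears to do, assuming the Y values are collected in input order (Spark does not guarantee that order). `pick_greedy` is a different, simpler technique (a single greedy pass, not the query's method) that does not require X to be consecutive integers. Neither is a distributed or guaranteed-maximal solution; they are only meant for checking the logic on a small sample.

```python
def pick_by_position(rows):
    """Keep (x, y) when y is the x-th distinct Y value (1-based)."""
    # dict.fromkeys deduplicates while preserving first-seen order.
    distinct_y = list(dict.fromkeys(y for _, y in rows))
    position = {y: i + 1 for i, y in enumerate(distinct_y)}
    return [(x, y) for x, y in rows if position[y] == x]


def pick_greedy(rows):
    """Keep a pair only if neither its X nor its Y has been used yet."""
    seen_x, seen_y, result = set(), set(), []
    for x, y in rows:
        if x not in seen_x and y not in seen_y:
            seen_x.add(x)
            seen_y.add(y)
            result.append((x, y))
    return result


# Rows from the query's VALUES clause; the first ten are the question's table.
answer_rows = [
    (1, "a"), (1, "b"), (1, "c"), (2, "b"), (2, "a"), (2, "c"),
    (3, "a"), (3, "c"), (3, "b"), (4, "p"),
    (5, "e"), (5, "f"), (6, "a"), (6, "b"),
]
question_rows = answer_rows[:10]

print(pick_by_position(answer_rows))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'p'), (5, 'e')]
print(pick_greedy(question_rows))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'p')]
```

The greedy pass depends on row order and can leave an X unmatched even when a different choice would have covered it, so treat it as a starting point rather than an optimal matching.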