How to group values based on matching values between 2 columns using PySpark or SQL

Question (votes: 0, answers: 1)

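Suppose we have the DataFrame below. Looking at the data, customer codes 4-1234 and 4-1235 are both associated with the TAN MUMS12345A. The reverse also holds: looking at the tanlist column, the TAN MUMS12345A is linked to both of those codes.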

customer_code  tanlist
4-1234         MUMS12345A,BLRS12345E,BLRS12345G
4-1235         MUMS12345A,CHED12345A
4-1236         RTKD12345A
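
The two-way relationship described above can be made concrete by inverting the table, so that each TAN points at the customer codes that list it. The following is a minimal plain-Python sketch rather than Spark code; the function name `codes_by_tan` is hypothetical and exists only for this illustration.

```python
from collections import defaultdict

# Sample rows copied from the table above: (customer_code, tanlist).
rows = [
    ("4-1234", "MUMS12345A,BLRS12345E,BLRS12345G"),
    ("4-1235", "MUMS12345A,CHED12345A"),
    ("4-1236", "RTKD12345A"),
]


def codes_by_tan(rows):
    """Invert the table: map each TAN to the customer codes that list it."""
    index = defaultdict(list)
    for code, tanlist in rows:
        for tan in tanlist.split(","):
            index[tan.strip()].append(code)
    return dict(index)


tan_index = codes_by_tan(rows)
print(tan_index["MUMS12345A"])  # ['4-1234', '4-1235']
```

Any TAN that maps to more than one customer code is a link that should pull those codes into the same group.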

I need PySpark or SQL code that produces the desired output, which is:

customer_code_list  tan_list
4-1234, 4-1235      MUMS12345A,BLRS12345E,BLRS12345G,CHED12345A
4-1236              RTKD12345A

I have provided the input and the expected result.

apache-spark pyspark apache-spark-sql
1 Answer
0 votes

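Check the code below. It collects every (customer_code, tanlist) pair into a single map and then, for each row, keeps only the entries whose TAN list overlaps that row's list.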

WITH in_cte AS (
    SELECT
        -- For the current row, keep only the map entries whose TAN list
        -- equals or overlaps this row's tanlist.
        MAP_FILTER(
            -- Fold every (customer_code -> tanlist) pair into one map.
            AGGREGATE(
                COLLECT_LIST(MAP(customer_code, tanlist)) OVER (ORDER BY 1),
                MAP('', ''),  -- seed map for the fold
                (acc, e) -> MAP_CONCAT(acc, e)
            ),
            (k, v) -> v == tanlist OR ARRAYS_OVERLAP(SPLIT(v, ','), SPLIT(tanlist, ','))
        ) AS output
    FROM VALUES
        ("4-1234", "MUMS12345A,BLRS12345E,BLRS12345G"),
        ("4-1235", "MUMS12345A,CHED12345A"),
        ("4-1236", "RTKD12345A")
    AS (customer_code, tanlist)
)
-- Rows that yield the same filtered map collapse into one via DISTINCT.
SELECT DISTINCT
    CONCAT_WS(',', MAP_KEYS(output)) AS customer_code,
    CONCAT_WS(',', MAP_VALUES(output)) AS tanlist
FROM in_cte;

Output:

+-------------+------------------------------------------------------+
|customer_code|tanlist                                               |
+-------------+------------------------------------------------------+
|4-1234,4-1235|MUMS12345A,BLRS12345E,BLRS12345G,MUMS12345A,CHED12345A|
|4-1236       |RTKD12345A                                            |
+-------------+------------------------------------------------------+
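
Note that the result above lists MUMS12345A twice, whereas the desired output lists each TAN once. The query also compares each row only with rows it overlaps directly, so codes linked through a chain (A shares a TAN with B, B shares a different TAN with C) may not end up in one group. The grouping the question describes can be read as finding connected components. The following is a minimal plain-Python sketch of that logic using union-find, assuming chained links should be merged; it is not Spark code, and the name `group_by_shared_tan` is hypothetical.

```python
rows = [
    ("4-1234", "MUMS12345A,BLRS12345E,BLRS12345G"),
    ("4-1235", "MUMS12345A,CHED12345A"),
    ("4-1236", "RTKD12345A"),
]


def group_by_shared_tan(rows):
    """Merge customer codes that share a TAN, directly or through a chain."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every code to the first code seen with the same TAN.
    first_code_for_tan = {}
    for code, tanlist in rows:
        find(code)
        for tan in tanlist.split(","):
            tan = tan.strip()
            if tan in first_code_for_tan:
                union(first_code_for_tan[tan], code)
            else:
                first_code_for_tan[tan] = code

    # Collect codes and de-duplicated TANs per group, in input order.
    groups = {}
    for code, tanlist in rows:
        codes, tans = groups.setdefault(find(code), ([], []))
        if code not in codes:
            codes.append(code)
        for tan in tanlist.split(","):
            tan = tan.strip()
            if tan not in tans:
                tans.append(tan)
    return [(", ".join(c), ",".join(t)) for c, t in groups.values()]


result = group_by_shared_tan(rows)
for codes, tans in result:
    print(codes, "|", tans)
```

On the sample rows this yields one group for 4-1234 and 4-1235 with each TAN listed once, and a separate group for 4-1236. On a large Spark dataset the same idea would need a distributed connected-components step rather than this in-memory loop.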