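Suppose we have a DataFrame like the one below. Looking at the data, customer codes 4-1234 and 4-1235 are both associated with the TAN MUMS12345A. Reading it the other way, from the tanlist column, the TAN MUMS12345A is associated with both of these codes.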
customer_code | tanlist |
---|---|
4-1234 | MUMS12345A,BLRS12345E,BLRS12345G |
4-1235 | MUMS12345A,CHED12345A |
4-1236 | RTKD12345A |
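The relationship described above can be made concrete with a small pure-Python sketch that inverts the table, mapping each TAN to the customer codes that list it. The names `rows` and `codes_by_tan` are hypothetical and only for illustration; the data is copied from the table.

```python
from collections import defaultdict

# Sample rows copied from the table above: (customer_code, tanlist).
rows = [
    ("4-1234", "MUMS12345A,BLRS12345E,BLRS12345G"),
    ("4-1235", "MUMS12345A,CHED12345A"),
    ("4-1236", "RTKD12345A"),
]


def codes_by_tan(rows):
    """Invert the table: map each TAN to the customer codes that list it."""
    index = defaultdict(list)
    for code, tanlist in rows:
        for tan in tanlist.split(","):
            index[tan].append(code)
    return dict(index)


tan_index = codes_by_tan(rows)
print(tan_index["MUMS12345A"])  # ['4-1234', '4-1235']
```

Any TAN that maps to more than one customer code is what links those rows together.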
I need PySpark or SQL code that produces the desired output, shown here:
customer_code list | TAN list |
---|---|
4-1234, 4-1235 | MUMS12345A,BLRS12345E,BLRS12345G,CHED12345A |
4-1236 | RTKD12345A |
I have provided the input and the expected result.
Try the query below.
WITH in_cte AS (
    SELECT
        -- Keep only the map entries whose TAN list equals or overlaps this row's TAN list.
        MAP_FILTER(
            -- Fold every (customer_code -> tanlist) pair into a single map.
            AGGREGATE(
                COLLECT_LIST(MAP(customer_code, tanlist)) OVER (ORDER BY 1),
                MAP('', ''),
                (acc, e) -> MAP_CONCAT(acc, e)
            ),
            (k, v) -> v == tanlist OR ARRAYS_OVERLAP(SPLIT(v, ','), SPLIT(tanlist, ','))
        ) AS output
    FROM VALUES
        ("4-1234", "MUMS12345A,BLRS12345E,BLRS12345G"),
        ("4-1235", "MUMS12345A,CHED12345A"),
        ("4-1236", "RTKD12345A")
    AS (customer_code, tanlist)
)
-- Collapse each row's filtered map into comma-separated strings and drop duplicate rows.
SELECT DISTINCT
    CONCAT_WS(',', MAP_KEYS(output)) AS customer_code,
    CONCAT_WS(',', MAP_VALUES(output)) AS tanlist
FROM in_cte
Output:
+-------------+------------------------------------------------------+
|customer_code|tanlist                                               |
+-------------+------------------------------------------------------+
|4-1234,4-1235|MUMS12345A,BLRS12345E,BLRS12345G,MUMS12345A,CHED12345A|
|4-1236       |RTKD12345A                                            |
+-------------+------------------------------------------------------+
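Note that the tanlist in the first row of this output still contains MUMS12345A twice, whereas the desired output lists each TAN once. As a cross-check of the grouping logic, here is a minimal pure-Python sketch that uses a union-find structure instead of the SQL map functions. It rests on two assumptions the question does not state: that rows linked only through a chain of shared TANs should also be merged, and that the data fits in memory on one machine. The names `rows` and `group_by_shared_tan` are hypothetical.

```python
# Sample rows copied from the question: (customer_code, tanlist).
rows = [
    ("4-1234", "MUMS12345A,BLRS12345E,BLRS12345G"),
    ("4-1235", "MUMS12345A,CHED12345A"),
    ("4-1236", "RTKD12345A"),
]


def group_by_shared_tan(rows):
    """Merge rows that share a TAN, directly or through a chain of rows."""
    parent = {}  # union-find forest over customer codes

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link each customer code to the first code seen with the same TAN.
    first_code_for_tan = {}
    for code, tanlist in rows:
        parent.setdefault(code, code)
        for tan in tanlist.split(","):
            if tan in first_code_for_tan:
                union(first_code_for_tan[tan], code)
            else:
                first_code_for_tan[tan] = code

    # Collect codes and TANs per group, keeping first-seen order and
    # dropping duplicates.
    groups = {}
    for code, tanlist in rows:
        codes, tans = groups.setdefault(find(code), ([], []))
        if code not in codes:
            codes.append(code)
        for tan in tanlist.split(","):
            if tan not in tans:
                tans.append(tan)
    return [(",".join(codes), ",".join(tans)) for codes, tans in groups.values()]


result = group_by_shared_tan(rows)
for customer_codes, tans in result:
    print(customer_codes, "|", tans)
```

For the sample data this yields two groups, `4-1234,4-1235` with the four distinct TANs and `4-1236` with `RTKD12345A`. For data too large for a single machine, the same idea corresponds to a connected-components computation over a graph of customer codes and TANs.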