How to get combination results from a single history column in Snowflake SQL


I have a table of users' task history that looks like this. One user has 9 tasks.

user_id     task    IS_COMPLETED updated_at
123       Task 1    1            2024-01-01
123       Task 2    1            2024-01-01
123       Task 3    0            2024-01-01
123       Task 4    0            2024-01-01
123       Task 5    0            2024-01-01
123       Task 6    1            2024-01-01
123       Task 7    1            2024-01-01
123       Task 8    1            2024-01-01
123       Task 9    1            2024-01-01

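I want to group the tasks into their possible combinations and produce a table like the one below, so that I can see which task combinations users complete quickly.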

combination          total_user_completed
task1_task2          20
task1_task3          15
task1_task4          15
task1_task5          14
:                    : (and so on for combinations of 2)
task1_task2_task3    10
task4_task5_task6    11
:                    : (and so on for combinations of 3)

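(More detail on the combinations: from A, B, C I want the combinations A-B, B-C, A-C and A-B-C. Order does not matter.)

I tried a recursive SQL query, but it did not work well: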

"Recursive Join ran out of memory. Please re-run this query on a larger warehouse."

I then tried doing it with CASE WHEN, but that is not ideal because every combination has to be defined one by one, like this.

    with all_task as (
            select 
                distinct user_id,
                task_name as task, 
                is_completed,
                updated_at
            from task
            where 
                is_completed = 1
            group by all
          ),
        agg as (
            select 
                distinct user_id, 
                activated_at as period,
                max(case when task in ('Task 1') and IS_COMPLETED = 1 then date(updated_at) end) as task_1,
                max(case when task in ('Task 2') and IS_COMPLETED = 1 then date(updated_at) end) as task_2,
                max(case when task in ('Task 3') and IS_COMPLETED = 1 then date(updated_at) end) as task_3,
                max(case when task in ('Task 4') and IS_COMPLETED = 1 then date(updated_at) end) as task_4,
                max(case when task in ('Task 5') and IS_COMPLETED = 1 then date(updated_at) end) as task_5,
                max(case when task in ('Task 6') and IS_COMPLETED = 1 then date(updated_at) end) as task_6,
                max(case when task in ('Task 7') and IS_COMPLETED = 1 then date(updated_at) end) as task_7,
                max(case when task in ('Task 8') and IS_COMPLETED = 1 then date(updated_at) end) as task_8,
                max(case when task in ('Task 9') and IS_COMPLETED = 1 then date(updated_at) end) as task_9
            from all_task 
            group by all
            ),
            task_group as (
            select user_id, 
                case when task_1 is not null and task_2 is not null then user_id end as task_1_2,
                case when task_2 is not null and task_3 is not null then user_id end as task_2_3,
                .........
            from agg
             )
            select 'Task1_task2' as combination, 
            count(distinct task_1_2) as total
            from task_group

            union all 

            select 'task2_task3' as combination, 
            count(distinct task_2_3) as total
            from task_group
.... (and so on)

Any suggestions for solving this? Thanks a lot!

*Note that for now I need at least the combinations of 2 and 3 tasks. Thank you.

sql snowflake-cloud-data-platform logic method-combination
1 Answer

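For MS SQL Server 2016 (the approach should carry over to Snowflake, although T-SQL-specific pieces such as the #TempTable temporary table may need adapting):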

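I think I have a way to get what you want. It requires manually specifying the number of tasks per permutation, and it does not generate every possible permutation. It also builds a fairly large intermediate table that grows exponentially as the number of tasks increases.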
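To put rough numbers on that growth, here is a back-of-the-envelope count, assuming the 9 distinct tasks from the question. A three-way CROSS JOIN produces every ordered triple before filtering, and the c1.col < c2.col style filter in the query then keeps one row per unordered set:

```latex
9^3 = 729, \qquad
\binom{9}{2} = \frac{9 \cdot 8}{2} = 36, \qquad
\binom{9}{3} = \frac{9 \cdot 8 \cdot 7}{3!} = 84
```

In general the unfiltered join has n^k rows for n tasks and k joined copies, so each extra level multiplies the intermediate size by n.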

Here is a working solution for permutations of 3 from the task column. You can extend it to more by uncommenting and adding code.

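It works by taking the DISTINCT tasks and CROSS JOINing them with themselves, while checking that the same set of tasks does not appear more than once (the c1.col < c2.col filters). We then get the number of completed tasks and LEFT JOIN it to these permutations, to get the number of completed tasks for each permutation.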

    -- Create a temporary table
    CREATE TABLE #TempTable
    (
        user_id INT,
        task VARCHAR(50),
        IS_COMPLETED BIT,
        updated_at DATE
    );

    -- Insert data into the temporary table
    INSERT INTO #TempTable
    (
        user_id,
        task,
        IS_COMPLETED,
        updated_at
    )
    VALUES
    (123, 'Task 1', 1, '2024-01-01'),
    (123, 'Task 2', 1, '2024-01-01'),
    (123, 'Task 3', 0, '2024-01-01'),
    (123, 'Task 4', 0, '2024-01-01'),
    (123, 'Task 5', 0, '2024-01-01'),
    (123, 'Task 6', 1, '2024-01-01'),
    (123, 'Task 7', 1, '2024-01-01'),
    (123, 'Task 8', 1, '2024-01-01'),
    (123, 'Task 9', 1, '2024-01-01');

    ;with cteAllColumns
    as (select DISTINCT
            Task as col
        from #TempTable
       )
    -- See commented code for adding more permutations
    select c1.col as 'c1.task',
           c2.col as 'c2.task',
           c3.col as 'c3.task',
           --c4.col as 'c4.task',
           ISNULL(SUM(result.completed), 0) as 'combination completed'
    from cteAllColumns c1
        cross join cteAllColumns c2
        cross join cteAllColumns c3
        --cross join cteAllColumns c4
        LEFT JOIN
        (
            SELECT task,
                   COUNT(task) as 'completed'
            FROM #TempTable
            WHERE IS_COMPLETED = 1
            GROUP BY task
        ) result
            ON result.task = c1.col
               OR result.task = c2.col
               OR result.task = c3.col
               --OR result.task = c4.col
    where c1.col < c2.col
          AND c2.col < c3.col
          --AND c3.col < c4.col
    GROUP BY c1.col,
             c2.col,
             c3.col --,c4.col
    ORDER BY c1.col,
             c2.col,
             c3.col --,c4.col

    DROP TABLE #TempTable;
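The query above sums per-task completion counts for each permutation. If the goal is instead the number of users who completed every task in a combination, as in the question's desired output, one possible Snowflake-style sketch is a per-user self-join. It assumes the table and columns from the question's own query (task, user_id, task_name, is_completed); the CTE names done, pairs and triples and the alias combos are made up for illustration.

```sql
-- Sketch only: assumes the question's table `task` with columns
-- user_id, task_name and is_completed (names taken from the question's query).
with done as (
    -- one row per user and completed task
    select distinct user_id, task_name as task
    from task
    where is_completed = 1
),
pairs as (
    -- a.task < b.task keeps each unordered pair once per user
    select a.task || '_' || b.task as combination,
           a.user_id
    from done a
    join done b
      on b.user_id = a.user_id
     and a.task < b.task
),
triples as (
    -- same idea with a third copy: a.task < b.task < c.task
    select a.task || '_' || b.task || '_' || c.task as combination,
           a.user_id
    from done a
    join done b
      on b.user_id = a.user_id
     and a.task < b.task
    join done c
      on c.user_id = a.user_id
     and b.task < c.task
)
select combination,
       count(distinct user_id) as total_user_completed
from (
    select combination, user_id from pairs
    union all
    select combination, user_id from triples
) as combos
group by combination
order by combination;
```

Because the joins are restricted to one user's completed tasks, the intermediate result stays small: a user with six completed tasks, like user 123 in the sample data, contributes 15 pairs and 20 triples. Combinations that no user has completed do not appear in the output, and task names are compared as strings, so the order inside a label is alphabetical rather than numeric.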