Counting values that meet specific conditions within a sub-window in Presto SQL


Suppose I have a table with three columns: id, date_time, and color. The data looks like this:

id, date_time, color
1, 2023-10-01 12:15, green
1, 2023-10-01 12:16, yellow
1, 2023-10-01 12:17, yellow
1, 2023-10-01 12:18, red
1, 2023-10-01 12:19, yellow
1, 2023-10-01 12:20, yellow
1, 2023-10-01 12:21, red
1, 2023-10-01 12:22, red
1, 2023-10-01 12:23, green
1, 2023-10-01 12:24, yellow
1, 2023-10-01 12:25, yellow
1, 2023-10-01 12:26, red
2, 2023-10-01 12:27, red
2, 2023-10-01 12:28, green
2, 2023-10-01 12:29, green
2, 2023-10-01 12:30, yellow

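I need to count how often the value 'yellow' appears in the color column, grouped by id and ordered by date_time. However, I have specific conditions:

  1. I only want to count 'yellow' when it appears after 'green'.
  2. I only want to count 'yellow' if it is followed by the first 'red', or if it is the last value in the group defined by id.

It looks like a sub-window within a group.

I'm using Presto SQL in AWS Athena, and I believe I should use window functions, but I'm not sure how to express these conditions.

Thanks in advance for any tips.

So the expected result should be: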

For id=1: Count "yellow" = 4
For id=2: Count "yellow" = 1

I tried the query below, but it lacks a cumulative counter for the repeated 'yellow' rows inside the condition.

`with testdata(id, date_time, color) as (
  VALUES 
  (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
)
,t1 as (
  SELECT id,
    date_time,
    color,
    LAG(color) OVER (
      PARTITION BY id
      ORDER BY date_time
    ) AS prev_color,
    LEAD(color) OVER (
      PARTITION BY id
      ORDER BY date_time
    ) AS next_color
  FROM testdata
)
select id,
  SUM(
    CASE
      WHEN color = 'yellow'
      AND (
        prev_color = 'green'
        and (
          next_color IS NULL
          OR next_color = 'red'
          OR next_color = 'yellow'
        )
      ) THEN 1 ELSE 0
    END
  )  AS yellow_count
FROM t1
group by id`

I get these values:

  • For id=1: Count "yellow" = 2 (incorrect)
  • For id=2: Count "yellow" = 1 (correct)
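
To see why the per-row check undercounts, here is a minimal Python sketch of the same LAG/LEAD logic. It is only an illustration, not part of the original query: the names `colors_by_id` and `per_row_count` are made up, and the colors are assumed to be listed in date_time order. Only the first 'yellow' of each run has 'green' as its immediate predecessor, so every later 'yellow' in the run is skipped.

```python
# Colors from the sample data, per id, already in date_time order.
colors_by_id = {
    1: ["green", "yellow", "yellow", "red", "yellow", "yellow",
        "red", "red", "green", "yellow", "yellow", "red"],
    2: ["red", "green", "green", "yellow"],
}

def per_row_count(colors):
    """Mimic the LAG/LEAD attempt: test each row against its direct neighbours."""
    count = 0
    for i, color in enumerate(colors):
        prev_color = colors[i - 1] if i > 0 else None                # LAG(color)
        next_color = colors[i + 1] if i + 1 < len(colors) else None  # LEAD(color)
        if (color == "yellow" and prev_color == "green"
                and next_color in (None, "red", "yellow")):
            count += 1
    return count

attempt = {id_: per_row_count(colors) for id_, colors in colors_by_id.items()}
print(attempt)  # {1: 2, 2: 1}
```

The second 'yellow' in a run sees 'yellow', not 'green', as its previous color, which is why id=1 yields 2 instead of 4.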
sql window-functions amazon-athena presto gaps-and-islands
1 Answer

You have a gaps and islands problem.

Here are the steps to solve it:

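cte - uses the difference between two row_number values to label each run of consecutive rows that share a color. The difference is constant within a run, but it identifies the run only in combination with id and color, because runs of different colors can produce the same difference.

cte2 & 3 - group the data by id and color (and the run label), so that each run becomes a single row and the previous and next color can be fetched easily with LAG and LEAD.

Then apply your conditions to the result of cte3 to get the expected data.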
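
The steps above can be sketched outside SQL as well. The following Python snippet is a hedged illustration, not the answer's method: it uses `itertools.groupby` in place of the row_number difference to build the islands, the names `colors_by_id` and `count_yellow` are hypothetical, and the colors are assumed to be listed in date_time order.

```python
from itertools import groupby

# Colors from the question's sample data, per id, already in date_time order.
colors_by_id = {
    1: ["green", "yellow", "yellow", "red", "yellow", "yellow",
        "red", "red", "green", "yellow", "yellow", "red"],
    2: ["red", "green", "green", "yellow"],
}

def count_yellow(colors):
    """Sum the sizes of yellow islands that directly follow a green island
    and are followed by a red island or by the end of the group."""
    # Collapse consecutive equal colors into (color, run_length) islands.
    islands = [(color, len(list(run))) for color, run in groupby(colors)]
    total = 0
    for i, (color, size) in enumerate(islands):
        prev_color = islands[i - 1][0] if i > 0 else None
        next_color = islands[i + 1][0] if i + 1 < len(islands) else None
        if (color == "yellow" and prev_color == "green"
                and next_color in ("red", None)):
            total += size
    return total

result = {id_: count_yellow(colors) for id_, colors in colors_by_id.items()}
print(result)  # {1: 4, 2: 1}
```

Because the conditions are checked once per island rather than once per row, every 'yellow' in a qualifying run is counted, which is what the per-row LAG/LEAD check misses.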

with data (id, date_time, color) as (
  VALUES 
  (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
),
cte as (
  SELECT id, date_time, color, 
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_time)
         - ROW_NUMBER() OVER (PARTITION BY id, color ORDER BY date_time) AS rn
  FROM data
),
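cte2 as (
  -- One row per run of consecutive equal colors.
  -- rn alone can repeat across colors (in the sample, a red run and a
  -- yellow run of id 1 both get rn = 5), so color is part of the grouping key.
  -- No ORDER BY here: ordering is handled by the window functions in cte3.
  select id, rn, color,
         max(date_time) as date_time,
         sum(case when color = 'yellow' then 1 end) as total
  from cte
  group by id, color, rn
),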
cte3 as (
  select *, LAG(color) OVER ( PARTITION BY id ORDER BY date_time) AS prev_color,
          LEAD(color) OVER ( PARTITION BY id ORDER BY date_time) AS next_color 
  from cte2
)
select id, sum(total) as total
from cte3
where color = 'yellow' and prev_color = 'green' and ( next_color = 'red' or next_color is null)
group by id;

Result:

id total
1 4
2 1

PostgreSQL demo
