Suppose I have a table with 3 columns: id, date_time, color. The data looks like this:
id, date_time, color
1, 2023-10-01 12:15, green
1, 2023-10-01 12:16, yellow
1, 2023-10-01 12:17, yellow
1, 2023-10-01 12:18, red
1, 2023-10-01 12:19, yellow
1, 2023-10-01 12:20, yellow
1, 2023-10-01 12:21, red
1, 2023-10-01 12:22, red
1, 2023-10-01 12:23, green
1, 2023-10-01 12:24, yellow
1, 2023-10-01 12:25, yellow
1, 2023-10-01 12:26, red
2, 2023-10-01 12:27, red
2, 2023-10-01 12:28, green
2, 2023-10-01 12:29, green
2, 2023-10-01 12:30, yellow
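I need to count how often the value "yellow" occurs in the color column, grouped by id and ordered by date_time. However, I have specific conditions:
It looks like a sub-window within a group.
I'm using Presto SQL in AWS Athena, and I believe I should use window functions, but I'm not sure how to specify these conditions.
Thanks in advance for any tips.
So the expected result should be: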
For id=1: Count "yellow" = 4
For id=2: Count "yellow" = 1
I tried this, but it lacks a cumulative counter for the repeated yellows that fall within the conditions.
`with testdata(id, date_time, color) as (
VALUES
(1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
(1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
(1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
(2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
(2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
(2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
(2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
)
,t1 as (
SELECT id,
date_time,
color,
LAG(color) OVER (
PARTITION BY id
ORDER BY date_time
) AS prev_color,
LEAD(color) OVER (
PARTITION BY id
ORDER BY date_time
) AS next_color
FROM testdata
)
select id,
SUM(
CASE
WHEN color = 'yellow'
AND (
prev_color = 'green'
and (
next_color IS NULL
OR next_color = 'red'
OR next_color = 'yellow'
)
) THEN 1 ELSE 0
END
) AS yellow_count
FROM t1
group by id`
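You have a gaps and islands problem.
Here are the steps to solve it:
cte - uses the difference between two row_number values to give each group of consecutive rows an identifier that is unique within its id and color.
cte2 & 3 - collapse each group of consecutive rows into a single row (cte2), so that the previous and next color of each group can easily be obtained with LAG and LEAD (cte3).
Then apply your conditions to the result of cte3 to get the expected data.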
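To make the first step concrete, here is a minimal sketch that prints the two row numbers and their difference for the first six rows of id = 1. The CTE name `sample_rows` and the aliases `rn_all`, `rn_color` and `grp` are illustrative names, not part of the query in this answer.

```sql
-- Sketch: inspect the key produced by the row_number difference.
-- sample_rows, rn_all, rn_color and grp are illustrative names.
with sample_rows (id, date_time, color) as (
    VALUES
    (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
    (1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
    (1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
    (1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
    (1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
    (1, cast('2023-10-01 12:20:00' as timestamp), 'yellow')
)
select id, date_time, color,
       -- position of the row among all rows of the same id
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_time) AS rn_all,
       -- position of the row among rows of the same id and color
       ROW_NUMBER() OVER (PARTITION BY id, color ORDER BY date_time) AS rn_color,
       -- the difference stays constant while the color repeats
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_time)
         - ROW_NUMBER() OVER (PARTITION BY id, color ORDER BY date_time) AS grp
from sample_rows
order by id, date_time;
-- color   rn_all  rn_color  grp
-- green   1       1         0
-- yellow  2       1         1
-- yellow  3       2         1
-- red     4       1         3
-- yellow  5       3         2
-- yellow  6       4         2
```

Both yellow runs get their own `grp` value (1 and 2), which is what lets them be counted separately. The same `grp` value can still appear under a different color, so it identifies a run only together with id and color.

The complete query, covering all of the steps: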
with data (id, date_time, color) as (
VALUES
(1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
(1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
(1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
(2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
(2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
(2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
(2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
),
cte as (
SELECT id, date_time, color,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_time)
- ROW_NUMBER() OVER (PARTITION BY id, color ORDER BY date_time) AS rn
FROM data
),
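cte2 as (
-- rn is only unique within one id and color: for id = 1 the red run at
-- 12:21-12:22 and the yellow run at 12:24-12:25 both get rn = 5,
-- so color has to be part of the grouping key
select id, rn,
color,
max(date_time) as date_time,
sum(case when color = 'yellow' then 1 end) as total
from cte c1
group by id, color, rn
),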
cte3 as (
select *, LAG(color) OVER ( PARTITION BY id ORDER BY date_time) AS prev_color,
LEAD(color) OVER ( PARTITION BY id ORDER BY date_time) AS next_color
from cte2
)
select id, sum(total) as total
from cte3
where color = 'yellow' and prev_color = 'green' and ( next_color = 'red' or next_color is null)
group by id;
Result:
id | total |
---|---|
1 | 4 |
2 | 1 |
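As a possible alternative, the runs can also be numbered with a change flag and a running sum instead of the row_number difference. This is only a sketch under the same conditions used above (a yellow run that directly follows green and is followed by red or by nothing). The CTE names `flagged`, `runs`, `islands` and `neighbours` and the columns `is_new_run`, `run_no` and `run_length` are names made up for this example.

```sql
with data (id, date_time, color) as (
    VALUES
    (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
    (1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
    (1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
    (1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
    (1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
    (1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
    (1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
    (1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
    (1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
    (1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
    (1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
    (1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
    (2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
    (2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
    (2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
    (2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
),
flagged as (
    -- 1 when the color differs from the previous row (or there is no previous row)
    select id, date_time, color,
           case when color = LAG(color) OVER (PARTITION BY id ORDER BY date_time)
                then 0 else 1 end as is_new_run
    from data
),
runs as (
    -- the running sum of the flags numbers the runs 1, 2, 3, ... within each id
    select id, date_time, color,
           SUM(is_new_run) OVER (PARTITION BY id ORDER BY date_time
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as run_no
    from flagged
),
islands as (
    -- one row per run, with the number of rows it contains
    select id, run_no, color, count(*) as run_length
    from runs
    group by id, run_no, color
),
neighbours as (
    select id, color, run_length,
           LAG(color) OVER (PARTITION BY id ORDER BY run_no) as prev_color,
           LEAD(color) OVER (PARTITION BY id ORDER BY run_no) as next_color
    from islands
)
select id, sum(run_length) as total
from neighbours
where color = 'yellow' and prev_color = 'green'
  and (next_color = 'red' or next_color is null)
group by id
order by id;
-- id = 1 -> total 4, id = 2 -> total 1
```

Here `run_no` increases with every color change, so it is unique per run within an id and can be used directly to order the runs for LAG and LEAD.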