I have the following table:
child_id | child_dob | parent_id | parent_dob |
---|---|---|---|
1 | 2021-01-04 | 1 | 2021-01-01 |
2 | 2021-01-30 | 1 | 2021-01-01 |
3 | 2021-03-10 | 2 | 2021-01-15 |
4 | 2021-04-13 | 2 | 2021-01-15 |
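For every month since the parent was born, I am trying to get how many children were born in that month, along with the date the first child was born.
So the final output should be:
month | parent_id | count_of_children_in_month | first_child_dob |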
---|---|---|---|
2021-01-01 | 1 | 2 | 2021-01-04 |
2021-02-01 | 1 | 0 | 2021-01-04 |
2021-03-01 | 1 | 0 | 2021-01-04 |
... | ... | ... | ... |
2021-01-01 | 2 | 0 | 2021-03-10 |
2021-02-01 | 2 | 0 | 2021-03-10 |
2021-03-01 | 2 | 1 | 2021-03-10 |
2021-04-01 | 2 | 1 | 2021-03-10 |
. | . | . | . |
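What I have so far is a partition on parent_dob with date_trunc, but I haven't found a good way to add consecutive months up to current_timestamp. Any pointers on how to make the window count one month at a time, and how to extend it up to current_timestamp?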
select count(dd.child_id) over w as count_of_children_in_month,
parent_dob,
min(dd.child_dob) over w as first_child_dob
from "awsdatacatalog"."stackoverflow"."desired_data" as dd
window w as (
partition by dd.parent_dob between date_trunc('month', dd.parent_dob) and current_timestamp
)
I am using Athena, so I can use all Trino functions.
You can do the following:
-- sample data
WITH dataset (child_id, child_dob, parent_id, parent_dob) AS (
values (1, date '2021-01-04', 1, date '2021-01-01'),
(2, date '2021-01-30', 1, date '2021-01-01'),
(3, date '2021-03-10', 2, date '2021-01-15'),
(4, date '2021-04-13', 2, date '2021-01-15')
),
-- query
-- generate all months in the range; note that sequence is limited to 10,000 elements
dates as (
select *
from unnest(sequence(date '2021-01-01',
date '2021-05-01', -- replace with current_date to run up to today
interval '1' month)) as t(dt)
),
-- generate all month/parents pairs
all_dates_parents as (
select *
from dates d
cross join (select distinct parent_id from dataset) t
)
-- generate the result by joining and aggregating
select r.*, count(ds.child_id) as count_of_children_in_month
from all_dates_parents r
left join dataset ds on r.parent_id = ds.parent_id and r.dt = date_trunc('month', child_dob)
group by r.dt, r.parent_id
-- optional ordering for output
order by r.parent_id, r.dt;
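The query above uses a fixed date range and does not return the first child's date of birth. A possible extension, offered as a sketch rather than a tested Athena query, starts each parent's months at that parent's own birth month, runs up to the current month, and carries min(child_dob) along. The CTE names parents and parent_months and the aliases first_month and month are illustrative choices, not part of the original query. It assumes no parent_dob lies in the future, since sequence raises an error when the start is after the stop with a positive step.

```sql
-- sample data (same as above)
WITH dataset (child_id, child_dob, parent_id, parent_dob) AS (
    VALUES (1, DATE '2021-01-04', 1, DATE '2021-01-01'),
           (2, DATE '2021-01-30', 1, DATE '2021-01-01'),
           (3, DATE '2021-03-10', 2, DATE '2021-01-15'),
           (4, DATE '2021-04-13', 2, DATE '2021-01-15')
),
-- one row per parent: birth month and first child's date of birth
parents AS (
    SELECT parent_id,
           date_trunc('month', min(parent_dob)) AS first_month,
           min(child_dob) AS first_child_dob
    FROM dataset
    GROUP BY parent_id
),
-- one row per (parent, month), from the parent's birth month to the current month
-- assumption: first_month is never later than the current month
parent_months AS (
    SELECT p.parent_id, p.first_child_dob, t.dt
    FROM parents p
    CROSS JOIN UNNEST(sequence(p.first_month,
                               date_trunc('month', current_date),
                               INTERVAL '1' MONTH)) AS t(dt)
)
-- count children per (parent, month); months without births yield 0
SELECT pm.dt AS month,
       pm.parent_id,
       count(ds.child_id) AS count_of_children_in_month,
       pm.first_child_dob
FROM parent_months pm
LEFT JOIN dataset ds
       ON ds.parent_id = pm.parent_id
      AND date_trunc('month', ds.child_dob) = pm.dt
GROUP BY pm.dt, pm.parent_id, pm.first_child_dob
ORDER BY pm.parent_id, pm.dt;
```

Because first_child_dob is computed once per parent, it repeats on every month row for that parent, which matches the layout of the desired output. Counting ds.child_id rather than * is what keeps months with no births at 0 after the left join.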