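I'm trying to create a column that shows, for each product, how many days ago its most recent promotion started. The count should keep running after the promotion ends, until the next promotion starts.
For example, I would like the following:
product | date | in_promo | days_since_last_promo |
---|---|---|---|
1 | 2023-10-01 | false | null |
1 | 2023-10-02 | false | null |
1 | 2023-10-03 | true | 0 |
1 | 2023-10-04 | true | 1 |
1 | 2023-10-05 | true | 2 |
1 | 2023-10-06 | false | 3 |
1 | 2023-10-07 | false | 4 |
1 | 2023-10-08 | true | 0 |
1 | 2023-10-09 | true | 1 |
1 | 2023-10-10 | false | 2 |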
In particular, I'm struggling to get the correct days_since_last_promo for these rows:
product | date | in_promo | days_since_last_promo |
---|---|---|---|
1 | 2023-10-06 | false | 3 |
1 | 2023-10-07 | false | 4 |
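I've been wrestling with lag(), row_number() and partitioning, but I can't figure it out. Is this possible in SQL?
I'd say it is related to this post, but we're trying to achieve something slightly different.
I have tried, for example: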
    select
        product
      , date
      , in_promo
      , row_number() over (
            partition by product, in_promo, seqnum_u - seqnum_uo
            order by date
        ) as days_since_last_promo
    from (
        select p.*,
               row_number() over (partition by product order by date) as seqnum_u,
               row_number() over (partition by product, in_promo order by date) as seqnum_uo
        from product_sales_data as p
    ) as t
But this gives me:
product | date | in_promo | days_since_last_promo |
---|---|---|---|
1 | 2023-10-01 | false | 1 |
1 | 2023-10-02 | false | 2 |
1 | 2023-10-03 | true | 1 |
1 | 2023-10-04 | true | 2 |
1 | 2023-10-05 | true | 3 |
1 | 2023-10-06 | false | 1 |
1 | 2023-10-07 | false | 2 |
1 | 2023-10-08 | true | 1 |
1 | 2023-10-09 | true | 2 |
1 | 2023-10-10 | false | 1 |
i.e. row_number restarts when in_promo = false.
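Here is a solution that uses Oracle syntax but only standard analytic functions. It assumes that a promotion starts on the first in_promo date of each run of rows ( in_promo* !in_promo+ ). It would probably be easier with MATCH_RECOGNIZE, but that is not widely supported outside Oracle: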
    with data(product, dat, in_promo) as (
        select 1, date '2023-10-01', 'false' from dual union all
        select 1, date '2023-10-02', 'false' from dual union all
        select 1, date '2023-10-03', 'true' from dual union all
        select 1, date '2023-10-04', 'true' from dual union all
        select 1, date '2023-10-05', 'true' from dual union all
        select 1, date '2023-10-06', 'false' from dual union all
        select 1, date '2023-10-07', 'false' from dual union all
        select 1, date '2023-10-08', 'true' from dual union all
        select 1, date '2023-10-09', 'true' from dual union all
        select 1, date '2023-10-10', 'false' from dual
    )
    select d.product, d.dat,
           -- running total of the day gaps within each promotion group
           sum(ndays) over(partition by product, grp order by dat) as days_since_last_promo
    from (
        select d.*,
               -- rows before the first promotion stay null
               case when in_promo = 0 and grp = 0 then null
               else
                   -- days since the previous row of the same group (0 for the group's first row)
                   nvl(
                       dat - last_value(dat) over(partition by product, grp order by dat
                                                  rows between unbounded preceding and 1 preceding),
                       0
                   )
               end as ndays
        from (
            select d.*,
                   -- group number: goes up by 1 at each promotion start
                   sum(change) over(partition by product order by dat) as grp
            from (
                select d.*,
                       -- change = 1 on the first in_promo row of a run, else 0
                       decode(in_promo, 1,
                              decode(1, lag(in_promo) over(partition by product order by dat), 0, 1),
                              0
                       ) as change
                from (select product, dat, decode(in_promo, 'true', 1, 0) as in_promo from data) d
            ) d
        ) d
    ) d
    order by dat
    ;
1 01/10/2023 00:00:00
1 02/10/2023 00:00:00
1 03/10/2023 00:00:00 0
1 04/10/2023 00:00:00 1
1 05/10/2023 00:00:00 2
1 06/10/2023 00:00:00 3
1 07/10/2023 00:00:00 4
1 08/10/2023 00:00:00 0
1 09/10/2023 00:00:00 1
1 10/10/2023 00:00:00 2
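For comparison, the same result can probably be reached by carrying the latest promotion start date forward with a running max() and subtracting dates, which is a different technique from the nested grp / ndays sums above. The sketch below is only that: it runs the query through Python's built-in sqlite3 module so it can be checked end to end, the promo_days table and the is_start / last_start names are made up for illustration, and julianday() is SQLite's way of doing date arithmetic (Oracle and Postgres would subtract the dates directly, BigQuery would use date_diff). It assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Sample rows copied from the question; 1 = in promotion, 0 = not.
rows = [
    (1, "2023-10-01", 0), (1, "2023-10-02", 0), (1, "2023-10-03", 1),
    (1, "2023-10-04", 1), (1, "2023-10-05", 1), (1, "2023-10-06", 0),
    (1, "2023-10-07", 0), (1, "2023-10-08", 1), (1, "2023-10-09", 1),
    (1, "2023-10-10", 0),
]

# promo_days, is_start and last_start are hypothetical names for this sketch.
QUERY = """
with flagged as (
    select product, dat, in_promo,
           -- 1 on the first day of each promotion, else 0
           case when in_promo = 1
                 and coalesce(lag(in_promo) over (partition by product order by dat), 0) = 0
                then 1 else 0 end as is_start
    from promo_days
),
carried as (
    select product, dat, in_promo,
           -- latest promotion start on or before this row (null before the first one)
           max(case when is_start = 1 then dat end)
               over (partition by product order by dat) as last_start
    from flagged
)
select product, dat, in_promo,
       -- SQLite-specific date difference in whole days
       cast(julianday(dat) - julianday(last_start) as integer) as days_since_last_promo
from carried
order by product, dat
"""


def days_since_last_promo(data):
    """Load the rows into an in-memory SQLite table and run the query."""
    con = sqlite3.connect(":memory:")
    con.execute("create table promo_days (product integer, dat text, in_promo integer)")
    con.executemany("insert into promo_days values (?, ?, ?)", data)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


result = days_since_last_promo(rows)
for row in result:
    print(row)
```

Because it subtracts dates rather than counting rows, this version should also cope with days that are missing from the data.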
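Not sure why your question was downvoted. It has clear sample data and an honest attempt at a solution.
The general idea is to first number the rows in sequence and flag every record where a promotion begins: either the product's first row is already true (in promotion), or the value switches from false to true.
Then do a self-join to get the row number of the most recent earlier change and take the difference. Adjust by 1.
This is in Postgres, but I think BigQuery supports all of the syntax.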
    create table some_sample_data
    ( product integer,
      _date date,
      in_promo varchar(100)
    );
    insert into some_sample_data values (1, '2023-10-01', 'false');
    insert into some_sample_data values (1, '2023-10-02', 'false');
    insert into some_sample_data values (1, '2023-10-03', 'true');
    insert into some_sample_data values (1, '2023-10-04', 'true');
    insert into some_sample_data values (1, '2023-10-05', 'true');
    insert into some_sample_data values (1, '2023-10-06', 'false');
    insert into some_sample_data values (1, '2023-10-07', 'false');
    insert into some_sample_data values (1, '2023-10-08', 'true');
    insert into some_sample_data values (1, '2023-10-09', 'true');
    insert into some_sample_data values (1, '2023-10-10', 'false');
    insert into some_sample_data values (2, '2023-10-01', 'true');
    insert into some_sample_data values (2, '2023-10-02', 'false');
    insert into some_sample_data values (2, '2023-10-03', 'false');
    insert into some_sample_data values (2, '2023-10-04', 'true');
    with sequenced_result as (
        select
            *,
            row_number() over (partition by product order by cast(_date as date) asc) rn,
            -- 1 when a promotion starts: in_promo is 'true' and the previous row
            -- is 'false' or does not exist
            case when in_promo = 'true'
                  and coalesce(lag(in_promo) over (partition by product order by cast(_date as date) asc), 'false') = 'false'
                 then 1
                 else 0
            end originated_as_or_change_to_true
        from some_sample_data
    ),
    earlier_snapshots as (
        -- for each row, the row number of the latest promotion start on or before it
        select t1.*,
               max(t2.rn) as rn_of_last_change
        from sequenced_result t1
        left join sequenced_result t2
          on t1.product = t2.product
         and t1._date >= t2._date
         and t2.originated_as_or_change_to_true = 1
        group by t1.product,
                 t1._date,
                 t1.in_promo,
                 t1.rn,
                 t1.originated_as_or_change_to_true
    )
    select product,
           _date,
           in_promo,
           -- the + 1 makes the count 1-based; drop it for the 0-based count in the question
           rn - rn_of_last_change + 1 as days_since_last_promo
    from earlier_snapshots
    order by product,
             _date
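If the self-join turns out to be slow on a large table, a running max() over the flagged row numbers may do the same job in a single pass. This is a sketch rather than a drop-in replacement: it runs in SQLite through Python's sqlite3 module purely so the result can be checked, the is_start alias is shorthand for originated_as_or_change_to_true (it is not in the query above), the cast on _date is dropped because the sample stores ISO date strings, and the + 1 is left out so the count is 0-based like the expected output in the question.

```python
import sqlite3

# Same sample rows as the insert statements above: (product, _date, in_promo).
flags_1 = ["false", "false", "true", "true", "true",
           "false", "false", "true", "true", "false"]
flags_2 = ["true", "false", "false", "true"]
sample = [(1, f"2023-10-{day:02d}", flag) for day, flag in enumerate(flags_1, start=1)]
sample += [(2, f"2023-10-{day:02d}", flag) for day, flag in enumerate(flags_2, start=1)]

QUERY = """
with sequenced as (
    select product, _date, in_promo,
           row_number() over (partition by product order by _date) as rn,
           -- is_start (hypothetical alias): 1 on the first row of each promotion
           case when in_promo = 'true'
                 and coalesce(lag(in_promo) over (partition by product order by _date), 'false') = 'false'
                then 1 else 0 end as is_start
    from some_sample_data
)
select product, _date, in_promo,
       -- row number minus the row number of the latest promotion start so far;
       -- null until the product's first promotion
       rn - max(case when is_start = 1 then rn end)
                over (partition by product order by _date) as days_since_last_promo
from sequenced
order by product, _date
"""

con = sqlite3.connect(":memory:")
con.execute("create table some_sample_data (product integer, _date text, in_promo text)")
con.executemany("insert into some_sample_data values (?, ?, ?)", sample)
result = con.execute(QUERY).fetchall()
con.close()

for row in result:
    print(row)
```

As with the self-join query, a difference of row numbers only equals a number of days when every product has exactly one row per day.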