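While trying to use PARTITION BY in SparkSQL, I ran into this complicated query: for each current row, use user_id, product_id, [create_date-3day, create_date+3day] as the data window and run some query over it (for example LAST_VALUE()). A key part is that, when querying, I also need to sort by ORDER BY NEW_DATE, so that the data in each window is ordered by new_date, which is the column I actually want to query on.
So my first idea was to use a clause like this: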
LAST_VALUE() over(PARTITION BY user_id,product_id ORDER BY create_date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING)
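Combining RANGE BETWEEN with ORDER BY should work for the window itself. But here the ORDER BY is tied to RANGE BETWEEN and only does the filtering into the window. I need a further ORDER BY to get the ORDER BY NEW_DATE ordering.
But a query like this does not work: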
LAST_VALUE() over(PARTITION BY user_id,product_id ORDER BY create_date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING ORDER BY NEW_DATE)
Using two ORDER BY clauses in this window specification does not compile. How can I achieve this? Or is there another way to do it?
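
The ORDER BY clause has a double purpose in window clauses. On the one hand it is used to determine the window; on the other hand it gives the function an ordering context.
To illustrate this: ORDER BY is used for the ROWS | RANGE BETWEEN clause. For MAX, COUNT, SUM, etc., it is used to create running totals (for example, with an ORDER BY clause, MAX gives us the maximum up to the current row rather than the maximum of the whole window). However, this second part only applies while the frame ends at the current row. With an explicit frame such as RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING it does not: MAX over that RANGE always gives us the MAX of the whole range, not a row-by-row running maximum.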
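Now, you want one ordering to arrive at the range window and then another ordering for your function. This means you need two steps: one to get the windows, and one to apply the function to these windows. In the following query I use MIN OVER to get a group key and then apply LAST_VALUE to each group: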
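WITH
  grouped AS
  (
    -- Step 1: derive a group key from the +/- 3 day window on create_date
    SELECT t.*,
           MIN(rowid) OVER (PARTITION BY user_id, product_id
                            ORDER BY create_date
                            RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS grp
    FROM mytable t
  )
-- Step 2: apply LAST_VALUE per group, ordered by new_date.
-- The explicit frame is needed because the default frame ends at the
-- current row, so LAST_VALUE would not look at later rows of the group.
SELECT
  grouped.*,
  LAST_VALUE(price) OVER (PARTITION BY grp
                          ORDER BY new_date
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_price
FROM grouped
ORDER BY grp, new_date;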

Oracle's MODEL clause lets you define different calculation contexts for the window and for the calculation.
See the code below:
select *
from t
model
  /*Partition by as in window spec*/
  partition by (user_id, product_id)
  /*Dimension column is what should be used to define window size*/
  dimension by (create_date)
  /*Put original value to use in FIRST_VALUE and ORDER BY column*/
  measures(value_, value_*0 as new_value, new_date)
  rules update (
    new_value[any] = max(value_) keep(dense_rank first order by new_date asc)[
      /*Window size applied to all "measures"
        in the above function, including NEW_DATE*/
      create_date between cv(create_date) - interval '3' day and cv(create_date) + interval '3' day
    ]
  )
order by 1,2,3 asc
For this sample data:
> SQL> create table t(
> 2 user_id number,
> 3 product_id number,
> 4 create_date date,
> 5 new_date date,
> 6 value_ number
> 7 )
> 8 /
>
> Table T created.
>
> SQL>
> SQL> begin
> 2 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-17',date '2023-12-17','1');
> 3 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-16',date '2023-12-18','2');
> 4 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-15',date '2023-12-15','3');
> 5 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-14',date '2023-12-16','4');
> 6 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-13',date '2023-12-12','5');
> 7 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-12',date '2023-12-14','6');
> 8 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-11',date '2023-12-11','7');
> 9 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-10',date '2023-12-11','8');
> 10 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-09',date '2023-12-14','9');
> 11 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-08',date '2023-12-16','10');
> 12 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-07',date '2023-12-15','11');
> 13 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-06',date '2023-12-14','12');
> 14 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-05',date '2023-12-14','13');
> 15 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-04',date '2023-12-17','14');
> 16 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-03',date '2023-12-18','15');
> 17 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-02',date '2023-12-10','16');
> 18 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-01',date '2023-12-09','17');
> 19 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-11-30',date '2023-12-11','18');
> 20 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-11-29',date '2023-12-16','19');
> 21 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-28',date '2023-12-15','20');
> 22 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-27',date '2023-12-14','21');
> 23 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-26',date '2023-12-17','22');
> 24 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-25',date '2023-12-16','23');
> 25 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-24',date '2023-12-11','24');
> 26 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-23',date '2023-12-15','25');
> 27 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-22',date '2023-12-16','26');
> 28 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-21',date '2023-12-10','27');
> 29 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-20',date '2023-12-12','28');
> 30 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-19',date '2023-12-14','29');
> 31 Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','3',date '2023-11-18',date '2023-12-11','30');
> 32 commit;
> 33 end;
> 34 /
>
> PL/SQL procedure successfully completed.
it returns:
> USER_ID PRODUCT_ID CREATE_DATE VALUE_ NEW_VALUE NEW_DATE
> ---------- ---------- ------------------- ---------- ---------- -------------------
> 0 0 2023-12-09 00:00:00 9 8 2023-12-14 00:00:00
> 0 0 2023-12-10 00:00:00 8 8 2023-12-11 00:00:00
> 0 0 2023-12-11 00:00:00 7 8 2023-12-11 00:00:00
> 0 0 2023-12-12 00:00:00 6 8 2023-12-14 00:00:00
> 0 0 2023-12-13 00:00:00 5 8 2023-12-12 00:00:00
> 0 0 2023-12-14 00:00:00 4 7 2023-12-16 00:00:00
> 0 0 2023-12-15 00:00:00 3 5 2023-12-15 00:00:00
> 0 0 2023-12-16 00:00:00 2 5 2023-12-18 00:00:00
> 0 0 2023-12-17 00:00:00 1 3 2023-12-17 00:00:00
> 0 1 2023-11-29 00:00:00 19 17 2023-12-16 00:00:00
> 0 1 2023-11-30 00:00:00 18 17 2023-12-11 00:00:00
> 0 1 2023-12-01 00:00:00 17 17 2023-12-09 00:00:00
> 0 1 2023-12-02 00:00:00 16 17 2023-12-10 00:00:00
> 0 1 2023-12-03 00:00:00 15 17 2023-12-18 00:00:00
> 0 1 2023-12-04 00:00:00 14 17 2023-12-17 00:00:00
> 0 1 2023-12-05 00:00:00 13 16 2023-12-14 00:00:00
> 0 1 2023-12-06 00:00:00 12 13 2023-12-14 00:00:00
> 0 1 2023-12-07 00:00:00 11 13 2023-12-15 00:00:00
> 0 1 2023-12-08 00:00:00 10 13 2023-12-16 00:00:00
> 1 2 2023-11-19 00:00:00 29 27 2023-12-14 00:00:00
> 1 2 2023-11-20 00:00:00 28 27 2023-12-12 00:00:00
> 1 2 2023-11-21 00:00:00 27 27 2023-12-10 00:00:00
> 1 2 2023-11-22 00:00:00 26 27 2023-12-16 00:00:00
> 1 2 2023-11-23 00:00:00 25 27 2023-12-15 00:00:00
> 1 2 2023-11-24 00:00:00 24 27 2023-12-11 00:00:00
> 1 2 2023-11-25 00:00:00 23 24 2023-12-16 00:00:00
> 1 2 2023-11-26 00:00:00 22 24 2023-12-17 00:00:00
> 1 2 2023-11-27 00:00:00 21 24 2023-12-14 00:00:00
> 1 2 2023-11-28 00:00:00 20 21 2023-12-15 00:00:00
> 1 3 2023-11-18 00:00:00 30 30 2023-12-11 00:00:00
>
> 30 rows selected.
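
As a cross-check, the same per-row window can be sketched without MODEL by using a plain self-join plus aggregation, which is a different technique from the one in the answer above. This is only a sketch under the assumption that (user_id, product_id, create_date) is unique in t, as it is in the sample data; the aliases cur and win are illustrative. Each row is joined to all rows of the same user and product within 3 days of its create_date, and the aggregate then picks the value belonging to the earliest new_date in that set.

```sql
SELECT cur.user_id,
       cur.product_id,
       cur.create_date,
       cur.value_,
       -- value_ of the row with the earliest new_date inside the window;
       -- MAX breaks ties between rows sharing that earliest new_date
       MAX(win.value_) KEEP (DENSE_RANK FIRST ORDER BY win.new_date ASC) AS new_value,
       cur.new_date
FROM t cur
JOIN t win
  ON  win.user_id    = cur.user_id
  AND win.product_id = cur.product_id
  -- window: create_date within 3 days before or after the current row
  AND win.create_date BETWEEN cur.create_date - INTERVAL '3' DAY
                          AND cur.create_date + INTERVAL '3' DAY
GROUP BY cur.user_id, cur.product_id, cur.create_date, cur.value_, cur.new_date
ORDER BY 1, 2, 3;
```

On the sample data this is expected to give the same NEW_VALUE column as the MODEL query. Replacing FIRST with LAST would give the LAST_VALUE-style result asked for in the question. Spark SQL has no KEEP clause; an aggregate such as max_by (available from Spark 3.0) could play a similar role there, although its handling of ties may differ.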