使用 SparkSQL 时 PARTITION BY 的“RANGE BETWEEN”问题

问题描述 投票:0回答:2

当我尝试在 SparkSQL 中使用

PARTITION BY
时,我遇到了这个复杂的查询:

对于当前的每一行,使用

user_id,product_id,[create_date-3day,create_date+3day]
作为数据窗口,进行一些数据查询(例如
LAST_VALUE()
)。一个关键部分是在查询时,我需要按列
ORDER BY NEW_DATE
进行进一步排序,这样我就可以确保每个窗口中的数据都按 New_date 排序——这是我想要查询的真正列。

所以一开始我的想法是使用这样的子句,看起来像这样:

LAST_VALUE() over(PARTITION BY user_id,product_id ORDER BY create_date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING)

使用与

ORDER BY
组合之间的范围应该可以。但这里 order by 与 range Between 结合使用,在窗口中进行过滤。我需要进一步的 ORDER BY 才能实现
ORDER BY NEW_DATE

但是这样的查询不起作用:

LAST_VALUE() over(PARTITION BY user_id,product_id ORDER BY create_date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING ORDER BY NEW_DATE)

在此子句中使用两个 orderby 将无法编译。我怎样才能进一步实现这一目标?或者有其他方法可以实现这一目标吗?

sql oracle apache-spark-sql hive
2个回答
0
投票

ORDER BY
子句在 Windows 子句中有双重用途。一方面它用于确定窗口,另一方面它为所使用的函数提供顺序上下文。

为了说明这一点:

  1. ORDER BY
    用于
    ROWS | RANGE BETWEEN
    子句。
  2. IN
    MAX
    COUNT
    SUM
    等。它用于创建运行总计(例如,使用
    ORDER BY
    子句,
    MAX
    为我们提供该行之前的最大值,而不是该行的最大值整个窗口)。

但是,第二部分仅适用于使用

ROWS
窗口时。当使用
RANGE
窗口时,则不会。
MAX
中的
RANGE
始终会为我们提供整个范围的
MAX
,而不是逐行递增的最大值。

现在,您想要一个订单到达范围窗口,然后再为您的函数提供另一个订单。这意味着您需要两个步骤,一是获取窗口,一是在这些窗口上应用该函数。在以下查询中,我使用

MIN OVER
获取组密钥,然后在每个组上应用
LAST_VALUE

WITH 
  grouped AS
  (
    SELECT t.*,
    MIN(rowid) OVER (PARTITION BY user_id, product_id
                     ORDER BY create_date
                     RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS grp
    FROM mytable t
  )
SELECT
  grouped.*,
  LAST_VALUE(price) OVER (PARTITION BY grp ORDER BY new_date) AS last_price
FROM grouped
ORDER BY grp, new_date;

0
投票

您可以使用

model
子句为窗口和计算定义不同的计算上下文。

请参阅下面的代码

SQL> select *
  2  from t
  3  model
  4    /*Partition by as in window spec*/
  5    partition by (user_id, product_id)
  6    /*Dimension column is what should be used to define window size*/
  7    dimension by (create_date)
  8    /*Put original value to use in FIRST_VALUE and ORDER BY column*/
  9    measures(value_, value_*0 as new_value, new_date)
 10    rules update (
 11      new_value[any] = max(value_) keep(dense_rank first order by new_date asc)[
 12          /*Window size applied to all "measures"
 13            in the above function, including NEW_DATE*/
 14          create_date between cv(create_date) - interval '3' day and cv(create_date) + interval '3' day
 15        ]
 16    )
 17  order by 1,2,3 asc

对于此示例数据:

> SQL> create table t(
>   2   user_id number, 
>   3   product_id number, 
>   4   create_date date, 
>   5   new_date date, 
>   6   value_ number
>   7  )
>   8  /
> 
> Table T created.
> 
> SQL> 
> SQL> begin
>   2  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-17',date '2023-12-17','1');
>   3  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-16',date '2023-12-18','2');
>   4  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-15',date '2023-12-15','3');
>   5  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-14',date '2023-12-16','4');
>   6  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-13',date '2023-12-12','5');
>   7  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-12',date '2023-12-14','6');
>   8  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-11',date '2023-12-11','7');
>   9  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-10',date '2023-12-11','8');
>  10  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','0',date '2023-12-09',date '2023-12-14','9');
>  11  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-08',date '2023-12-16','10');
>  12  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-07',date '2023-12-15','11');
>  13  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-06',date '2023-12-14','12');
>  14  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-05',date '2023-12-14','13');
>  15  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-04',date '2023-12-17','14');
>  16  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-03',date '2023-12-18','15');
>  17  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-02',date '2023-12-10','16');
>  18  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-12-01',date '2023-12-09','17');
>  19  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-11-30',date '2023-12-11','18');
>  20  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('0','1',date '2023-11-29',date '2023-12-16','19');
>  21  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-28',date '2023-12-15','20');
>  22  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-27',date '2023-12-14','21');
>  23  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-26',date '2023-12-17','22');
>  24  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-25',date '2023-12-16','23');
>  25  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-24',date '2023-12-11','24');
>  26  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-23',date '2023-12-15','25');
>  27  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-22',date '2023-12-16','26');
>  28  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-21',date '2023-12-10','27');
>  29  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-20',date '2023-12-12','28');
>  30  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','2',date '2023-11-19',date '2023-12-14','29');
>  31  Insert into T (USER_ID,PRODUCT_ID,CREATE_DATE,NEW_DATE,VALUE_) values ('1','3',date '2023-11-18',date '2023-12-11','30');
>  32  commit;
>  33  end;
>  34  /
> 
> PL/SQL procedure successfully completed.

退货

>    USER_ID PRODUCT_ID CREATE_DATE             VALUE_  NEW_VALUE NEW_DATE           
> ---------- ---------- ------------------- ---------- ---------- -------------------
>          0          0 2023-12-09 00:00:00          9          8 2023-12-14 00:00:00
>          0          0 2023-12-10 00:00:00          8          8 2023-12-11 00:00:00
>          0          0 2023-12-11 00:00:00          7          8 2023-12-11 00:00:00
>          0          0 2023-12-12 00:00:00          6          8 2023-12-14 00:00:00
>          0          0 2023-12-13 00:00:00          5          8 2023-12-12 00:00:00
>          0          0 2023-12-14 00:00:00          4          7 2023-12-16 00:00:00
>          0          0 2023-12-15 00:00:00          3          5 2023-12-15 00:00:00
>          0          0 2023-12-16 00:00:00          2          5 2023-12-18 00:00:00
>          0          0 2023-12-17 00:00:00          1          3 2023-12-17 00:00:00
>          0          1 2023-11-29 00:00:00         19         17 2023-12-16 00:00:00
>          0          1 2023-11-30 00:00:00         18         17 2023-12-11 00:00:00
>          0          1 2023-12-01 00:00:00         17         17 2023-12-09 00:00:00
>          0          1 2023-12-02 00:00:00         16         17 2023-12-10 00:00:00
>          0          1 2023-12-03 00:00:00         15         17 2023-12-18 00:00:00
>          0          1 2023-12-04 00:00:00         14         17 2023-12-17 00:00:00
>          0          1 2023-12-05 00:00:00         13         16 2023-12-14 00:00:00
>          0          1 2023-12-06 00:00:00         12         13 2023-12-14 00:00:00
>          0          1 2023-12-07 00:00:00         11         13 2023-12-15 00:00:00
>          0          1 2023-12-08 00:00:00         10         13 2023-12-16 00:00:00
>          1          2 2023-11-19 00:00:00         29         27 2023-12-14 00:00:00
>          1          2 2023-11-20 00:00:00         28         27 2023-12-12 00:00:00
>          1          2 2023-11-21 00:00:00         27         27 2023-12-10 00:00:00
>          1          2 2023-11-22 00:00:00         26         27 2023-12-16 00:00:00
>          1          2 2023-11-23 00:00:00         25         27 2023-12-15 00:00:00
>          1          2 2023-11-24 00:00:00         24         27 2023-12-11 00:00:00
>          1          2 2023-11-25 00:00:00         23         24 2023-12-16 00:00:00
>          1          2 2023-11-26 00:00:00         22         24 2023-12-17 00:00:00
>          1          2 2023-11-27 00:00:00         21         24 2023-12-14 00:00:00
>          1          2 2023-11-28 00:00:00         20         21 2023-12-15 00:00:00
>          1          3 2023-11-18 00:00:00         30         30 2023-12-11 00:00:00
> 
> 30 rows selected.
© www.soinside.com 2019 - 2024. All rights reserved.