PySpark window functions: multiple conditions in orderBy on rangeBetween/rowsBetween

Question

Is it possible to create a window function for rangeBetween or rowsBetween that can have multiple conditions in orderBy? Suppose I have a data frame like the one below.

user_id     timestamp               date        event
0040b5f0    2018-01-22 13:04:32     2018-01-22  1       
0040b5f0    2018-01-22 13:04:35     2018-01-22  0   
0040b5f0    2018-01-25 18:55:08     2018-01-25  1       
0040b5f0    2018-01-25 18:56:17     2018-01-25  1       
0040b5f0    2018-01-25 20:51:43     2018-01-25  1       
0040b5f0    2018-01-31 07:48:43     2018-01-31  1       
0040b5f0    2018-01-31 07:48:48     2018-01-31  0       
0040b5f0    2018-02-02 09:40:58     2018-02-02  1       
0040b5f0    2018-02-02 09:41:01     2018-02-02  0       
0040b5f0    2018-02-05 14:03:27     2018-02-05  1

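For each row, I need the sum of the event column values over dates no more than 3 days back. But I must not include events that happened later on the same day. I can create a window function such as: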

days = lambda i: i * 86400
my_window = Window\
                .partitionBy(["user_id"])\
                .orderBy(F.col("date").cast("timestamp").cast("long"))\
                .rangeBetween(-days(3), 0)

But this will include events that happened later on the same date. I need a window function that behaves like this (for the row marked with *):

user_id     timestamp               date        event
0040b5f0    2018-01-22 13:04:32     2018-01-22  1----|==============|   
0040b5f0    2018-01-22 13:04:35     2018-01-22  0  sum here       all events
0040b5f0    2018-01-25 18:55:08     2018-01-25  1 only           within 3 days 
* 0040b5f0  2018-01-25 18:56:17     2018-01-25  1----|              |
0040b5f0    2018-01-25 20:51:43     2018-01-25  1===================|       
0040b5f0    2018-01-31 07:48:43     2018-01-31  1       
0040b5f0    2018-01-31 07:48:48     2018-01-31  0       
0040b5f0    2018-02-02 09:40:58     2018-02-02  1       
0040b5f0    2018-02-02 09:41:01     2018-02-02  0       
0040b5f0    2018-02-05 14:03:27     2018-02-05  1

I tried to create something like:

days = lambda i: i * 86400
my_window = Window\
                .partitionBy(["user_id"])\
                .orderBy(F.col("date").cast("timestamp").cast("long"))\
                .rangeBetween(-days(3), Window.currentRow)\
                .orderBy(F.col("t_stamp"))\
                .rowsBetween(Window.unboundedPreceding, Window.currentRow)

But it only reflects the last orderBy.

The result table should look like this:

user_id     timestamp               date        event   event_last_3d
0040b5f0    2018-01-22 13:04:32     2018-01-22  1       1
0040b5f0    2018-01-22 13:04:35     2018-01-22  0       1
0040b5f0    2018-01-25 18:55:08     2018-01-25  1       2
0040b5f0    2018-01-25 18:56:17     2018-01-25  1       3
0040b5f0    2018-01-25 20:51:43     2018-01-25  1       4
0040b5f0    2018-01-31 07:48:43     2018-01-31  1       1
0040b5f0    2018-01-31 07:48:48     2018-01-31  0       1
0040b5f0    2018-02-02 09:40:58     2018-02-02  1       2
0040b5f0    2018-02-02 09:41:01     2018-02-02  0       2
0040b5f0    2018-02-05 14:03:27     2018-02-05  1       2
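
The rule behind this table can be pinned down with a plain-Python reference. This is only a sketch and not Spark code: the function name `event_last_3d` and the tuple layout are illustrative, and it assumes the `date` column always equals the date part of `timestamp`. For each row it sums `event` over rows of the same user whose date is at most 3 days earlier and whose timestamp is not later than the current row's.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (user_id, timestamp, event).
ROWS = [
    ("0040b5f0", "2018-01-22 13:04:32", 1),
    ("0040b5f0", "2018-01-22 13:04:35", 0),
    ("0040b5f0", "2018-01-25 18:55:08", 1),
    ("0040b5f0", "2018-01-25 18:56:17", 1),
    ("0040b5f0", "2018-01-25 20:51:43", 1),
    ("0040b5f0", "2018-01-31 07:48:43", 1),
    ("0040b5f0", "2018-01-31 07:48:48", 0),
    ("0040b5f0", "2018-02-02 09:40:58", 1),
    ("0040b5f0", "2018-02-02 09:41:01", 0),
    ("0040b5f0", "2018-02-05 14:03:27", 1),
]


def event_last_3d(rows, days=3):
    """Per row: sum of `event` for the same user, dated within `days` days
    before the row's date, up to and including the row's own timestamp."""
    parsed = [
        (user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), event)
        for user, ts, event in rows
    ]
    totals = []
    for user, ts, _ in parsed:
        earliest_date = ts.date() - timedelta(days=days)
        totals.append(
            sum(
                event
                for other_user, other_ts, event in parsed
                if other_user == user
                and other_ts.date() >= earliest_date  # date-level lower bound
                and other_ts <= ts                    # nothing later the same day
            )
        )
    return totals


print(event_last_3d(ROWS))
```

The printed list should match the event_last_3d column above, row by row.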

I've been stuck on this for a while, and I'd appreciate any suggestions on how to approach it.

Answer 1

I've written equivalent code in Scala that meets your requirement. I'd guess converting it to Python shouldn't be hard:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val DAY_SECS = 24*60*60 // Seconds in a day
// Given a timestamp in seconds, returns the seconds equivalent of 00:00:00 (UTC) of that date
val trimToDateBoundary = (d: Long) => (d / DAY_SECS) * DAY_SECS
// Using 4 for the range here, since your requirement is to cover the 3 previous days, which date-wise inclusive is 4 days.
// Note: Window.currentRow is the constant 0, so trimToDateBoundary(Window.currentRow) is 0 and the
// lower bound is simply 4 * DAY_SECS seconds before the current row's timestamp, not a date boundary.
// E.g. for a row at 25 Jan 18:55:08, the frame covers 21 Jan 18:55:08 up to the current timestamp.
val wSpec = Window.partitionBy("user_id").
                orderBy(col("timestamp").cast("long")).
                rangeBetween(trimToDateBoundary(Window.currentRow)-(4*DAY_SECS), Window.currentRow)
df.withColumn("sum", sum(col("event")).over(wSpec)).show()

Here is the output when applied to your data:

+--------+--------------------+--------------------+-----+---+
| user_id|           timestamp|                date|event|sum|
+--------+--------------------+--------------------+-----+---+
|0040b5f0|2018-01-22 13:04:...|2018-01-22 00:00:...|  1.0|1.0|
|0040b5f0|2018-01-22 13:04:...|2018-01-22 00:00:...|  0.0|1.0|
|0040b5f0|2018-01-25 18:55:...|2018-01-25 00:00:...|  1.0|2.0|
|0040b5f0|2018-01-25 18:56:...|2018-01-25 00:00:...|  1.0|3.0|
|0040b5f0|2018-01-25 20:51:...|2018-01-25 00:00:...|  1.0|4.0|
|0040b5f0|2018-01-31 07:48:...|2018-01-31 00:00:...|  1.0|1.0|
|0040b5f0|2018-01-31 07:48:...|2018-01-31 00:00:...|  0.0|1.0|
|0040b5f0|2018-02-02 09:40:...|2018-02-02 00:00:...|  1.0|2.0|
|0040b5f0|2018-02-02 09:41:...|2018-02-02 00:00:...|  0.0|2.0|
|0040b5f0|2018-02-05 14:03:...|2018-02-05 00:00:...|  1.0|2.0|
+--------+--------------------+--------------------+-----+---+
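
As a cross-check of this output, here is a plain-Python sketch (not Spark) that mimics the frame. Because `Window.currentRow` is the constant 0, the frame spans the 4 * 86400 seconds before each row's timestamp. The function name `range_frame_sum`, the row layout, and the extra 21 Jan event are illustrative assumptions, not part of the original data.

```python
from datetime import datetime

DAY_SECS = 24 * 60 * 60  # seconds in a day

# Sample rows from the question: (user_id, timestamp, event).
ROWS = [
    ("0040b5f0", "2018-01-22 13:04:32", 1),
    ("0040b5f0", "2018-01-22 13:04:35", 0),
    ("0040b5f0", "2018-01-25 18:55:08", 1),
    ("0040b5f0", "2018-01-25 18:56:17", 1),
    ("0040b5f0", "2018-01-25 20:51:43", 1),
    ("0040b5f0", "2018-01-31 07:48:43", 1),
    ("0040b5f0", "2018-01-31 07:48:48", 0),
    ("0040b5f0", "2018-02-02 09:40:58", 1),
    ("0040b5f0", "2018-02-02 09:41:01", 0),
    ("0040b5f0", "2018-02-05 14:03:27", 1),
]


def range_frame_sum(rows, lookback_secs=4 * DAY_SECS):
    """Per row: sum of `event` for the same user over timestamps in
    [ts - lookback_secs, ts], i.e. a seconds-based range frame."""
    parsed = [
        (user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), event)
        for user, ts, event in rows
    ]
    totals = []
    for user, ts, _ in parsed:
        totals.append(
            sum(
                event
                for other_user, other_ts, event in parsed
                if other_user == user
                and 0 <= (ts - other_ts).total_seconds() <= lookback_secs
            )
        )
    return totals


print(range_frame_sum(ROWS))

# Hypothetical extra event late on 21 Jan, prepended to the sample rows.
extra = [("0040b5f0", "2018-01-21 19:00:00", 1)] + ROWS
with_extra = range_frame_sum(extra)
print(with_extra[3])  # sum for the 2018-01-25 18:55:08 row
```

On the sample data this gives the same sums as the table. With the hypothetical 21 Jan event added, the 2018-01-25 18:55:08 row counts it too, although 21 Jan is 4 calendar days earlier, so the seconds-based frame and a strict date rule can disagree on other data.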

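I haven't used the "date" column, and I'm not sure how it would figure in your requirement. So if the date of the timestamp can differ from the date column, this solution doesn't cover it.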

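Note: a rangeBetween that accepts Column arguments was introduced in Spark 2.3.0, and it accepts date/timestamp type columns. So this solution could possibly be made more elegant.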
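
Since the question is about PySpark, here is a rough Python sketch of the same window. It has not been run against the asker's data, and it makes a few assumptions: the helper name `with_event_sum` and the `lookback_days` parameter are made up for illustration, and the `timestamp` column is assumed to be castable to a timestamp. The frame is the same as in the Scala version, that is, the 4 * 86400 seconds before each row's timestamp.

```python
DAY_SECS = 24 * 60 * 60  # seconds in a day


def with_event_sum(df, lookback_days=4):
    """Return `df` with a `sum` column: total of `event` per user over the
    `lookback_days` * DAY_SECS seconds ending at each row's timestamp."""
    # Imported inside the function so the helper can be defined
    # even where pyspark is not installed.
    from pyspark.sql import Window, functions as F

    w_spec = (
        Window.partitionBy("user_id")
        # Order by epoch seconds so the range frame is expressed in seconds.
        .orderBy(F.col("timestamp").cast("timestamp").cast("long"))
        .rangeBetween(-lookback_days * DAY_SECS, Window.currentRow)
    )
    return df.withColumn("sum", F.sum("event").over(w_spec))
```

Calling `with_event_sum(df).show()` on the asker's data frame would be the counterpart of the last Scala line above.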

Answer 2

Did you figure out how to solve this? I'm trying to do something similar, and I've been struggling with it for a while.
