PySpark: for each month, cumulative sum of the previous 3 months

Question

I'm using PySpark and trying to compute, for each given month, the cumulative sum of the last 3 months:

Example:

Month   Value
Jan/19    1
Feb/19    0
Mar/19    4
Apr/19    5
May/19    0
Jun/19   10

So the accumulated value for each month, over the preceding months, would be:

Month   Value
Jan/19    1
Feb/19  1 + 0 = 1
Mar/19  1+0+4 = 5
Apr/19  0+4+5 = 9
May/19  4+5+0 = 9
Jun/19  5+0+10 = 15

I'm pretty sure I need to use window and partition functions, but I don't know how to set it up.

Can anyone help me?

Thanks

Answer 1

Sample DataFrame:

df.show()
+------+-----+
| Month|Value|
+------+-----+
|Jan/19|    1|
|Feb/19|    0|
|Mar/19|    4|
|Apr/19|    5|
|May/19|    0|
|Jun/19|   10|
+------+-----+

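You can use a window function, but you first need to convert the Month column to a proper timestamp and then cast it to long, so that the range (3 months) is computed in seconds of Unix time. You can partition by whatever grouping columns exist in your real data. (86400 is 1 day in seconds.)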

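from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order by the timestamp in epoch seconds and look back 89 days (86400 seconds per day).
# Caveat: Feb 1 to May 1 is exactly 89 days in a non-leap year, so May's window
# can also reach February (harmless here because Feb/19 is 0).
w = Window.orderBy(F.col("Month").cast("long")).rangeBetween(-(86400 * 89), 0)

(df
 .withColumn("Month", F.to_timestamp("Month", "MMM/yy"))  # "Jan/19" -> 2019-01-01 00:00:00
 .withColumn("Sum", F.sum("Value").over(w))
 .show())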

+-------------------+-----+---+
|              Month|Value|Sum|
+-------------------+-----+---+
|2019-01-01 00:00:00|    1|  1|
|2019-02-01 00:00:00|    0|  1|
|2019-03-01 00:00:00|    4|  5|
|2019-04-01 00:00:00|    5|  9|
|2019-05-01 00:00:00|    0|  9|
|2019-06-01 00:00:00|   10| 15|
+-------------------+-----+---+
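
The 89-day range stands in for "three months", but months differ in length, so it is worth checking where the approximation holds. The following is a minimal standard-library sketch, not part of the original answer; `days_back` and `WINDOW_DAYS` are names made up for illustration. It lists the months of 2019 whose 89-day window would also reach the first day of the month three months earlier.

```python
from datetime import date

WINDOW_DAYS = 89  # same look-back as rangeBetween(-(86400 * 89), 0)

def days_back(year, month, months_back):
    """Days from the 1st of the month `months_back` months earlier
    to the 1st of (year, month)."""
    idx = year * 12 + (month - 1) - months_back  # zero-based running month index
    start = date(idx // 12, idx % 12 + 1, 1)
    return (date(year, month, 1) - start).days

# The month two steps back must always fall inside the window.
assert all(days_back(2019, m, 2) <= WINDOW_DAYS for m in range(1, 13))

# Months whose window would also include the month three steps back.
leaky = [m for m in range(1, 13) if days_back(2019, m, 3) <= WINDOW_DAYS]
print(leaky)  # [5]
```

In 2019 only May is affected, because Feb 1 to May 1 is exactly 89 days; whether the boundary row is really included also depends on the session time zone and daylight saving shifts. With the sample data the result is unchanged, since the February value is 0.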

If you want to go back 3 months only within each year, meaning that Jan/19 would only contain the Jan/19 value, then you should partitionBy the year, orderBy the month number, and use rangeBetween(-2, 0).

# Restart the window every year; order by month number so [-2, 0] spans 3 months.
w = Window.partitionBy(F.year("Month")).orderBy(F.month("Month")).rangeBetween(-2, 0)

(df
 .withColumn("Month", F.to_timestamp("Month", "MMM/yy"))
 .withColumn("Sum", F.sum("Value").over(w))
 .show())

+-------------------+-----+---+
|              Month|Value|Sum|
+-------------------+-----+---+
|2019-01-01 00:00:00|    1|  1|
|2019-02-01 00:00:00|    0|  1|
|2019-03-01 00:00:00|    4|  5|
|2019-04-01 00:00:00|    5|  9|
|2019-05-01 00:00:00|    0|  9|
|2019-06-01 00:00:00|   10| 15|
+-------------------+-----+---+
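
If the window should cross year boundaries without depending on how many days each month has, one possible variation is to order by a running month index (year * 12 + month) and keep the range at -2 to 0. In PySpark that would presumably mean ordering the window by something like F.year("Month") * 12 + F.month("Month"); that expression is a suggestion, not something from the answer above. The plain-Python sketch below only illustrates the idea and can serve as a reference to cross-check the Spark output; `month_index` and `rolling_three_month_sum` are hypothetical helper names.

```python
from datetime import date

# Sample data from the question, with each month as the first day of that month.
rows = [
    (date(2019, 1, 1), 1),
    (date(2019, 2, 1), 0),
    (date(2019, 3, 1), 4),
    (date(2019, 4, 1), 5),
    (date(2019, 5, 1), 0),
    (date(2019, 6, 1), 10),
]

def month_index(d):
    """Put months on a single axis so consecutive months differ by exactly 1."""
    return d.year * 12 + d.month

def rolling_three_month_sum(rows):
    """For each row, sum the values whose month index lies in [current - 2, current]."""
    result = []
    for d, _ in rows:
        current = month_index(d)
        total = sum(v for other, v in rows
                    if current - 2 <= month_index(other) <= current)
        result.append((d, total))
    return result

sums = [total for _, total in rolling_three_month_sum(rows)]
print(sums)  # [1, 1, 5, 9, 9, 15]
```

Because December and the following January differ by 1 on this axis, a January row would pick up the previous November and December as well, which is the behaviour the first window aims for.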
