pyspark: replace nulls with a calculation based on the last non-null value

Problem description

Hello, my question is related to [Fill in null with previously known good value with pyspark], but with a slight change in requirements:

   data:                                        expected output:       
   +------+-----+---------+---------+-----+     +------+-----+---------+---------+-----+
   |  item|store|timestamp|sales_qty|stock|     |  item|store|timestamp|sales_qty|stock|
   +------+-----+---------+---------+-----+     +------+-----+---------+---------+-----+
   |673895|35578| 20180101|        1| null|     |673895|35578| 20180101|        1| null|
   |673895|35578| 20180102|        0|  110|     |673895|35578| 20180102|        0|  110|
   |673895|35578| 20180103|        1| null|     |673895|35578| 20180103|        1|  109|
   |673895|35578| 20180104|        0| null|     |673895|35578| 20180104|        0|  109|
   |673895|35578| 20180105|        0|  109|  => |673895|35578| 20180105|        0|  109|
   |673895|35578| 20180106|        1| null|     |673895|35578| 20180106|        1|  108|
   |673895|35578| 20180107|        0|  108|     |673895|35578| 20180107|        0|  108|
   |673895|35578| 20180108|        0| null|     |673895|35578| 20180108|        0|  108|
   |673895|35578| 20180109|        0| null|     |673895|35578| 20180109|        0|  108|
   |673895|35578| 20180110|        1| null|     |673895|35578| 20180110|        1|  107|
   +------+-----+---------+---------+-----+     +------+-----+---------+---------+-----+

My expected output is based on the last known non-null value and on sales_qty: wherever stock is null, it should be derived from the last non-null stock, adjusted by the intervening sales_qty. I have tried the following logic:

    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    my_window = Window.partitionBy('item', 'store').orderBy('timestamp')
    df = df.withColumn(
        'stock',
        F.when(
            F.isnull(F.col('stock')),
            F.lag(df.stock).over(my_window) - F.col('sales_qty')
        ).otherwise(F.col('stock'))
    )

However, it only fills a single null, because F.lag reads the original stock column and returns null again as soon as the previous row is also null. Can someone help me reach the expected result?

Note: the quantity does not always decrease consecutively, so the last non-null value has to be taken into account when computing the new quantity.
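
For reference, the plain forward fill from the linked question can be sketched with F.last and ignorenulls=True; it only carries the last non-null stock forward and does not apply the sales_qty adjustment I need (stock_filled is an illustrative column name):

    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    w = Window.partitionBy('item', 'store').orderBy('timestamp')

    # Unlike F.lag, F.last with ignorenulls=True looks back past an
    # entire run of nulls to the last known value.
    df = df.withColumn('stock_filled', F.last('stock', ignorenulls=True).over(w))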

pyspark pyspark-sql pyspark-dataframes
1 Answer
You can try this. I basically start by generating two helper columns: first (the first non-null stock value, 110 here) and stock2 (essentially a running sum of sales_qty), and then subtract one from the other to get the desired stock.

    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    # w is a running window per item/store; w2 spans the whole partition
    # and is used to grab the first non-null stock value.
    w = Window().partitionBy("item", "store").orderBy("timestamp")
    w2 = Window().partitionBy("item", "store").orderBy("timestamp")\
                 .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    # stock2: cumulative sales so far; the - F.lit(1) offsets the single sale
    # on the first row, which the first known stock value (110) already reflects.
    df.withColumn("stock1", F.when(F.col("stock").isNull(), F.lit(0)).otherwise(F.col("stock")))\
      .withColumn("stock2", F.sum("sales_qty").over(w) - F.lit(1))\
      .withColumn("first", F.first("stock", True).over(w2))\
      .withColumn("stock", F.col("first") - F.col("stock2"))\
      .drop("stock1", "stock2", "first")\
      .show()

    +------+-----+---------+---------+-----+
    |  item|store|timestamp|sales_qty|stock|
    +------+-----+---------+---------+-----+
    |673895|35578| 20180101|        1|  110|
    |673895|35578| 20180102|        0|  110|
    |673895|35578| 20180103|        1|  109|
    |673895|35578| 20180104|        0|  109|
    |673895|35578| 20180105|        0|  109|
    |673895|35578| 20180106|        1|  108|
    |673895|35578| 20180107|        0|  108|
    |673895|35578| 20180108|        0|  108|
    |673895|35578| 20180109|        0|  108|
    |673895|35578| 20180110|        1|  107|
    +------+-----+---------+---------+-----+

If you want to force the first value to be null instead of 110 (as shown in your desired output), you can use this (it basically uses row_number to replace that first 110 with null):

    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    w = Window().partitionBy("item", "store").orderBy("timestamp")
    w2 = Window().partitionBy("item", "store").orderBy("timestamp")\
                 .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    df.withColumn("stock1", F.when(F.col("stock").isNull(), F.lit(0)).otherwise(F.col("stock")))\
      .withColumn("stock2", F.sum("sales_qty").over(w) - F.lit(1))\
      .withColumn("first", F.first("stock", True).over(w2))\
      .withColumn("stock", F.col("first") - F.col("stock2"))\
      .withColumn("num", F.row_number().over(w))\
      .withColumn("stock", F.when(F.col("num") == 1, F.lit(None)).otherwise(F.col("stock")))\
      .drop("stock1", "stock2", "first", "num")\
      .show()

    +------+-----+---------+---------+-----+
    |  item|store|timestamp|sales_qty|stock|
    +------+-----+---------+---------+-----+
    |673895|35578| 20180101|        1| null|
    |673895|35578| 20180102|        0|  110|
    |673895|35578| 20180103|        1|  109|
    |673895|35578| 20180104|        0|  109|
    |673895|35578| 20180105|        0|  109|
    |673895|35578| 20180106|        1|  108|
    |673895|35578| 20180107|        0|  108|
    |673895|35578| 20180108|        0|  108|
    |673895|35578| 20180109|        0|  108|
    |673895|35578| 20180110|        1|  107|
    +------+-----+---------+---------+-----+

Additional data

Inputs and outputs:

    #input1
    +------+-----+---------+---------+-----+
    |  item|store|timestamp|sales_qty|stock|
    +------+-----+---------+---------+-----+
    |673895|35578| 20180101|        1| null|
    |673895|35578| 20180102|        0|  110|
    |673895|35578| 20180103|        1| null|
    |673895|35578| 20180104|        3| null|
    |673895|35578| 20180105|        0|  109|
    |673895|35578| 20180106|        1| null|
    |673895|35578| 20180107|        0|  108|
    |673895|35578| 20180108|        4| null|
    |673895|35578| 20180109|        0| null|
    |673895|35578| 20180110|        1| null|
    +------+-----+---------+---------+-----+

    #output1
    +------+-----+---------+---------+-----+
    |  item|store|timestamp|sales_qty|stock|
    +------+-----+---------+---------+-----+
    |673895|35578| 20180101|        1| null|
    |673895|35578| 20180102|        0|  110|
    |673895|35578| 20180103|        1|  109|
    |673895|35578| 20180104|        3|  106|
    |673895|35578| 20180105|        0|  106|
    |673895|35578| 20180106|        1|  105|
    |673895|35578| 20180107|        0|  105|
    |673895|35578| 20180108|        4|  101|
    |673895|35578| 20180109|        0|  101|
    |673895|35578| 20180110|        1|  100|
    +------+-----+---------+---------+-----+

    #input2
    +------+-----+---------+---------+-----+
    |  item|store|timestamp|sales_qty|stock|
    +------+-----+---------+---------+-----+
    |673895|35578| 20180101|        1| null|
    |673895|35578| 20180102|        0|  110|
    |673895|35578| 20180103|        1| null|
    |673895|35578| 20180104|        7| null|
    |673895|35578| 20180105|        0|  102|
    |673895|35578| 20180106|        0| null|
    |673895|35578| 20180107|        4|   98|
    |673895|35578| 20180108|        0| null|
    |673895|35578| 20180109|        0| null|
    |673895|35578| 20180110|        1| null|
    +------+-----+---------+---------+-----+

    #output2
    +------+-----+---------+---------+-----+
    |  item|store|timestamp|sales_qty|stock|
    +------+-----+---------+---------+-----+
    |673895|35578| 20180101|        1| null|
    |673895|35578| 20180102|        0|  110|
    |673895|35578| 20180103|        1|  109|
    |673895|35578| 20180104|        7|  102|
    |673895|35578| 20180105|        0|  102|
    |673895|35578| 20180106|        0|  102|
    |673895|35578| 20180107|        4|   98|
    |673895|35578| 20180108|        0|   98|
    |673895|35578| 20180109|        0|   98|
    |673895|35578| 20180110|        1|   97|
    +------+-----+---------+---------+-----+

If the stock quantity is not consecutive, like this:

    df.show()

    +------+-----+---------+---------+-----+
    |  item|store|timestamp|sales_qty|stock|
    +------+-----+---------+---------+-----+
    |673895|35578| 20180101|        1| null|
    |673895|35578| 20180102|        0|  110|
    |673895|35578| 20180103|        1| null|
    |673895|35578| 20180104|        7| null|
    |673895|35578| 20180105|        0|  112|
    |673895|35578| 20180106|        2| null|
    |673895|35578| 20180107|        0|  107|
    |673895|35578| 20180108|        0| null|
    |673895|35578| 20180109|        0| null|
    |673895|35578| 20180110|        1| null|
    +------+-----+---------+---------+-----+

you can use this (basically, I compute a dynamic window anchored at each non-null value):

    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    w = Window().partitionBy("item", "store").orderBy("timestamp")
    w3 = Window().partitionBy("item", "store", "stock5").orderBy("timestamp")

    # stock1: stock with nulls replaced by 0
    # stock5: a running group id that jumps at every non-null stock, so each
    #         non-null row opens a group covering the nulls that follow it
    # stock6: running sum of stock1 within the group, i.e. the group's
    #         non-null stock (0 while no stock has been seen yet)
    # sum:    sales accumulated within the group after that non-null row
    df.withColumn("stock1", F.when(F.col("stock").isNull(), F.lit(0)).otherwise(F.col("stock")))\
      .withColumn("stock4", F.when(F.col("stock1") != 0, F.rank().over(w)).otherwise(F.col("stock1")))\
      .withColumn("stock5", F.sum("stock4").over(w))\
      .withColumn("stock6", F.sum("stock1").over(w3))\
      .withColumn("sum", F.sum(F.when(F.col("stock1") != F.col("stock6"), F.col("sales_qty")).otherwise(F.lit(0))).over(w3))\
      .withColumn("stock2", F.when(F.col("sales_qty") != 0, F.col("stock6") - F.col("sum")).otherwise(F.col("stock")))\
      .withColumn("stock", F.when((F.col("stock2").isNull()) & (F.col("sales_qty") == 0), F.col("stock6") - F.col("sum")).otherwise(F.col("stock2")))\
      .drop("stock1", "stock4", "stock5", "stock6", "sum", "stock2")\
      .show()

    +------+-----+---------+---------+-----+
    |  item|store|timestamp|sales_qty|stock|
    +------+-----+---------+---------+-----+
    |673895|35578| 20180101|        1|    0|
    |673895|35578| 20180102|        0|  110|
    |673895|35578| 20180103|        1|  109|
    |673895|35578| 20180104|        7|  102|
    |673895|35578| 20180105|        0|  112|
    |673895|35578| 20180106|        2|  110|
    |673895|35578| 20180107|        0|  107|
    |673895|35578| 20180108|        0|  107|
    |673895|35578| 20180109|        0|  107|
    |673895|35578| 20180110|        1|  106|
    +------+-----+---------+---------+-----+
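
The same dynamic-window idea can also be expressed with a run id built from a running count of non-null values. This is an alternative minimal sketch, not part of the original answer; the column names grp, anchor and sold are illustrative:

    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    w = Window.partitionBy("item", "store").orderBy("timestamp")

    # F.count skips nulls, so grp increases at every non-null stock and each
    # non-null row anchors the run of nulls that follows it.
    df2 = df.withColumn("grp", F.count("stock").over(w))

    wg = Window.partitionBy("item", "store", "grp").orderBy("timestamp")

    # anchor: the run's non-null stock (the first row of each group);
    # sold:   sales accumulated after that anchor row within the group.
    result = df2\
        .withColumn("anchor", F.first("stock", ignorenulls=True).over(wg))\
        .withColumn("sold", F.sum("sales_qty").over(wg) - F.first("sales_qty").over(wg))\
        .withColumn("stock", F.coalesce(F.col("stock"), F.col("anchor") - F.col("sold")))\
        .drop("grp", "anchor", "sold")

On the non-consecutive input above this should match the answer's query, except that leading rows with no observed stock stay null instead of becoming 0.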