I have a dataframe in PySpark that looks like this:
row_id, group_id
1, 1
2, null
3, null
4, null
5, 5
6, null
7, null
8, 8
9, null
10, null
11, null
12, null
and so on: row_id is a sequential number (incrementing and unique), and group_id is the unique ID of a group that runs from the row where a value first appears up to the row before the next value appears. The task is to fill in all the null values in the dataframe, like this:
row_id, group_id
1, 1
2, 1
3, 1
4, 1
5, 5
6, 5
7, 5
8, 8
9, 8
10, 8
11, 8
12, 8
The number of records in each group is unknown (the sample shows small groups, but it could be around 100), and the dataframe is millions of rows long.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Your original dataframe
data = [(1, 1), (2, None), (3, None), (4, None), (5, 5), (6, None), (7, None), (8, 8), (9, None), (10, None), (11, None), (12, None)]
columns = ["row_id", "group_id"]
df = spark.createDataFrame(data, columns)
# Define a running window over all rows from the start up to the current row
# (note: orderBy without partitionBy pulls the whole dataframe into a single
# partition, which can be slow for very large data)
windowSpec = Window.orderBy("row_id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Forward-fill: take the last non-null group_id seen at or before the current row
filled_df = df.withColumn("group_id", F.last("group_id", ignorenulls=True).over(windowSpec))
# Show the resulting dataframe
filled_df.show()
+------+--------+
|row_id|group_id|
+------+--------+
| 1| 1|
| 2| 1|
| 3| 1|
| 4| 1|
| 5| 5|
| 6| 5|
| 7| 5|
| 8| 8|
| 9| 8|
| 10| 8|
| 11| 8|
| 12| 8|
+------+--------+
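The fill rule that F.last("group_id", ignorenulls=True) implements over an unbounded-preceding window is a plain forward fill: each null takes the most recent non-null value. As a sanity check, the same semantics can be sketched in pure Python (forward_fill is a hypothetical helper name, not a PySpark function):

```python
def forward_fill(values):
    # Replace each None with the most recent non-None value seen so far;
    # leading Nones (no earlier value) stay None, matching ignorenulls=True.
    filled, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled

group_ids = [1, None, None, None, 5, None, None, 8, None, None, None, None]
print(forward_fill(group_ids))
# → [1, 1, 1, 1, 5, 5, 5, 8, 8, 8, 8, 8]
```

This matches the output table above row for row; the window function simply performs the same scan in order of row_id.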