Implementing an auto-increment column in a DataFrame

Question

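I am trying to implement an auto-increment column in a DataFrame. I have already found a solution, but I would like to know whether there is a better way to do it.

I am using the monotonically_increasing_id() function from pyspark.sql.functions. The problem is that it starts at 0 and I want it to start at 1.

So I did the following, and it works: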

(F.monotonically_increasing_id()+1).alias("songplay_id")

(
    dfLog.join(dfSong, (dfSong.artist_name == dfLog.artist) & (dfSong.title == dfLog.song))
    .select(
        (F.monotonically_increasing_id() + 1).alias("songplay_id"),  # shift so IDs start at 1
        dfLog.ts.alias("start_time"),
        dfLog.userId.alias("user_id"),
        dfLog.level,
        dfSong.song_id,
        dfSong.artist_id,
        dfLog.sessionId.alias("session_id"),
        dfLog.location,
        dfLog.userAgent.alias("user_agent"),
    )
)

Is there a better way to achieve what I am trying to do? Or would implementing a UDF just for this be too much work?

Thanks.

Answer 1

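The values produced by monotonically_increasing_id are not guaranteed to be consecutive, only monotonically increasing. Each task of your job is assigned a starting integer and increments it by 1 for every row, but there will be a gap between the last ID of one batch and the first ID of the next. To verify this behavior, you can create a job with two tasks by repartitioning a sample DataFrame: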

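import pandas as pd
import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

# Reuse the active session if there is one (for example in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# Two partitions means two tasks, each with its own starting ID
spark.createDataFrame(pd.DataFrame([[i] for i in range(10)], columns=['value'])) \
    .repartition(2) \
    .withColumn('id', psf.monotonically_increasing_id()) \
    .show()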
        +-----+----------+
        |value|        id|
        +-----+----------+
        |    3|         0|
        |    0|         1|
        |    6|         2|
        |    2|         3|
        |    4|         4|
        |    7|8589934592|
        |    5|8589934593|
        |    8|8589934594|
        |    9|8589934595|
        |    1|8589934596|
        +-----+----------+
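
The jump to 8589934592 in the output above is 2^33. According to the Spark documentation, the current implementation puts the partition index in the upper 31 bits of the ID and the record number within the partition in the lower 33 bits. The following is a minimal plain-Python sketch of that layout, not Spark code; the helper names compose_id and decompose_id are made up for illustration and are not part of PySpark.

```python
from typing import Tuple

# Assumed layout (as documented for the current Spark implementation):
# upper 31 bits = partition index, lower 33 bits = record number in the partition.
RECORD_BITS = 33
RECORD_MASK = (1 << RECORD_BITS) - 1


def compose_id(partition_index: int, record_number: int) -> int:
    """Build an ID the way the documented layout describes (hypothetical helper)."""
    return (partition_index << RECORD_BITS) + record_number


def decompose_id(generated_id: int) -> Tuple[int, int]:
    """Split an ID back into (partition index, record number) (hypothetical helper)."""
    return generated_id >> RECORD_BITS, generated_id & RECORD_MASK


if __name__ == "__main__":
    print(compose_id(1, 0))          # first ID of the second partition
    print(decompose_id(8589934596))  # last ID shown in the table above
```

Read this way, the IDs 0 to 4 in the table belong to partition 0 and the IDs starting at 8589934592 belong to partition 1, which is why the gap appears exactly where the second task begins.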

To make sure your index yields consecutive values, you can use a window function:

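from pyspark.sql import Window

# A window ordered by the monotonic ID; with no partitionBy, Spark moves
# all rows into a single partition to number them, which can be slow on large data
w = Window.orderBy('id')
spark.createDataFrame(pd.DataFrame([[i] for i in range(10)], columns=['value'])) \
    .withColumn('id', psf.monotonically_increasing_id()) \
    .withColumn('id2', psf.row_number().over(w)) \
    .show()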
        +-----+---+---+
        |value| id|id2|
        +-----+---+---+
        |    0|  0|  1|
        |    1|  1|  2|
        |    2|  2|  3|
        |    3|  3|  4|
        |    4|  4|  5|
        |    5|  5|  6|
        |    6|  6|  7|
        |    7|  7|  8|
        |    8|  8|  9|
        |    9|  9| 10|
        +-----+---+---+

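Notes:

monotonically_increasing_id lets you assign an order to rows as they are read; it starts at 0 for the first task and keeps increasing, but not necessarily consecutively
row_number indexes the rows of an ordered window consecutively, starting at 1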
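
To make the contrast between the two notes concrete, here is a small plain-Python imitation of what row_number over a window ordered by id does to gapped IDs. It is only a sketch of the idea and does not call Spark; the function name consecutive_ids is hypothetical, and the input values are the IDs from the two-partition output earlier in this answer.

```python
from typing import Dict, Iterable


def consecutive_ids(gapped_ids: Iterable[int]) -> Dict[int, int]:
    """Map each gapped ID to its 1-based position in sorted order.

    This mimics numbering rows by position in an ordered window;
    it is an illustration only, not how Spark implements row_number.
    """
    ordered = sorted(gapped_ids)
    return {original: position for position, original in enumerate(ordered, start=1)}


if __name__ == "__main__":
    ids_from_two_partitions = [
        0, 1, 2, 3, 4,
        8589934592, 8589934593, 8589934594, 8589934595, 8589934596,
    ]
    for original, position in consecutive_ids(ids_from_two_partitions).items():
        print(original, "->", position)
```

The ordering of the original IDs is preserved while the gap between the partitions disappears, which is the property the question asks for.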
