monotonically_increasing_id function gives the same values when used twice

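In a Databricks notebook, I use the monotonically_increasing_id function to create two columns.

Both columns have the same values: in every row, the value in COL1 equals the value in COL2.

Can you explain why monotonically_increasing_id behaves this way?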

azure-databricks

databricks-sql

1 Answer


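The monotonically_increasing_id function in Spark generates a unique 64-bit ID for each row. The IDs are guaranteed to be monotonically increasing and unique, but not consecutive. Spark treats the function as non-deterministic because its result depends on the DataFrame's partitioning: the partition IDs and the order of rows within each partition.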

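When you use monotonically_increasing_id to create two columns in the same operation or transformation stage, Spark computes both sets of IDs from the same partition index and the same row order within each partition. Each call counts the same rows in the same order, so every row receives the same ID value in both columns.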
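The behavior described above can be sketched in plain Python, without Spark. This is only a model, not Spark's actual code: it assumes the layout that the Spark documentation describes for the current implementation (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits). The names `simulate_ids` and `partition_sizes` are hypothetical and exist only for this illustration.

```python
# Pure-Python model of the documented ID layout; not Spark's implementation.
RECORD_BITS = 33  # assumption from the Spark docs: lower 33 bits hold the record number


def simulate_ids(partition_sizes):
    """Return one ID per row: partition index in the upper bits, row counter in the lower bits."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        # The record counter restarts at 0 in every partition.
        for record_number in range(size):
            ids.append((partition_index << RECORD_BITS) + record_number)
    return ids


# Hypothetical layout: two partitions with three rows each.
partition_sizes = [3, 3]

# Each "call" keeps its own counter, but both walk the same rows in the same order.
col1 = simulate_ids(partition_sizes)
col2 = simulate_ids(partition_sizes)

print(col1)          # [0, 1, 2, 8589934592, 8589934593, 8589934594]
print(col1 == col2)  # True
```

Nothing in the computation depends on which column is being filled, only on the partition index and the row's position, which is why the two columns match row by row.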

I tried the following:

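from pyspark.sql.functions import monotonically_increasing_id

# Load the sample diamonds dataset that ships with Databricks
# (`spark` is the session that a Databricks notebook provides)
df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    header=True,
                    inferSchema=True)

# Add two ID columns; both calls see the same partitions and row order
df = (df.withColumn("COL1", monotonically_increasing_id())
        .withColumn("COL2", monotonically_increasing_id()))
df.show()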

Result:

+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+----+----+
|_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|COL1|COL2|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+----+----+
|  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|   0|   0|
|  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|   1|   1|
|  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|   2|   2|
|  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|   3|   3|
|  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|   4|   4|
|  6| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|   5|   5|
|  7| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|   6|   6|
|  8| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|   7|   7|
|  9| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|   8|   8|
| 10| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|   9|   9|

In the code above, I added two columns of monotonically increasing IDs, and both contain the same value in every row.
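To read the generated values, it can help to split an ID back into its two parts. The sketch below again assumes the 31-bit/33-bit layout from the Spark documentation, which describes the current implementation and could change. The helper `decode_id` is hypothetical and not part of PySpark.

```python
RECORD_BITS = 33  # assumption from the Spark docs: lower 33 bits hold the record number


def decode_id(value):
    """Split an ID into (partition index, record number within that partition)."""
    partition_index = value >> RECORD_BITS
    record_number = value & ((1 << RECORD_BITS) - 1)
    return partition_index, record_number


print(decode_id(9))           # (0, 9): tenth row of the first partition
print(decode_id(8589934593))  # (1, 1): second row of the second partition
```

Under that assumption, the values 0 through 9 in the output above all decode to partition 0, so the rows shown come from the first partition.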
