How to randomize a different number for each row in a subset of rows in PySpark

Question

I have a PySpark DataFrame. For every row that meets a given condition, I need to assign a value picked at random from a list. I did this:

df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))

But the effect is that only one value is drawn, and that same value is assigned to all matching rows.

How can I randomize separately for each row?

Answer 1

You can:

- Use `rand` and `floor` from `pyspark.sql.functions` to create a random index column for indexing into your `my_list`
- Create a column in which the `my_list` values are repeated on every row
- Index into that column using `f.col`
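As a side check on the first step, here is a minimal plain-Python sketch of the index arithmetic, assuming (as with `f.rand()`) that the random draw is uniform in [0, 1). The helper name `to_index` is hypothetical and only for illustration; it is not part of PySpark.

```python
import math
import random

my_list = [1, 2, 30]
n = len(my_list)

def to_index(u, n):
    # Hypothetical helper: scale a draw u in [0, 1) to [0, n) and round down.
    return math.floor(u * n)

# Draws near the bucket boundaries land in the expected slots.
indices = [to_index(u, n) for u in (0.0, 0.33, 0.34, 0.66, 0.67, 0.999)]
print(indices)  # [0, 0, 1, 1, 2, 2]

# Random draws always yield a valid position in my_list.
sampled = [to_index(random.random(), n) for _ in range(1000)]
print(min(sampled) >= 0 and max(sampled) <= n - 1)  # True
```

Because the draw never reaches 1.0, the floored result never reaches `len(my_list)`, so the lookup stays in bounds.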

In PySpark, the full approach looks like this:

import pyspark.sql.functions as f

my_list = [1, 2, 30]
df = spark.createDataFrame(
    [
        (1, 0),
        (2, 1),
        (3, 1),
        (4, 0),
        (5, 1),
        (6, 1),
        (7, 0),
    ],
    ["id", "condition"]
)

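# Draw a random index into my_list for matching rows (null elsewhere),
# attach my_list to every row as an array column, then look up the element.
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list))))\
       .withColumn('my_list', f.array([f.lit(x) for x in my_list]))\
       .withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))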

df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index|   my_list|rand_value|
+---+---------+----------+----------+----------+
|  1|        0|      null|[1, 2, 30]|      null|
|  2|        1|         0|[1, 2, 30]|         1|
|  3|        1|         2|[1, 2, 30]|        30|
|  4|        0|      null|[1, 2, 30]|      null|
|  5|        1|         1|[1, 2, 30]|         2|
|  6|        1|         2|[1, 2, 30]|        30|
|  7|        0|      null|[1, 2, 30]|      null|
+---+---------+----------+----------+----------+
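
For context on why the original attempt gave every row the same value: `random.choice(my_list)` is an ordinary Python call, so it runs once while the expression is being built, and `f.when` only ever receives the single value it returned. The plain-Python sketch below is an analogy rather than Spark code; the names `rows`, `eager`, and `per_row` are made up for illustration.

```python
import random

my_list = [1, 2, 30]
# (id, condition) pairs mirroring the example DataFrame.
rows = [(1, 0), (2, 1), (3, 1), (4, 0), (5, 1), (6, 1), (7, 0)]

# Eager: the value is drawn once, before any row is examined.
# This mirrors passing random.choice(my_list) into f.when(...).
picked_once = random.choice(my_list)
eager = [picked_once if cond == 1 else None for _, cond in rows]

# Per-row: the draw happens inside the loop, once per matching row.
# This mirrors building the value from f.rand(), which Spark evaluates per row.
per_row = [random.choice(my_list) if cond == 1 else None for _, cond in rows]

print(eager)    # the same value repeated for every matching row
print(per_row)  # an independent draw for each matching row
```

Independent draws can still coincide by chance, especially with a short list, so seeing a repeated value in `rand_value` does not by itself mean the randomization is broken.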
