My DataFrame looks like this:
id gender age
1 m 27
2 m 39
3 f 99
4 f 11
5 m 46
6 f 60
I want my final DataFrame to look like this:
id gender age new_age
1 m 27 27
2 m 39 39
3 f 99 43
4 f 11 43
5 m 46 46
6 f 60 60
My code:
from pyspark.sql.functions import mean as _mean, stddev as _stddev, col, when
condition = (df['age'] >= 18) & (df['age'] <= 60)
df = df.withColumn("new_age", when(condition, col("age")).otherwise(_mean(col('age'))))
But I only want the mean of 27, 39, 46 and 60, not of the outliers. How can I do this in PySpark?
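To make the target concrete, the expected `new_age` values can be reproduced in plain Python. This is only a sketch of the arithmetic, not a Spark solution; the list `ages` is the sample column copied by hand, and the 18 to 60 bounds come from the condition above.

```python
ages = [27, 39, 99, 11, 46, 60]

# Ages inside the accepted range 18..60 (inclusive), as in the condition above.
in_range = [a for a in ages if 18 <= a <= 60]

# Mean of the in-range ages only: (27 + 39 + 46 + 60) / 4
mean_in_range = sum(in_range) / len(in_range)

# Replace out-of-range ages with that mean; keep the others unchanged.
new_age = [a if 18 <= a <= 60 else int(mean_in_range) for a in ages]

print(mean_in_range, new_age)
```

So the 43 in rows 3 and 4 is the mean of the four in-range ages, truncated to an integer.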
Here is one way you can do it:
from pyspark.sql import functions as F
# build a regex that matches the outlier values exactly (anchored, so 11 does not match inside 119)
outliers = [11, 99]
outliers_str = '^(' + '|'.join([str(i) for i in outliers]) + ')$'
# calculate mean without outlier values
mean_val = df.select("age").rdd.flatMap(lambda x: [i for i in x if i not in outliers]).mean()
# replace outlier values with the mean
df = df.withColumn('new_age', F.regexp_replace('age', outliers_str, f'{mean_val}').cast('int'))
+---+------+---+-------+
| id|gender|age|new_age|
+---+------+---+-------+
|  1|     m| 27|     27|
|  2|     m| 39|     39|
|  3|     f| 99|     43|
|  4|     f| 11|     43|
|  5|     m| 46|     46|
|  6|     f| 60|     60|
+---+------+---+-------+