Outlier handling using the mean in PySpark

Question

My DataFrame looks like this:

id       gender      age
1          m         27
2          m         39
3          f         99
4          f         11
5          m         46
6          f         60

I want my final DataFrame to look like this:

id       gender      age       new_age
1          m         27          27
2          m         39          39
3          f         99          43
4          f         11          43
5          m         46          46
6          f         60          60

My code:

from pyspark.sql.functions import mean as _mean, stddev as _stddev, col, when

# each comparison needs its own parentheses, because & binds tighter than >= and <=
condition = (df['age'] >= 18) & (df['age'] <= 60)
df = df.withColumn("new_age", when(condition, col("age")).otherwise(_mean(col('age'))))

But I only want the mean of 27, 39, 46 and 60 ... not of the outliers. How can I do this in PySpark?

pyspark pyspark-sql pyspark-dataframes
1 Answer

Here is one way you could do it:

from pyspark.sql import functions as F

# join the outlier values into a regex alternation, e.g. "11|99"
outliers = [11, 99]
outliers_str = '|'.join(str(i) for i in outliers)

# calculate the mean without the outlier values
mean_val = df.select("age").rdd.flatMap(lambda x: [i for i in x if i not in outliers]).mean()

# replace the outlier values with the mean; ^(...)$ anchors the pattern so that
# only whole values match (an age such as 119 is left untouched)
df = df.withColumn('new_age', F.regexp_replace(F.col('age').cast('string'), f'^({outliers_str})$', f'{mean_val}').cast('int'))

+---+------+---+-------+
| id|gender|age|new_age|
+---+------+---+-------+
|  1|     m| 27|     27|
|  2|     m| 39|     39|
|  3|     f| 99|     43|
|  4|     f| 11|     43|
|  5|     m| 46|     46|
|  6|     f| 60|     60|
+---+------+---+-------+