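In a Spark DataFrame, I want to get the mode (most common) value for each unique name. How can I do this? There does not seem to be a built-in mode function. Either a SparkR or a PySpark solution is fine.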
# Create DF
df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
DF <- createDataFrame(df)
name | value
-----------------
Thomas | 5
Thomas | 5
Thomas | 4
Bill | 3
Bill | 3
Bill | 9
# What I want to get
name | mode(value)
-----------------
Thomas | 5
Bill | 3
You can achieve this with a combination of .groupBy() and a window function, as follows:
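from pyspark.sql import Window
from pyspark.sql.functions import col, desc, row_number

# Count how often each (name, value) pair occurs
grouped = df.groupBy('name', 'value').count()
# Rank the values within each name by descending count
window = Window.partitionBy("name").orderBy(desc("count"))
grouped\
    .withColumn('order', row_number().over(window))\
    .where(col('order') == 1)\
    .show()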
Output:
+------+-----+-----+-----+
|  name|value|count|order|
+------+-----+-----+-----+
|  Bill|    3|    2|    1|
|Thomas|    5|    2|    1|
+------+-----+-----+-----+
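The count-then-rank logic behind the groupBy and window approach can be sketched in plain Python, which is handy for sanity-checking results on a small sample collected to the driver. This is only an illustration, not Spark code: the helper name mode_per_group is hypothetical, and the tie-breaking rule (smallest value wins) is an assumption added here, since row_number() over a window ordered only by count does not guarantee which of several tied values ranks first.

```python
from collections import Counter


def mode_per_group(rows):
    """Return {name: most frequent value} for an iterable of (name, value) pairs.

    Hypothetical helper for illustration; ties are broken by taking the
    smallest value (an assumption, not something Spark guarantees).
    """
    # Step 1: count each (name, value) pair, like groupBy('name', 'value').count()
    pair_counts = Counter(rows)

    # Step 2: per name, keep the pair that ranks first when sorting by
    # descending count and then ascending value
    best = {}
    for (name, value), count in pair_counts.items():
        rank_key = (-count, value)
        if name not in best or rank_key < best[name][0]:
            best[name] = (rank_key, value)

    return {name: value for name, (_, value) in best.items()}


rows = [("Thomas", 5), ("Thomas", 5), ("Thomas", 4),
        ("Bill", 3), ("Bill", 3), ("Bill", 9)]
print(mode_per_group(rows))  # {'Thomas': 5, 'Bill': 3}
```

In the PySpark version, a comparable deterministic tie-breaker would be a second sort key in the window, for example orderBy(desc("count"), "value").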
# SparkR equivalent of the steps above; df must be a SparkDataFrame
grouped <- agg(groupBy(df, 'name', 'value'), count=count(df$value))
window <- orderBy(windowPartitionBy("name"), desc(grouped$count))
dfmode <- withColumn(grouped, 'order', over(row_number(), window))
dfmode <- filter(dfmode, dfmode$order==1)
Spark 3.4+ has a built-in mode function.
Full PySpark example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('Thomas', 5),
     ('Thomas', 5),
     ('Thomas', 4),
     ('Bill', 3),
     ('Bill', 3),
     ('Bill', 9)],
    ['name', 'value'])
df.groupBy('name').agg(F.mode('value')).show()
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|          5|
# |  Bill|          3|
# +------+-----------+
Full SparkR example:
df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
df <- as.DataFrame(df)
df <- agg(groupBy(df, 'name'), expr("mode(value)"))
showDF(df)
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|        5.0|
# |  Bill|        3.0|
# +------+-----------+