Get the mode (most common) value of a Spark column using groupBy

Problem description · Votes: 0 · Answers: 3
I have a SparkR DataFrame and I want to get the mode (most common) value for each unique name. How can I do that? There doesn't seem to be a built-in mode function. Either a SparkR or a PySpark solution would work.

# Create DF
df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
DF <- createDataFrame(df)

name   | value
-----------------
Thomas | 5
Thomas | 5
Thomas | 4
Bill   | 3
Bill   | 3
Bill   | 9

# What I want to get
name   | mode(value)
---------------------
Thomas | 5
Bill   | 3
    
apache-spark pyspark apache-spark-sql mode sparkr
3 Answers

8 votes
You can achieve this with a combination of .groupBy() and a window function, as shown below:

from pyspark.sql import Window
from pyspark.sql.functions import col, desc, row_number

# Count occurrences of each (name, value) pair
grouped = df.groupBy('name', 'value').count()

# Rank the values within each name by descending count
window = Window.partitionBy("name").orderBy(desc("count"))

grouped\
    .withColumn('order', row_number().over(window))\
    .where(col('order') == 1)\
    .show()

Output:

+------+-----+-----+-----+
|  name|value|count|order|
+------+-----+-----+-----+
|  Bill|    3|    2|    1|
|Thomas|    5|    2|    1|
+------+-----+-----+-----+
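One caveat: row_number() breaks ties arbitrarily, so when two values are equally frequent within a name, only one of them (chosen non-deterministically) survives the filter. A minimal sketch of a variant that keeps all tied modes, assuming the same df as above, simply swaps row_number() for rank():

from pyspark.sql import Window
from pyspark.sql.functions import col, desc, rank

# rank() assigns the same rank to tied counts, so every value that
# ties for most frequent within a name gets order == 1 and is kept.
window = Window.partitionBy("name").orderBy(desc("count"))
all_modes = df.groupBy('name', 'value').count()\
    .withColumn('order', rank().over(window))\
    .where(col('order') == 1)
all_modes.show()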
    

0 votes

Here is the SparkR version of the solution:

# Count occurrences of each (name, value) pair
grouped <- agg(groupBy(df, 'name', 'value'), count = count(df$value))
# Rank values within each name by descending count
window <- orderBy(windowPartitionBy("name"), desc(grouped$count))
dfmode <- withColumn(grouped, 'order', over(row_number(), window))
dfmode <- filter(dfmode, dfmode$order == 1)
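For reference, since this follows the same logic on the same data, showDF(dfmode) should print the same rows as the PySpark output above:

+------+-----+-----+-----+
|  name|value|count|order|
+------+-----+-----+-----+
|  Bill|    3|    2|    1|
|Thomas|    5|    2|    1|
+------+-----+-----+-----+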
    

0 votes

Spark 3.4+ has a mode column function.

Full PySpark example:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('Thomas', 5), ('Thomas', 5), ('Thomas', 4),
     ('Bill', 3), ('Bill', 3), ('Bill', 9)],
    ['name', 'value'])

df.groupBy('name').agg(F.mode('value')).show()
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|          5|
# |  Bill|          3|
# +------+-----------+
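Note that F.mode picks among tied values non-deterministically. If I remember correctly, Spark 3.5 added an optional deterministic flag that makes tie-breaking reproducible; a minimal sketch, assuming Spark 3.5+ and the df above:

from pyspark.sql import functions as F

# deterministic=True (Spark 3.5+) makes the result reproducible when
# several values share the highest frequency.
df.groupBy('name').agg(F.mode('value', deterministic=True).alias('mode_value')).show()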

Full SparkR example:

df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
df <- as.DataFrame(df)
df <- agg(groupBy(df, 'name'), expr("mode(value)"))
showDF(df)
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|        5.0|
# |  Bill|        3.0|
# +------+-----------+
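The SparkR version works because mode is also a SQL aggregate function in Spark 3.4+, so the same query can be issued as plain Spark SQL from any API. A short sketch in PySpark, where the view name people is just an illustrative choice:

# Register the example DataFrame under a hypothetical view name
df.createOrReplaceTempView("people")
spark.sql("SELECT name, mode(value) AS mode_value FROM people GROUP BY name").show()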
    