Does avg() on a Dataset produce the most efficient RDD?


As far as I know, this is the most efficient way to compute an average in Spark: Spark : Average of values instead of sum in reduceByKey using Scala

My question is: if I use the high-level Dataset API, with groupBy followed by Spark's avg() function, do I end up with the same RDD? Can I trust Catalyst, or should I drop down to the low-level RDD API? In other words, can hand-written low-level code beat the Dataset version?

Example code:

employees
  .groupBy($"employee")
  .agg(
    avg($"salary").as("avg_salary")
  )

versus:

employees // assuming a pair RDD keyed by employee, e.g. RDD[(String, Employee)]
  .mapValues(employee => (employee.salary, 1)) // pair each salary with a count of 1
  .reduceByKey { case ((sumL, countL), (sumR, countR)) =>
    (sumL + sumR, countL + countR) // combine partial sums and counts
  }
  .mapValues { case (sum, count) => sum / count } // final average per key
scala apache-spark
1 Answer

println("done")调试,转到http://localhost:4040/stages/,您将获得结果。

val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("example")
  .getOrCreate()

import spark.implicits._
import org.apache.spark.sql.functions._

val employees = spark
  .createDataFrame(Seq(("employee1", 1000), ("employee2", 2000), ("employee3", 1500)))
  .toDF("employee", "salary")
// Spark functions
employees
  .groupBy("employee")
  .agg(
    avg($"salary").as("avg_salary")
  ).show()
// your low-level code
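// A sketch, assuming the same (employee, salary) data converted to a pair RDD
// so that reduceByKey is available:
employees
  .as[(String, Int)]
  .rdd
  .mapValues(salary => (salary.toDouble, 1)) // pair each salary with a count of 1
  .reduceByKey { case ((sumL, countL), (sumR, countR)) =>
    (sumL + sumR, countL + countR) // combine partial sums and counts per key
  }
  .mapValues { case (sum, count) => sum / count } // average per employee
  .collect()
  .foreach(println)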

println("done")