Get the names of the highest- and lowest-paid employees per department in PySpark

Problem description    Votes: 0    Answers: 1

I have the following scenario in PySpark; could someone help? This is my input:

inputData = [(100,"ABC",2000),(100,"XYZ",1000),(100,"CDE",750),(200,"GYT",1500),(200,"JHU",1200),(200,"GHT",1300),(200,"YTR",8000)]

inputSchema= "deptID int, empName String, empSal int"

df = spark.createDataFrame(inputData,inputSchema)

display(df)

I need the output below. Could someone help write the PySpark code?

+------+---------+---------+
|deptID|maxSalEmp|minSalEmp|
+------+---------+---------+
|   100|      ABC|      CDE|
|   200|      YTR|      JHU|
+------+---------+---------+
pyspark
1 Answer
0 votes

Never mind, I found a solution myself:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

inputData = [(100, "ABC", 2000), (100, "XYZ", 1000), (100, "CDE", 750),
             (200, "GYT", 1500), (200, "JHU", 1200), (200, "GHT", 1300), (200, "YTR", 8000)]

inputSchema = "deptID int, empName String, empSal int"

df = spark.createDataFrame(inputData, inputSchema)
# Rank employees within each department by salary, highest first
windowSpec = Window.partitionBy("deptID").orderBy(col("empSal").desc())

# Flag the top-ranked row per department, keep only those rows,
# and rename empName to maxSalEmp
output_df = (df.withColumn("maxSalEmp", row_number().over(windowSpec) == 1)
               .filter(col("maxSalEmp"))
               .select("deptID", "empName")
               .withColumnRenamed("empName", "maxSalEmp"))

# Repeat with salary ascending to find the lowest-paid employee per department
windowSpec = Window.partitionBy("deptID").orderBy(col("empSal").asc())

output_df = output_df.join(
    df.withColumn("minSalEmp", row_number().over(windowSpec) == 1)
      .filter(col("minSalEmp"))
      .select("deptID", "empName")
      .withColumnRenamed("empName", "minSalEmp"),
    on="deptID",
)

output_df.show()
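
For reference, here is a shorter alternative sketch that computes both columns in a single aggregation and avoids the two window sorts and the join. It assumes Spark 3.0+, where the max_by/min_by SQL functions are available.

from pyspark.sql import functions as F

# max_by/min_by return the empName associated with the max/min empSal per group
agg_df = df.groupBy("deptID").agg(
    F.expr("max_by(empName, empSal)").alias("maxSalEmp"),
    F.expr("min_by(empName, empSal)").alias("minSalEmp"),
)

agg_df.show()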