I have the following scenario in PySpark.. can someone help? I have this input:
inputData = [(100,"ABC",2000),(100,"XYZ",1000),(100,"CDE",750),(200,"GYT",1500),(200,"JHU",1200),(200,"GHT",1300),(200,"YTR",8000)]
inputSchema= "deptID int, empName String, empSal int"
df = spark.createDataFrame(inputData,inputSchema)
display(df)
I need the output below. Could someone help with the PySpark code?
--------------------------------
| deptID | maxSalEmp | minSalEmp |
--------------------------------
|  100   |    ABC    |    CDE    |
|  200   |    YTR    |    JHU    |
--------------------------------
No worries, I found a solution..
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

inputData = [(100, "ABC", 2000), (100, "XYZ", 1000), (100, "CDE", 750),
             (200, "GYT", 1500), (200, "JHU", 1200), (200, "GHT", 1300), (200, "YTR", 8000)]
inputSchema = "deptID int, empName string, empSal int"
df = spark.createDataFrame(inputData, inputSchema)

# Rank employees within each department by salary, highest first;
# row_number() == 1 picks the top earner per department.
windowSpec = Window.partitionBy("deptID").orderBy(col("empSal").desc())
max_df = (df.withColumn("rn", row_number().over(windowSpec))
            .filter(col("rn") == 1)
            .select("deptID", col("empName").alias("maxSalEmp")))

# Same partitioning, but lowest salary first, to pick the lowest earner.
windowSpec = Window.partitionBy("deptID").orderBy(col("empSal").asc())
min_df = (df.withColumn("rn", row_number().over(windowSpec))
            .filter(col("rn") == 1)
            .select("deptID", col("empName").alias("minSalEmp")))

output_df = max_df.join(min_df, on="deptID")
output_df.show()