I have the following DataFrame and, using PySpark, I am trying to get the answer below:
Pick | Drop | Fare | Tip | Drag |
---|---|---|---|---|
1 | 1 | 4.00 | 4.00 | 1.00 |
1 | 2 | 5.00 | 10.00 | 8.00 |
1 | 2 | 5.00 | 15.00 | 12.00 |
3 | 2 | 11.00 | 12.00 | 17.00 |
3 | 5 | 41.00 | 25.00 | 13.00 |
4 | 6 | 50.00 | 70.00 | 2.00 |
My query so far looks like this:
from pyspark.sql import functions as func
from pyspark.sql.functions import desc
df.groupBy('Pick', 'Drop') \
    .agg(
        func.sum('Fare').alias('FarePick'),
        func.sum('Tip').alias('TipPick'),
        func.avg('Drag').alias('AvgDragPick'),
        func.avg('Drag').alias('AvgDragDrop')) \
    .orderBy('Pick').show()
However, this doesn't seem right. I'm stuck on (4), because the groupBy doesn't look correct. Can anyone suggest a correction here?
I added your table data to a `data` variable and separated the four steps.
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
spark = SparkSession.builder \
.appName("testSession") \
.getOrCreate()
data = [
(1, 1, 4.00, 4.00, 1.00),
(1, 2, 5.00, 10.00, 8.00),
(1, 2, 5.00, 15.00, 12.00),
(3, 2, 11.00, 12.00, 17.00),
(3, 5, 41.00, 25.00, 13.00),
(4, 6, 50.00, 70.00, 2.00)
]
columns = ["Pick", "Drop", "Fare", "Tip", "Drag"]
df = spark.createDataFrame(data, columns)
# Steps 1, 2 and 3: aggregates grouped by Pick
df.groupBy('Pick').agg(
func.sum('Fare').alias('TotalFarePick'),
func.sum('Tip').alias('TotalTipPick'),
func.avg('Drag').alias('AvgDragPick')
).orderBy('Pick').show()
# Step 4: average Drag grouped by Drop
df.groupBy('Drop').agg(
func.avg('Drag').alias('AvgDragDrop')
).orderBy('Drop').show()
spark.stop()
Output of the two tables:
+----+-------------+------------+-----------+
|Pick|TotalFarePick|TotalTipPick|AvgDragPick|
+----+-------------+------------+-----------+
| 1| 14.0| 29.0| 7.0|
| 3| 52.0| 37.0| 15.0|
| 4| 50.0| 70.0| 2.0|
+----+-------------+------------+-----------+
+----+------------------+
|Drop| AvgDragDrop|
+----+------------------+
| 1| 1.0|
| 2|12.333333333333334|
| 5| 13.0|
| 6| 2.0|
+----+------------------+