Sample data:
+-----------+------------+---------+
|City       |Continent   |    Price|
+-----------+------------+---------+
|A          |Asia        |      100|
|B          |Asia        |      110|
|C          |Africa      |       60|
|D          |Europe      |      170|
|E          |Europe      |       90|
|F          |Africa      |      100|
+-----------+------------+---------+
For the second column, I know we can simply use
df.groupby("Continent").agg({'Price':'avg'})
But how do we compute the third column, i.e. for each continent, the average price over all cities that are not in that continent?
Expected output:
+-----------+--------------+----------------------------------------------+
|Continent  | Average Price|Average Price for cities not in this continent|
+-----------+--------------+----------------------------------------------+
|Asia       |           105|                                           105|
|Africa     |            80|                                         117.5|
|Europe     |           130|                                          92.5|
+-----------+--------------+----------------------------------------------+
>>> from pyspark.sql.functions import col,avg
>>> df.show()
+----+---------+-----+
|City|Continent|Price|
+----+---------+-----+
| A| Asia| 100|
| B| Asia| 110|
| C| Africa| 60|
| D| Europe| 170|
| E| Europe| 90|
| F| Africa| 100|
+----+---------+-----+
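>>> # Non-equi self-join: attach to each row the price of every city on a different continent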
>>> df1 = df.alias("a").join(df.alias("b"), col("a.Continent") != col("b.Continent"),"left").select(col("a.*"), col("b.price").alias("b_price"))
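>>> # Averaging b_price per continent then gives the average price of all cities outside that continent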
>>> df1.groupBy("Continent").agg(avg(col("Price")).alias("Average Price"), avg(col("b_price")).alias("Average Price for cities not in this continent")).show()
+---------+-------------+----------------------------------------------+
|Continent|Average Price|Average Price for cities not in this continent|
+---------+-------------+----------------------------------------------+
| Europe| 130.0| 92.5|
| Africa| 80.0| 117.5|
| Asia| 105.0| 105.0|
+---------+-------------+----------------------------------------------+
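As an alternative (just a sketch, assuming the same df as above), the complement average can be derived arithmetically from grand totals instead of the non-equi self-join, which avoids pairing every row with every row from the other continents:

from pyspark.sql import functions as F

# Grand total and row count over all cities
totals = df.agg(F.sum("Price").alias("total_sum"),
                F.count("Price").alias("total_cnt"))

# Per-continent average, sum and count
per_cont = df.groupBy("Continent").agg(F.avg("Price").alias("Average Price"),
                                        F.sum("Price").alias("cont_sum"),
                                        F.count("Price").alias("cont_cnt"))

# Complement average = (grand sum - continent sum) / (grand count - continent count)
result = (per_cont.crossJoin(totals)
          .withColumn("Average Price for cities not in this continent",
                      (F.col("total_sum") - F.col("cont_sum")) /
                      (F.col("total_cnt") - F.col("cont_cnt")))
          .select("Continent", "Average Price",
                  "Average Price for cities not in this continent"))

result.show()

On the sample data this yields the same three complement averages as above (105.0, 117.5 and 92.5).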