I am trying to get the top 5 items per district by rate_increase. I am trying to use spark.sql as follows:
Input:
district item rate_increase(%)
Arba coil 500
Arba pen -85
Arba hat 50
Cebu oil -40
Cebu pen 1100
Top5item = spark.sql('select district, item , rate_increase, ROW_NUMBER() OVER (PARTITION BY district ORDER BY rate_increase DESC) AS RowNum from rateTable where rate_increase > 0')
This works. How can I filter down to the top 5 items in the same statement? I tried the following — is there a better way to do this with spark.sql?
Top5item = spark.sql('select district, item from (select district, item, rate_increase, ROW_NUMBER() OVER (PARTITION BY district ORDER BY rate_increase DESC) AS RowNum from rateTable where rate_increase > 0) where RowNum <= 5 order by district')
Output:
district item rate_increase(%)
Arba coil 500
Arba hat 50
Cebu pen 1100
Thanks.
Keep in mind the logical order in which a SQL query is evaluated:
FROM/JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT
A where RowNum <= 5 clause at the same query level does not work, because RowNum is not yet defined when the WHERE clause is evaluated.
Try using a subquery block:
spark.sql("""
select district, item , `rate_increase(%)` from (
select row_number() over (partition by district order by `rate_increase(%)` desc) as RowNum, district,item, `rate_increase(%)` from ddf_1 where `rate_increase(%)` > 0 )
where RowNum <= 5 order by district, RowNum
""").show()
Output:
+--------+----+----------------+
|district|item|rate_increase(%)|
+--------+----+----------------+
| Arba|coil| 500|
| Arba| hat| 50|
| Cebu| pen| 1100|
+--------+----+----------------+
I also tried pandas as a simpler solution.
Top5item = df.sort_values('rate_increase(%)', ascending = False).groupby(['district']).head(5)
The ordering did not work for some districts.
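One possible reason the pandas version differs from the Spark output is the missing positive-rate filter. A sketch that filters first, then takes the top 5 per district (column names as in the table above; the sample DataFrame is hypothetical, rebuilt inline to keep the snippet self-contained):

```python
import pandas as pd

# Sample data matching the input table above
df = pd.DataFrame({
    "district": ["Arba", "Arba", "Arba", "Cebu", "Cebu"],
    "item": ["coil", "pen", "hat", "oil", "pen"],
    "rate_increase(%)": [500, -85, 50, -40, 1100],
})

# Keep only positive rates, sort descending, then take up to 5 rows per district
Top5item = (df[df["rate_increase(%)"] > 0]
            .sort_values("rate_increase(%)", ascending=False)
            .groupby("district", group_keys=False)
            .head(5)
            .sort_values("district"))
print(Top5item)
```

Because the sort is stable, the final sort by district preserves the descending rate order within each district, matching the Spark output.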