Hive: group by with an average and a most-frequent-value function

Question (votes: 1, answers: 1)

I have a table structured like this:

|---------------------|----------|-----------|
|    col_1            |  col_2   |   col_3   |
|---------------------|----------|-----------|
|  2018-01-15 17:56   | A        |   3       |
|---------------------|----------|-----------|
|  2018-01-15 17:56   | A        |   2       |
|---------------------|----------|-----------|
|  2018-10-23 23:43   | B        |   True    |
|---------------------|----------|-----------|
|  2018-10-23 23:43   | B        |   False   |
|---------------------|----------|-----------|
|  2018-10-23 23:43   | A        |    3      |
|---------------------|----------|-----------|
|  2018-10-23 23:43   | B        |    True   |
|---------------------|----------|-----------|

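I want to group by col_1 and, within each group, take the average of col_3 where col_2 is A and the most frequent value of col_3 where col_2 is B. The expected result is: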

|---------------------|----------|-----------|
|    col_1            |  A       |   B       |
|---------------------|----------|-----------|
|  2018-01-15 17:56   | 2.5      |   Null    |
|---------------------|----------|-----------|
|  2018-10-23 23:43   | 3        |   True    |
|---------------------|----------|-----------|

Leaving aside the most-frequent value for col_2 = 'B', I know I can compute the average like this:

select col_1,
       avg(case when col_2='A' then col_3 end) as A
from my_table
group by col_1

How do I add the most-frequent-value calculation for the rows where col_2 is B?

sql group-by hive hiveql
1 Answer
0 votes

Use analytic (window) functions; see the comments in the code:

with my_table as (
select stack(6,
'2018-01-15 17:56','A', '3'    ,
'2018-01-15 17:56','A', '2'    ,
'2018-10-23 23:43','B', 'True' ,
'2018-10-23 23:43','B', 'False',
'2018-10-23 23:43','A', '3'    ,
'2018-10-23 23:43','B', 'True' ) as (col_1 , col_2,  col_3)
)
select col_1, --final aggregation by col_1
       max(avg)           as A,
       max(most_frequent) as B
from(       
select col_1, col_2, col_3, cnt, --calculate avg and most_frequent
       case when col_2='A' then avg(col_3) over(partition by col_1, col_2) else null end as avg,
       case when col_2='B' then first_value(col_3) over(partition by col_1, col_2 order by cnt desc) else null end as most_frequent
  from
      (
      select  col_1, col_2, col_3, --calculate count
              case when col_2='B' then count(*) over(partition by col_1, col_2, col_3) else null end as cnt
        from my_table
      )s  
)s      
group by col_1      
;

Result:

col_1                   a       b
2018-01-15 17:56        2.5     NULL
2018-10-23 23:43        3.0     True
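
One caveat with the query above: when two col_3 values occur equally often in a group, first_value ordered only by cnt may return either of them. As an alternative, here is a minimal sketch that counts with a plain GROUP BY and then picks the top value with row_number(), breaking ties on col_3. The CTE names counted and ranked and the aliases cnt and rn are made up for this example and are not from the original answer; the choice of col_3 as the tie-breaker is also just an assumption.

```sql
-- Sketch only: my_table and its sample rows come from the question;
-- counted, ranked, cnt and rn are hypothetical names for this example.
with my_table as (
  select stack(6,
    '2018-01-15 17:56', 'A', '3',
    '2018-01-15 17:56', 'A', '2',
    '2018-10-23 23:43', 'B', 'True',
    '2018-10-23 23:43', 'B', 'False',
    '2018-10-23 23:43', 'A', '3',
    '2018-10-23 23:43', 'B', 'True') as (col_1, col_2, col_3)
),
counted as (
  -- one row per distinct (col_1, col_2, col_3) with its number of occurrences
  select col_1, col_2, col_3, count(*) as cnt
    from my_table
   group by col_1, col_2, col_3
),
ranked as (
  -- rank values inside each (col_1, col_2) group, most frequent first;
  -- ties are broken by col_3 so the pick is deterministic
  select col_1, col_2, col_3, cnt,
         row_number() over (partition by col_1, col_2
                            order by cnt desc, col_3) as rn
    from counted
)
select col_1,
       -- count-weighted average, equal to avg(col_3) over the original 'A' rows
       sum(case when col_2 = 'A' then cast(col_3 as double) * cnt end)
         / sum(case when col_2 = 'A' then cnt end)            as A,
       -- the top-ranked col_3 value among the 'B' rows
       max(case when col_2 = 'B' and rn = 1 then col_3 end)   as B
  from ranked
 group by col_1;
```

On the sample data this should give the same two rows as the result above. The weighted sum is needed because counted has already collapsed duplicate values, so a plain avg over it would ignore how often each value occurred.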