How to select records by keyword occurrence and maximum time

I am trying to select groups of records that contain certain key terms and, within each group, extract the row with the maximum time.

df1:

id1  id2  name                                 time
1     1    xxxLOAD_TIME                          1
1     1    xxxLOGIN_LOGIN_SESSION_TIMExxx        2
1     1    xxxxSome other timexxxx               3
2     2    xxSome other timex                    1
3     1    xxxLOAD_TIME                          1
3     1    xxSome other timexx                   2

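After creating b_flag, the table should look like this (column order does not matter). b_flag marks rows whose name contains LOGIN_SESSION_TIME or LOAD_TIME within an id1 + id2 group: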

id1  id2  name                             b_flag   time
1     1    xxxLOAD_TIME                      1       1
1     1    xxxLOGIN_LOGIN_SESSION_TIMExxx    1       2
1     1    xxxxSome other timexxxx                   3
2     2    xxSome other timex                        1
3     1    xxxLOAD_TIME                      1       1
3     1    xxSome other timexx                       2

Filter by b_flag (keep every id1 + id2 group that has at least one flagged row):

id1  id2  name                             b_flag   time
1     1    xxxLOAD_TIME                      1       1
1     1    xxxLOGIN_LOGIN_SESSION_TIMExxx    1       2
1     1    xxxxSome other timexxxx                   3
3     1    xxxLOAD_TIME                      1       1
3     1    xxSome other timexx                       2

Desired output (filtered by maximum time):

   id1  id2  name                             b_flag   time
    1     1    xxxxSome other timexxxx                  3
    3     1    xxSome other timexx                      2

Here is the code I tried:

create table num1 as
select * 
   from (
   select t.*, sum(b_flag) over(partition by id1,id2) as sum_b_flag,
   max(time) over (partition by id1,id2) max_time,
   ROW_NUMBER() OVER (PARTITION BY id1,id2) as rn /*ensure no duplicates*/
   from (
        select
           t.*,
           case when name LIKE '%LOAD_TIME' or name LIKE '%LOGIN_SESSION_TIME' then 1 end b_flag
        from df1 as t
        ) t
) t

where sum_b_flag > 0 AND name like '%TIME' AND time = max_time AND t.rn = 1

This code produces the following error, which may indicate that it ran out of memory:

Error: Execution error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 2, vertexId=vertex_1581665816621_0012_183_01, diagnostics=[Task failed, taskId=task_1581665816621_0012_183_01_000006, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1581665816621_0012_183_01_000006_0:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at ......

sql hadoop hive hiveql cloudera
1 Answer
select id1,id2,name,b_flag,time
from (
        select
           t.*,
           max(case when name LIKE '%LOAD_TIME%' or name LIKE '%LOGIN_SESSION_TIME%' then 1 end) over (PARTITION BY id1,id2)  b_flag,
           RANK() OVER (PARTITION BY id1,id2 order by time desc) as rn
        from df1 as t
        ) t
where rn=1 and b_flag=1  --latest time and LOAD_TIME or LOGIN_SESSION_TIME appeared in id1, id2 partition
;
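
One visible difference from the query in the question is the trailing % in the LIKE patterns. Without it, a pattern such as '%LOGIN_SESSION_TIME' only matches names that end with the keyword, so a name like xxxLOGIN_LOGIN_SESSION_TIMExxx is not flagged. A small sketch to check this, assuming a Hive version that allows SELECT without a FROM clause; the column aliases are made up for illustration:

```sql
-- Compare the two pattern styles against a name taken from df1.
-- Note: in LIKE, _ matches any single character and % matches any sequence.
select
   'xxxLOGIN_LOGIN_SESSION_TIMExxx' LIKE '%LOGIN_SESSION_TIME'  as ends_with_keyword,  -- false: the name ends in 'xxx'
   'xxxLOGIN_LOGIN_SESSION_TIMExxx' LIKE '%LOGIN_SESSION_TIME%' as contains_keyword    -- true: the keyword appears inside the name
;
```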

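If you want only one record per group at the maximum time, use row_number() instead of rank(). rank() assigns 1 to every record in the id1, id2 partition that shares the maximum time.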
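
The difference only shows up when two rows in a group tie on the maximum time. A minimal sketch of the row_number() variant, assuming the same df1 table as above; the secondary sort key name is an assumption added here only to make the choice among tied rows deterministic:

```sql
-- Sketch: return exactly one row per (id1, id2) group, even if several rows tie on the max time.
select id1, id2, name, b_flag, time
from (
        select
           t.*,
           -- 1 if any row in the group contains a keyword, otherwise NULL
           max(case when name LIKE '%LOAD_TIME%' or name LIKE '%LOGIN_SESSION_TIME%' then 1 end)
               over (partition by id1, id2) as b_flag,
           -- name is an assumed tie-breaker; any column that orders the tied rows would do
           row_number() over (partition by id1, id2 order by time desc, name) as rn
        from df1 as t
        ) t
where rn = 1 and b_flag = 1
;
```

With rank(), two rows that share the maximum time both receive 1 and both are returned; row_number() numbers them 1 and 2, so only one of them passes the rn = 1 filter.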
