I have a dataset that I want to deduplicate using the following code:
select session_id, sol_id, id, session_context_code
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1
order by session_id, sol_id, date
I want to add a variable that stores the total number of rows after deduplication, so I tried using count(*):
select session_id, sol_id, id, session_context_code, count(*) as total
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1
order by session_id, sol_id, date
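The error I get:
Error: Execution error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 1:44 Expression not in GROUP BY key 'session_id'
I just want to output the count as a variable that counts all the distinct records by session_id and sol_id after deduplicating on the row number. How can I incorporate this into my code?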
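A Hive query that has COUNT(*) alongside other columns in the SELECT clause must list those columns in a GROUP BY clause, which goes after WHERE and before any ORDER BY.
Some examples: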
SELECT COUNT(*) FROM employees;
SELECT id, name, COUNT(*) FROM employees GROUP BY id, name;
For your question, the query should look like this:
select session_id, sol_id, id, session_context_code, count(*) as total
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1
-- GROUP BY must come before ORDER BY; date is not a grouping key, so it is dropped from the sort
GROUP BY session_id, sol_id, id, session_context_code
order by session_id, sol_id
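Note that grouping this way returns one count per (session_id, sol_id, id, session_context_code) group, not a single total. If what you are after is the total number of rows left after deduplication, repeated on every row, a windowed count may be a better fit. The following is only a sketch under that assumption: the table and column names are taken from the question, and the aliases total and total_per_pair are made up for illustration.

```sql
-- Sketch only: assumes the goal is a count attached to every deduplicated row.
-- Window functions in the outer SELECT are evaluated after the WHERE filter,
-- so both counts see only the rows with rn = 1.
select session_id, sol_id, id, session_context_code,
       count(*) over () as total,                                          -- all rows kept after dedup
       count(*) over (partition by session_id, sol_id) as total_per_pair   -- rows kept per (session_id, sol_id)
from (
    select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
           substr(case_id,2,9) as id
    from df.t1_data
) undup
where undup.rn = 1
order by session_id, sol_id, date
```

Because no GROUP BY is involved, the non-aggregated columns in the SELECT list should not trigger error 10025. Keep whichever of the two count columns matches what you need and drop the other.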
You can read more HERE
Hope this helps!