How do I remove duplicate data from a partitioned Hive table?


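I need to remove the duplicate rows in the partitions between 2023-03-26 and 2023-07-10. I tried the following command to deduplicate the table, but it produced an error.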

Command:

set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE db.table_name PARTITION(dt)
select distinct * from db.table_name
where dt >= '2023-03-26' AND dt >= '2023-07-10';

Error:

23/07/26 16:07:46 [LocalJobRunner Map Task Executor #0]: WARN io.CombineHiveRecordReader: Multiple partitions found; not going to pass a part spec to LLAP IO: {{dt=2023-07-10}} and {{dt=2023-07-11}}
2023-07-26 16:07:47,952 Stage-1 map = 0%, reduce = 0%
23/07/26 16:07:47 [aabca681-0714-44f6-bc8d-9be6d7fca9fc main]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead.

Note: the table is partitioned by date only. Example:

show partitions db.table_name;
dt=2023-07-04
dt=2023-07-05
dt=2023-07-06
dt=2023-07-07
dt=2023-07-08
dt=2023-07-09
dt=2023-07-10
$ hive --version
Hive 2.3.3

Any advice would be appreciated. Thanks!

hadoop hive duplicates
1 Answer

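Is this what you mean? Your WHERE clause uses dt >= '2023-07-10' as the upper bound, so it selects every partition from 2023-07-10 onward (the log shows dt=2023-07-10 and dt=2023-07-11 being read) rather than the range you intended. BETWEEN bounds the range on both ends. Because PARTITION(dt) is fully dynamic, keep the set hive.exec.dynamic.partition.mode=nonstrict; line from your command.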

INSERT OVERWRITE TABLE db.table_name PARTITION(dt)
SELECT DISTINCT *
FROM db.table_name
WHERE dt BETWEEN '2023-03-26' AND '2023-07-10';
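
Before running the overwrite, it may help to check how many duplicates each partition actually holds. The following is a minimal sketch, assuming the same db.table_name and dt partition column as in the question; the aliases total_rows, distinct_rows, and t are illustrative names only.

```sql
-- Total rows per partition in the target range.
SELECT dt, COUNT(*) AS total_rows
FROM db.table_name
WHERE dt BETWEEN '2023-03-26' AND '2023-07-10'
GROUP BY dt;

-- Distinct rows per partition in the same range.
-- A partition whose distinct_rows is lower than its total_rows contains duplicates.
SELECT dt, COUNT(*) AS distinct_rows
FROM (
  SELECT DISTINCT *
  FROM db.table_name
  WHERE dt BETWEEN '2023-03-26' AND '2023-07-10'
) t
GROUP BY dt;
```

Running the same two queries after the INSERT OVERWRITE should give matching counts for every partition in the range.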