How can I delete duplicate rows in MySQL based on a priority and the highest end date?


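I have a database of roughly 850k rows containing a list of customers with dates and reference numbers. The data was consolidated from multiple CSV files, so the database contains a lot of duplicates, and I am now trying to delete them according to a set of rules. The one thing that is consistent across the data set is the reference number: every customer has a unique reference number, no matter how many times their record was added to the database.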

I put together a sample data set to see whether I could build the logic. Here are my create and insert statements:

CREATE TABLE `rules_sample` (
 `id` int(11) NOT NULL DEFAULT 0,
 `Name` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
 `Start_Date` date DEFAULT NULL,
 `End_Date` date DEFAULT NULL,
 `Ref_No` mediumtext CHARACTER SET utf8 DEFAULT NULL,
 `Filename` varchar(255) CHARACTER SET utf8 DEFAULT NULL
);


INSERT INTO `rules_sample` (`id`, `Name`, `Start_Date`, `End_Date`, `Ref_No`, `Filename`) VALUES
(172251, 'Mr. Humpty Dumpty', '2018-01-01', '2018-01-30', '110001239', 'Unknown'),
(956757, 'Humpty Dumpty', '2018-02-01', '2019-02-01', '110001239', 'Main 1'),
(957765, 'Humpty Dumpty', '2017-02-01', '2018-02-01', '110001239', 'Main 1'),
(958415, 'Humpty Dumpty', '2016-02-01', '2017-01-31', '110001239', 'Main 1'),
(958635, 'Mr Humpty Dumpty', '2014-11-13', '2015-11-13', '110001239', 'Main 1'),
(1104524, 'Mr. Humpty Dumpty', '2018-01-30', '2017-08-03', '110001239', 'Unknown'),
(1104662, 'Humpty Dumpty', '2018-02-01', '2017-08-05', '110001239', 'Unknown'),
(1114207, 'Humpty Dumpty', '2017-02-01', '2018-02-01', '110001239', 'Unknown'),
(1114504, 'Mr Humpty Dumpty', '2014-11-13', '2015-11-13', '110001239', 'Unknown'),
(1348915, 'Mr. Humpty Dumpty', '2018-01-30', '2019-01-30', '110001239', 'Other_Data'),
(92625, 'Mickey Mouse', '2018-08-09', '2018-08-08', '110003936', 'Unknown'),
(93713, 'Mr&Mrs M Mouse', '2017-06-23', '2019-06-22', '110003936', 'Unknown'),
(94978, 'Mickey Mouse', '2018-08-09', '2020-08-08', '110003936', 'Unknown'),
(847136, 'Mickey Mouse', '2020-08-08', '2020-08-08', '110003936', 'Data'),
(847193, 'Mickey Mouse', '2018-08-08', '2018-08-08', '110003936', 'Data'),
(847379, 'Mr&Mrs M Mouse', '2019-06-22', '2019-06-22', '110003936', 'Data'),
(858126, 'Mr&Mrs M Mouse', '2019-08-08', '2019-08-08', '110003936', 'Data'),
(1288753, 'Mr&Mrs M Mouse', '2018-06-22', '2019-06-22', '110003936', 'ABC Services'),
(930743, '.', '2020-08-14', '2020-08-14', '116000074', 'ABC Services'),
(930980, '.', '2020-07-22', '2020-07-22', '116000074', 'ABC Services'),
(931226, '.', '2020-06-30', '2020-06-30', '116000074', 'ABC Services'),
(931804, '.', '2020-05-13', '2020-05-13', '116000074', 'ABC Services'),
(932008, '.', '2020-05-03', '2020-05-03', '116000074', 'ABC Services'),
(932230, '.', '2020-04-26', '2020-04-26', '116000074', 'ABC Services'),
(932644, '.', '2020-04-10', '2020-04-10', '116000074', 'ABC Services'),
(933416, '.', '2020-03-17', '2020-03-17', '116000074', 'ABC Services'),
(933591, '.', '2020-03-08', '2020-03-08', '116000074', 'ABC Services'),
(933887, '.', '2020-02-27', '2020-02-27', '116000074', 'ABC Services'),
(934965, '.', '2020-01-21', '2020-01-21', '116000074', 'ABC Services');

My rules are based on two factors: 1) the highest End_Date, and 2) a priority list based on Filename (shown below).

Priority | Filename
1 | Main 1
2 | Data
3 | ABC Services
4 | Other_Data
5 | Unknown

The first step I took was to write the following code to rank the rows by end date:

SELECT 
    T1.*,
    -- RANK is a reserved word in MySQL 8.0.2 and later, so the alias is quoted
    ROW_NUMBER() OVER (PARTITION BY Ref_No ORDER BY End_Date DESC) AS `rank`
FROM 
    rules_sample T1
    ;

This gives me the following output:

id  Name    Start_Date  End_Date    Ref_No    Filename rank
956757  Humpty Dumpty   2018-02-01  2019-02-01  110001239   Main 1 1
1348915 Mr. Humpty Dumpty   2018-01-30  2019-01-30  110001239   Other_Data 2
957765  Humpty Dumpty   2017-02-01  2018-02-01  110001239   Main 1 3
1114207 Humpty Dumpty   2017-02-01  2018-02-01  110001239   Unknown 4
172251  Mr. Humpty Dumpty   2018-01-01  2018-01-30  110001239   Unknown 5
1104662 Humpty Dumpty   2018-02-01  2017-08-05  110001239   Unknown 6
1104524 Mr. Humpty Dumpty   2018-01-30  2017-08-03  110001239   Unknown 7
958415  Humpty Dumpty   2016-02-01  2017-01-31  110001239   Main 1 8
958635  Mr Humpty Dumpty    2014-11-13  2015-11-13  110001239   Main 1 9
1114504 Mr Humpty Dumpty    2014-11-13  2015-11-13  110001239   Unknown 10
94978   Mickey Mouse    2018-08-09  2020-08-08  110003936   Unknown 1
847136  Mickey Mouse    2020-08-08  2020-08-08  110003936   Data 2
858126  Mr&Mrs M Mouse  2019-08-08  2019-08-08  110003936   Data 3
93713   Mr&Mrs M Mouse  2017-06-23  2019-06-22  110003936   Unknown 4
847379  Mr&Mrs M Mouse  2019-06-22  2019-06-22  110003936   Data 5
1288753 Mr&Mrs M Mouse  2018-06-22  2019-06-22  110003936   ABC Services 6
92625   Mickey Mouse    2018-08-09  2018-08-08  110003936   Unknown 7
847193  Mickey Mouse    2018-08-08  2018-08-08  110003936   Data 8
930743  .   2020-08-14  2020-08-14  116000074   ABC Services 1
930980  .   2020-07-22  2020-07-22  116000074   ABC Services 2
931226  .   2020-06-30  2020-06-30  116000074   ABC Services 3
931804  .   2020-05-13  2020-05-13  116000074   ABC Services 4
932008  .   2020-05-03  2020-05-03  116000074   ABC Services 5
932230  .   2020-04-26  2020-04-26  116000074   ABC Services 6
932644  .   2020-04-10  2020-04-10  116000074   ABC Services 7
933416  .   2020-03-17  2020-03-17  116000074   ABC Services 8
933591  .   2020-03-08  2020-03-08  116000074   ABC Services 9
933887  .   2020-02-27  2020-02-27  116000074   ABC Services 10
934965  .   2020-01-21  2020-01-21  116000074   ABC Services 11

What I am struggling with now is how to include the priority list in my code. My final output should look like this:

id  Name    Start_Date  End_Date    Ref_No    Filename
956757  Humpty Dumpty   2018-02-01  2019-02-01  110001239   Main 1
847136  Mickey Mouse    2020-08-08  2020-08-08  110003936   Data
930743  .       2020-08-14  2020-08-14  116000074   ABC Services

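Here is a breakdown of how the final output is worked out:

For 110001239 it is straightforward: the row with the rank 1 End_Date has the filename Main 1, which has priority 1 in the list, so the rest of the rows should be deleted.

110003936 is a bit trickier. The row with the highest End_Date has the filename Unknown, which has priority 5. The second-ranked row has the same End_Date as the first, and its filename is Data, which has priority 2, so the rank 2 row should be kept and the rest deleted.

116000074 is straightforward: the row with the rank 1 End_Date should be kept, because all of its records share a single filename.

One of the key things to note is that End_Date overrides the filename priority. My code needs to account for the fact that a single Ref_No may have records from every filename, but the choice is driven by the highest end date first.

I hope this all makes sense.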


mysql sql case sql-delete window-functions
1 Answer

If I followed your explanation correctly, you can use the string function FIELD() as a second sort criterion in the window function.
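The query that originally accompanied this answer is not preserved here, so the following is only a sketch of the idea, not the answer's exact code. It assumes MySQL 8.0 or later (for window functions) and that the five filenames in the question's priority list are the only ones present. FIELD() returns the 1-based position of its first argument in the list that follows, so sorting on it ascending puts Main 1 ahead of Data, and so on. The alias rn and the final tie-break on id are my own additions for illustration.

```sql
-- Keep one row per Ref_No: highest End_Date first, then filename priority.
-- Assumption: every Filename appears in the list below. FIELD() returns 0
-- for a value that is not listed, which would sort ahead of 'Main 1'.
SELECT id, Name, Start_Date, End_Date, Ref_No, Filename
FROM (
    SELECT
        t.*,
        ROW_NUMBER() OVER (
            PARTITION BY Ref_No
            ORDER BY
                End_Date DESC,
                FIELD(Filename, 'Main 1', 'Data', 'ABC Services', 'Other_Data', 'Unknown'),
                id  -- assumed tie-break so the result is deterministic
        ) AS rn
    FROM rules_sample t
) ranked
WHERE rn = 1;
```

On the sample data this should pick ids 956757, 847136 and 930743, matching the expected output in the question. To actually remove the other rows, one possible approach is to join the table to the same ranking and delete everything with rn > 1. This assumes id is unique per row, which the sample table does not enforce, so it is worth running the SELECT above first and trying the delete inside a transaction:

```sql
-- Delete every row that is not ranked first within its Ref_No.
-- Assumption: id uniquely identifies a row in rules_sample.
DELETE t
FROM rules_sample t
JOIN (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY Ref_No
            ORDER BY
                End_Date DESC,
                FIELD(Filename, 'Main 1', 'Data', 'ABC Services', 'Other_Data', 'Unknown'),
                id
        ) AS rn
    FROM rules_sample
) r ON r.id = t.id
WHERE r.rn > 1;
```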
