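I have a database table of roughly 850k rows containing a list of customers with dates and reference numbers. The data was aggregated from many CSV files, so the table contains a lot of duplicates, and I am now trying to remove them according to a set of rules. The one consistent thing in the dataset is the reference number: each customer has a unique reference number, no matter how many times their record was added to the database.
I have put together a sample dataset to see whether I can build the logic. Here are my create and insert statements: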
CREATE TABLE `rules_sample` (
`id` int(11) NOT NULL DEFAULT 0,
`Name` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
`Start_Date` date DEFAULT NULL,
`End_Date` date DEFAULT NULL,
`Ref_No` mediumtext CHARACTER SET utf8 DEFAULT NULL,
`Filename` varchar(255) CHARACTER SET utf8 DEFAULT NULL
);
INSERT INTO `rules_sample` (`id`, `Name`, `Start_Date`, `End_Date`, `Ref_No`, `Filename`) VALUES
(172251, 'Mr. Humpty Dumpty', '2018-01-01', '2018-01-30', '110001239', 'Unknown'),
(956757, 'Humpty Dumpty', '2018-02-01', '2019-02-01', '110001239', 'Main 1'),
(957765, 'Humpty Dumpty', '2017-02-01', '2018-02-01', '110001239', 'Main 1'),
(958415, 'Humpty Dumpty', '2016-02-01', '2017-01-31', '110001239', 'Main 1'),
(958635, 'Mr Humpty Dumpty', '2014-11-13', '2015-11-13', '110001239', 'Main 1'),
(1104524, 'Mr. Humpty Dumpty', '2018-01-30', '2017-08-03', '110001239', 'Unknown'),
(1104662, 'Humpty Dumpty', '2018-02-01', '2017-08-05', '110001239', 'Unknown'),
(1114207, 'Humpty Dumpty', '2017-02-01', '2018-02-01', '110001239', 'Unknown'),
(1114504, 'Mr Humpty Dumpty', '2014-11-13', '2015-11-13', '110001239', 'Unknown'),
(1348915, 'Mr. Humpty Dumpty', '2018-01-30', '2019-01-30', '110001239', 'Other_Data'),
(92625, 'Mickey Mouse', '2018-08-09', '2018-08-08', '110003936', 'Unknown'),
(93713, 'Mr&Mrs M Mouse', '2017-06-23', '2019-06-22', '110003936', 'Unknown'),
(94978, 'Mickey Mouse', '2018-08-09', '2020-08-08', '110003936', 'Unknown'),
(847136, 'Mickey Mouse', '2020-08-08', '2020-08-08', '110003936', 'Data'),
(847193, 'Mickey Mouse', '2018-08-08', '2018-08-08', '110003936', 'Data'),
(847379, 'Mr&Mrs M Mouse', '2019-06-22', '2019-06-22', '110003936', 'Data'),
(858126, 'Mr&Mrs M Mouse', '2019-08-08', '2019-08-08', '110003936', 'Data'),
(1288753, 'Mr&Mrs M Mouse', '2018-06-22', '2019-06-22', '110003936', 'ABC Services'),
(930743, '.', '2020-08-14', '2020-08-14', '116000074', 'ABC Services'),
(930980, '.', '2020-07-22', '2020-07-22', '116000074', 'ABC Services'),
(931226, '.', '2020-06-30', '2020-06-30', '116000074', 'ABC Services'),
(931804, '.', '2020-05-13', '2020-05-13', '116000074', 'ABC Services'),
(932008, '.', '2020-05-03', '2020-05-03', '116000074', 'ABC Services'),
(932230, '.', '2020-04-26', '2020-04-26', '116000074', 'ABC Services'),
(932644, '.', '2020-04-10', '2020-04-10', '116000074', 'ABC Services'),
(933416, '.', '2020-03-17', '2020-03-17', '116000074', 'ABC Services'),
(933591, '.', '2020-03-08', '2020-03-08', '116000074', 'ABC Services'),
(933887, '.', '2020-02-27', '2020-02-27', '116000074', 'ABC Services'),
(934965, '.', '2020-01-21', '2020-01-21', '116000074', 'ABC Services');
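My rules are based on two factors: 1) the highest end date, and 2) a priority list based on the filename (shown below).
Priority | Filename
1 | Main 1
2 | Data
3 | ABC Services
4 | Other_Data
5 | Unknown
As a first step, I wrote the following query to rank rows by end date: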
SELECT T1.*, Row_Number() OVER (PARTITION BY Ref_No ORDER BY End_Date DESC) rank FROM rules_sample T1 ;
This gives me the following output:
id | Name | Start_Date | End_Date | Mpan_MPR | Data_Source | rank
956757 | Humpty Dumpty | 2018-02-01 | 2019-02-01 | 110001239 | Main 1 | 1
1348915 | Mr. Humpty Dumpty | 2018-01-30 | 2019-01-30 | 110001239 | Other_Data | 2
957765 | Humpty Dumpty | 2017-02-01 | 2018-02-01 | 110001239 | Main 1 | 3
1114207 | Humpty Dumpty | 2017-02-01 | 2018-02-01 | 110001239 | Unknown | 4
172251 | Mr. Humpty Dumpty | 2018-01-01 | 2018-01-30 | 110001239 | Unknown | 5
1104662 | Humpty Dumpty | 2018-02-01 | 2017-08-05 | 110001239 | Unknown | 6
1104524 | Mr. Humpty Dumpty | 2018-01-30 | 2017-08-03 | 110001239 | Unknown | 7
958415 | Humpty Dumpty | 2016-02-01 | 2017-01-31 | 110001239 | Main 1 | 8
958635 | Mr Humpty Dumpty | 2014-11-13 | 2015-11-13 | 110001239 | Main | 9
1114504 | Mr Humpty Dumpty | 2014-11-13 | 2015-11-13 | 110001239 | Unknown | 10
94978 | Mickey Mouse | 2018-08-09 | 2020-08-08 | 110003936 | Unknown | 1
847136 | Mickey Mouse | 2020-08-08 | 2020-08-08 | 110003936 | Data | 2
858126 | Mr&Mrs M Mouse | 2019-08-08 | 2019-08-08 | 110003936 | Data | 3
93713 | Mr&Mrs M Mouse | 2017-06-23 | 2019-06-22 | 110003936 | Unknown | 4
847379 | Mr&Mrs M Mouse | 2019-06-22 | 2019-06-22 | 110003936 | Data | 5
1288753 | Mr&Mrs M Mouse | 2018-06-22 | 2019-06-22 | 110003936 | ABC Services | 6
92625 | Mickey Mouse | 2018-08-09 | 2018-08-08 | 110003936 | Unknown | 7
847193 | Mickey Mouse | 2018-08-08 | 2018-08-08 | 110003936 | Data | 8
930743 | . | 2020-08-14 | 2020-08-14 | 116000074 | ABC Services | 1
930980 | . | 2020-07-22 | 2020-07-22 | 116000074 | ABC Services | 2
931226 | . | 2020-06-30 | 2020-06-30 | 116000074 | ABC Services | 3
931804 | . | 2020-05-13 | 2020-05-13 | 116000074 | ABC Services | 4
932008 | . | 2020-05-03 | 2020-05-03 | 116000074 | ABC Services | 5
932230 | . | 2020-04-26 | 2020-04-26 | 116000074 | ABC Services | 6
932644 | . | 2020-04-10 | 2020-04-10 | 116000074 | ABC Services | 7
933416 | . | 2020-03-17 | 2020-03-17 | 116000074 | ABC Services | 8
933591 | . | 2020-03-08 | 2020-03-08 | 116000074 | ABC Services | 9
933887 | . | 2020-02-27 | 2020-02-27 | 116000074 | ABC Services | 10
934965 | . | 2020-01-21 | 2020-01-21 | 116000074 | ABC Services | 11
What I am struggling with is how to bring the priority list into my query. My final output should look like this:
id | Name | Start_Date | End_Date | Mpan_MPR | Data_Source
956757 | Humpty Dumpty | 2018-02-01 | 2019-02-01 | 110001239 | Main 1
847136 | Mickey Mouse | 2020-08-08 | 2020-08-08 | 110003936 | Data
930743 | . | 2020-08-14 | 2020-08-14 | 116000074 | ABC Services
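Here is a breakdown of how the final output is worked out:
110001239 is straightforward: the row with End_Date rank 1 has the filename Main 1, which is priority 1 in the list, so all the other rows should be deleted.
110003936 is a bit trickier. The row with the highest End_Date has the filename Unknown, which is priority 5. The row ranked second has the same End_Date as the first, and its filename is Data, which is priority 2. So the rank 2 row should be kept and the rest deleted.
116000074 is simple: the row with End_Date rank 1 should be kept, because all of its records come from a single filename.
One key point is that End_Date takes precedence over filename priority. My query needs to handle the case where a single Ref_No has records from every filename, but the choice must be driven by the highest end date first, with the filename priority used only to break ties.
I hope this all makes sense.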
If I followed you correctly, you can use the string function field() inside the window function as a second sorting criterion.
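The query that originally followed is not preserved here, so what follows is only a sketch of how the idea could look, assuming MySQL 8.0+ or MariaDB 10.2+ (both support window functions and `FIELD()`) and the `rules_sample` table from the question. The alias `rn` and the derived-table name `ranked` are illustrative choices, not names from the original post.

```sql
-- Sketch only: keep one row per Ref_No, preferring the latest End_Date and,
-- when End_Date ties, the filename that appears earliest in the priority list.
SELECT id, Name, Start_Date, End_Date, Ref_No, Filename
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY Ref_No
               ORDER BY End_Date DESC,
                        -- FIELD() returns the 1-based position of Filename in the list
                        FIELD(Filename, 'Main 1', 'Data', 'ABC Services', 'Other_Data', 'Unknown')
           ) AS rn  -- "rn" instead of "rank", which is a reserved word in MySQL 8.0
    FROM rules_sample t
) ranked
WHERE rn = 1;
```

On the sample data this keeps ids 956757, 847136 and 930743, which matches the expected output in the question. Two caveats: `FIELD()` returns 0 for a filename that is not in the list, so such rows would sort ahead of 'Main 1' unless they are handled separately, and rows that tie on both End_Date and Filename are ordered arbitrarily unless another column such as `id` is added to the ORDER BY. To delete the duplicates rather than just select the survivors, the same ranking could be used to pick out the ids where `rn > 1`.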