SELECT a.*
FROM MRSVoid.dbo.Customer_Dataset$ a
CROSS JOIN
(SELECT
[Customer_LastName]
,[Customer_FirstName]
,[Customer_AddressLine1]
,[Customer_HomePhone]
,[Customer_InternetEmail]
FROM MRSVoid.dbo.Customer_Dataset$
GROUP BY [Customer_LastName],
[Customer_FirstName],
[Customer_AddressLine1],
[Customer_InternetEmail],
[Customer_HomePhone]
HAVING count(*) > 1) b
where ((a.Customer_LastName = b.Customer_LastName) OR (a.Customer_LastName is NULL AND b.Customer_LastName is NULL))
AND ((a.Customer_FirstName = b.Customer_FirstName) OR (a.Customer_FirstName is NULL AND b.Customer_FirstName is NULL))
AND ((a.Customer_AddressLine1 = b.Customer_AddressLine1) OR (a.Customer_AddressLine1 is NULL AND b.Customer_AddressLine1 is NULL))
AND ((a.Customer_InternetEmail = b.Customer_InternetEmail) OR (a.Customer_InternetEmail is NULL AND b.Customer_InternetEmail is NULL))
AND ((a.Customer_HomePhone = b.Customer_HomePhone) OR (a.Customer_HomePhone is NULL AND b.Customer_HomePhone is NULL))
order by Customer_AddressLine1
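This query gives me the duplicate rows from the dataset. Now I need to merge each group of duplicates into a single record, combining the data so that the result has the most complete set of attributes possible. Examples:
a. If two duplicate records share an email address but only one has a complete mailing address, the merged record should contain both the email address and the mailing address.
b. If two duplicate records have different values for one of the following, the merged record should use the more recent attribute, as identified by the ModifiedOn and/or CreatedOn timestamp values.
Sample data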
ID    | CreatedOn               | ModifiedOn              | Customer_LastName | Customer_FirstName | Customer_AddressLine1 | Customer_City | Customer_State | Customer_Zip | Customer_HomePhone | Customer_InternetEmail
27196 | 2012-11-14 18:51:07.000 | 2012-11-17 15:28:45.000 | NULL              | David              | 98 Pelmor Dr          | Marmora       | OR             | 85044        | NULL               | NULL
14983 | 2012-11-18 14:02:44.000 | 2012-11-18 14:02:44.000 | NULL              | David              | 98 Pelmor Dr          | Marmora       | OR             | 85044        | NULL               | NULL
You can use the row_number() window function to keep only the most recently modified row of each group:
with cte as
(
SELECT a.*
FROM MRSVoid.dbo.Customer_Dataset$ a
CROSS JOIN
(SELECT
[Customer_LastName]
,[Customer_FirstName]
,[Customer_AddressLine1]
,[Customer_HomePhone]
,[Customer_InternetEmail]
FROM MRSVoid.dbo.Customer_Dataset$
GROUP BY [Customer_LastName],
[Customer_FirstName],
[Customer_AddressLine1],
[Customer_InternetEmail],
[Customer_HomePhone]
HAVING count(*) > 1) b
where ((a.Customer_LastName = b.Customer_LastName) OR (a.Customer_LastName is NULL AND b.Customer_LastName is NULL))
AND ((a.Customer_FirstName = b.Customer_FirstName) OR (a.Customer_FirstName is NULL AND b.Customer_FirstName is NULL))
AND ((a.Customer_AddressLine1 = b.Customer_AddressLine1) OR (a.Customer_AddressLine1 is NULL AND b.Customer_AddressLine1 is NULL))
AND ((a.Customer_InternetEmail = b.Customer_InternetEmail) OR (a.Customer_InternetEmail is NULL AND b.Customer_InternetEmail is NULL))
AND ((a.Customer_HomePhone = b.Customer_HomePhone) OR (a.Customer_HomePhone is NULL AND b.Customer_HomePhone is NULL))
)
select * from
(
select *, row_number() over(partition by Customer_LastName,Customer_FirstName, Customer_AddressLine1 order by ModifiedOn desc) as rn from cte
)A where rn=1
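Note that rn = 1 keeps the newest row of each group as a whole; it does not fill in columns that are NULL on that row from older duplicates. The self-join used to find the duplicates can probably also be avoided, because PARTITION BY places NULLs in the same partition. A minimal sketch, assuming the five columns from the question define a duplicate and that ID is unique (the names flagged, dup_count and rn are made up for illustration):

```sql
WITH flagged AS
(
    SELECT d.*,
           -- Number of rows in this duplicate group; NULLs fall into the same partition
           COUNT(*) OVER (PARTITION BY d.Customer_LastName, d.Customer_FirstName,
                                       d.Customer_AddressLine1, d.Customer_HomePhone,
                                       d.Customer_InternetEmail) AS dup_count,
           -- 1 = most recently touched row in the group; ID breaks timestamp ties
           ROW_NUMBER() OVER (PARTITION BY d.Customer_LastName, d.Customer_FirstName,
                                           d.Customer_AddressLine1, d.Customer_HomePhone,
                                           d.Customer_InternetEmail
                              ORDER BY ISNULL(d.ModifiedOn, d.CreatedOn) DESC, d.ID DESC) AS rn
    FROM MRSVoid.dbo.Customer_Dataset$ d
)
SELECT *
FROM flagged
WHERE dup_count > 1   -- only groups that actually contain duplicates
  AND rn = 1;         -- one surviving row per group
```

This reads the table once instead of joining it to a grouped copy of itself, and the hand-written "equal or both NULL" predicates are no longer needed.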
Not a complete solution, more of an idea:
SELECT t.CustomerName, q1.Email, q2.MailingAddress
FROM (
SELECT CustomerName
FROM Customers
GROUP BY CustomerName
HAVING COUNT(*)>1
) t
CROSS APPLY (
SELECT TOP 1 c1.Email
FROM Customers c1
WHERE c1.CustomerName=t.CustomerName
AND c1.Email IS NOT NULL
ORDER BY ISNULL(ModifiedOn,CreatedOn) DESC
) q1
CROSS APPLY (
SELECT TOP 1 c1.MailingAddress
FROM Customers c1
WHERE c1.CustomerName=t.CustomerName
AND c1.MailingAddress IS NOT NULL
ORDER BY ISNULL(ModifiedOn,CreatedOn) DESC
) q2
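Adapted to the table in the question, the same idea might look like the sketch below. It assumes, purely for illustration, that a duplicate group is identified by Customer_FirstName and Customer_AddressLine1 (substitute whatever key really defines a duplicate). It also uses OUTER APPLY rather than CROSS APPLY, since CROSS APPLY drops a group entirely when no row in it has a non-NULL value for the column being looked up.

```sql
SELECT g.Customer_FirstName,
       g.Customer_AddressLine1,
       e.Customer_InternetEmail,
       p.Customer_HomePhone
FROM (
    -- One row per duplicate group (assumed key: first name + address line 1)
    SELECT Customer_FirstName, Customer_AddressLine1
    FROM MRSVoid.dbo.Customer_Dataset$
    GROUP BY Customer_FirstName, Customer_AddressLine1
    HAVING COUNT(*) > 1
) g
OUTER APPLY (
    -- Most recent non-NULL email in the group; NULL if the group has none
    SELECT TOP 1 c.Customer_InternetEmail
    FROM MRSVoid.dbo.Customer_Dataset$ c
    WHERE (c.Customer_FirstName = g.Customer_FirstName
           OR (c.Customer_FirstName IS NULL AND g.Customer_FirstName IS NULL))
      AND (c.Customer_AddressLine1 = g.Customer_AddressLine1
           OR (c.Customer_AddressLine1 IS NULL AND g.Customer_AddressLine1 IS NULL))
      AND c.Customer_InternetEmail IS NOT NULL
    ORDER BY ISNULL(c.ModifiedOn, c.CreatedOn) DESC
) e
OUTER APPLY (
    -- Most recent non-NULL home phone in the group; NULL if the group has none
    SELECT TOP 1 c.Customer_HomePhone
    FROM MRSVoid.dbo.Customer_Dataset$ c
    WHERE (c.Customer_FirstName = g.Customer_FirstName
           OR (c.Customer_FirstName IS NULL AND g.Customer_FirstName IS NULL))
      AND (c.Customer_AddressLine1 = g.Customer_AddressLine1
           OR (c.Customer_AddressLine1 IS NULL AND g.Customer_AddressLine1 IS NULL))
      AND c.Customer_HomePhone IS NOT NULL
    ORDER BY ISNULL(c.ModifiedOn, c.CreatedOn) DESC
) p;
```

Each additional attribute to be merged needs its own APPLY block, so this pattern suits a handful of columns better than a very wide table.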
To merge multiple rows into one record per group, use GROUP BY like this:
SELECT Max(id) as Id,
Max(createdon) as createdon,
Max(modifiedon) as modifiedon
--OTHER COLUMN USING MAX
FROM (
--YOUR CURRENT QUERY
SELECT <YOUR SELECT HERE>
FROM ....
) t
GROUP BY <ColumnNameOnWhichYouWantToGroup>
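The query above uses GROUP BY to collapse multiple rows into a single row per group, and the aggregate function MAX to pick the value for each column.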
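One caveat: MAX skips NULLs, so it does fill gaps, but when duplicates hold different non-NULL values it returns the largest value, not the most recent one. A possible workaround is to prefix each value with a fixed-width timestamp before taking MAX and strip the prefix afterwards. The sketch below assumes that ModifiedOn and CreatedOn are datetime columns, that the merged columns are character types, and (for illustration only) that Customer_FirstName and Customer_AddressLine1 form the grouping key:

```sql
SELECT Customer_FirstName,
       Customer_AddressLine1,
       -- CONVERT style 121 yields 'yyyy-mm-dd hh:mi:ss.mmm' (23 characters),
       -- which sorts chronologically when compared as text.
       -- timestamp + NULL is NULL under the default CONCAT_NULL_YIELDS_NULL ON,
       -- so MAX ignores rows where the value is missing.
       SUBSTRING(MAX(CONVERT(char(23), ISNULL(ModifiedOn, CreatedOn), 121)
                     + Customer_InternetEmail), 24, 4000) AS Customer_InternetEmail,
       SUBSTRING(MAX(CONVERT(char(23), ISNULL(ModifiedOn, CreatedOn), 121)
                     + Customer_HomePhone), 24, 4000)     AS Customer_HomePhone
FROM MRSVoid.dbo.Customer_Dataset$
GROUP BY Customer_FirstName, Customer_AddressLine1
HAVING COUNT(*) > 1;   -- only groups that contain duplicates
```

SUBSTRING starts at position 24 to drop the 23-character timestamp prefix. Non-character columns would need an explicit conversion before concatenation.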