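I am importing a list of people from a CSV that contains duplicate people with different IDs (in other words, the same person was entered twice into the system that produced the CSV). After importing the list and mapping it to internal objects, I now need to identify the duplicates in a List<Person>.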
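I consider two people a match if their IDs differ and one or more of the following holds:
SSN and last name are the same
OR
first name, last name, and birthdate are the same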
In SQL, I would do the following:
select p1.id, p2.id
from persons as p1
join persons as p2
  on p1.id != p2.id
where (p1.SSN is not null and p1.SSN = p2.SSN and p1.Lastname = p2.Lastname)
   or (p1.Firstname = p2.Firstname and p1.Lastname = p2.Lastname and p1.Birthdate is not null and p1.Birthdate = p2.Birthdate)
How can I do this with a List<Person> and LINQ?
You can use LINQ query syntax, which closely resembles the SQL you wrote:
var duplicates =
    from p1 in persons
    from p2 in persons
    where p1.Id != p2.Id && (
        // Null checks mirror the SQL: a missing SSN or birthdate never counts as a match.
        (p1.SSN != null && p1.SSN == p2.SSN && p1.LastName == p2.LastName) ||
        (p1.FirstName == p2.FirstName && p1.LastName == p2.LastName &&
         p1.BirthDate != null && p1.BirthDate == p2.BirthDate))
    select new { Person1 = p1, Person2 = p2 };
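A cross join with an inequality test reports every match twice, once as (a, b) and once as (b, a). A minimal sketch of one way around that, assuming IDs are unique and comparable, is to require the first ID to be smaller than the second. The Person shape and the PairFinder class below are hypothetical, chosen only to make the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Person shape, assumed for this sketch only.
public class Person
{
    public int Id;
    public string SSN;
    public string FirstName;
    public string LastName;
    public DateTime? BirthDate;
}

public static class PairFinder
{
    // Returns each matching pair once, with the lower Id first.
    public static List<(int First, int Second)> FindPairs(List<Person> persons)
    {
        var pairs =
            from p1 in persons
            from p2 in persons
            // "<" instead of "!=" keeps only one ordering of each pair.
            where p1.Id < p2.Id && (
                (p1.SSN != null && p1.SSN == p2.SSN && p1.LastName == p2.LastName) ||
                (p1.FirstName == p2.FirstName && p1.LastName == p2.LastName &&
                 p1.BirthDate != null && p1.BirthDate == p2.BirthDate))
            select (First: p1.Id, Second: p2.Id);
        return pairs.ToList();
    }
}
```

This is still a comparison of every person against every other, so it only removes the mirrored results, not the quadratic cost.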
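A more efficient way to do this is LINQ's .GroupBy, which hashes each person once instead of comparing every pair. The example below covers the first name, last name, and birthdate rule: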
var duplicatedNamesAndBirthdate = persons
    // A missing birthdate must not count as a match, so skip those rows.
    .Where(p => p.BirthDate != null)
    .GroupBy(p => new { p.FirstName, p.LastName, p.BirthDate })
    // Keep only keys shared by more than one person.
    .Where(g => g.Count() > 1)
    .ToDictionary(
        g => g.Key,
        g => g.ToArray()
    );
foreach (var pair in duplicatedNamesAndBirthdate)
{
    Console.WriteLine($@"{pair.Value.Length} people have the duplicated info:
FirstName={pair.Key.FirstName},
LastName={pair.Key.LastName},
BirthDate={pair.Key.BirthDate}.");
}
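The dictionary's keys are anonymous objects holding the duplicated data, and its values are arrays of the Person objects that share that data.
By the way, in SQL it would also be more efficient to use group by instead of a self-join.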
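The grouping idea can be extended to both matching rules from the question by grouping once per rule and concatenating the results. The following is a minimal sketch under that assumption; the Person shape and the DuplicateGroups class are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Person shape, assumed for this sketch only.
public class Person
{
    public int Id;
    public string SSN;
    public string FirstName;
    public string LastName;
    public DateTime? BirthDate;
}

public static class DuplicateGroups
{
    // Returns one list per duplicated key, for either matching rule.
    public static List<List<Person>> Find(List<Person> persons)
    {
        // Rule 1: same SSN and last name (rows without an SSN are ignored).
        var bySsn = persons
            .Where(p => p.SSN != null)
            .GroupBy(p => new { p.SSN, p.LastName })
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList());

        // Rule 2: same first name, last name, and birthdate
        // (rows without a birthdate are ignored).
        var byName = persons
            .Where(p => p.BirthDate != null)
            .GroupBy(p => new { p.FirstName, p.LastName, p.BirthDate })
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList());

        return bySsn.Concat(byName).ToList();
    }
}
```

People who match under both rules appear in two groups, so the result may need a further merge step if each person should be reported only once.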
Using a LINQ query comprehension
There may be a better solution using LINQ lambdas, but you can use this:
var query = ( from p1 in persons
              from p2 in persons
              where p1.id != p2.id
                 && ( ( p1.SSN != null && p1.SSN == p2.SSN && p1.Lastname == p2.Lastname )
                   || ( p1.Firstname == p2.Firstname && p1.Lastname == p2.Lastname
                     && p1.Birthdate != null && p1.Birthdate == p2.Birthdate ) )
              select new { person1_ID = p1.id, person2_ID = p2.id }
            ).ToList();
// Remove mirrored results: keep only one of (a, b) and (b, a).
// Iterate over a copy so that removing from query is safe.
foreach ( var item in query.ToList() )
    if ( query.Contains(new { person1_ID = item.person2_ID, person2_ID = item.person1_ID }) )
        query.Remove(item);
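I have not checked your condition logic; I simply used the same logic you wrote.
Using a double loop over persons (the fastest option in the benchmark below)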
var duplicates = new List<Tuple<int, int>>();
foreach ( var p1 in persons )
    foreach ( var p2 in persons )
        if ( p1.id != p2.id
          && ( ( p1.SSN != null && p1.SSN == p2.SSN && p1.Lastname == p2.Lastname )
            || ( p1.Firstname == p2.Firstname && p1.Lastname == p2.Lastname
              && p1.Birthdate != null && p1.Birthdate == p2.Birthdate ) ) )
            // Skip the pair if its mirror image was already recorded.
            if ( !duplicates.Contains(new Tuple<int, int>(p2.id, p1.id)) )
                duplicates.Add(new Tuple<int, int>(p1.id, p2.id));
Test data
public class Person
{
    public int id;
    public string SSN;
    public string Firstname;
    public string Lastname;
    public DateTime? Birthdate;
}
var persons = new List<Person>();
persons.Add(new Person { id = 1, Firstname = "a", Lastname = "a", Birthdate = null, SSN = "1" });
persons.Add(new Person { id = 2, Firstname = "b", Lastname = "b", Birthdate = null, SSN = "1" });
persons.Add(new Person { id = 3, Firstname = "a", Lastname = "a", Birthdate = null, SSN = "1" });
Testing the LINQ query
foreach ( var item in query )
    Console.WriteLine($"{item.person1_ID} <=> {item.person2_ID}");
Testing the loop
foreach ( var item in duplicates )
    Console.WriteLine($"{item.Item1} <=> {item.Item2}");
Output
3 <=> 1 // Linq
1 <=> 3 // Loop
Benchmark
Running each solution 1,000,000 times in a loop gives:
1646ms // Linq with remove duplicates
1323ms // Linq without remove duplicates
359ms // Loop
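The answer does not show how these timings were taken. As a rough sketch of one way such a measurement could be made, the helper below uses System.Diagnostics.Stopwatch; the Bench class and its Measure method are hypothetical names, not part of the original benchmark.

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Runs the action the given number of times and returns the elapsed
    // wall-clock time in milliseconds.
    public static long Measure(Action action, int iterations)
    {
        // One untimed warm-up call so JIT compilation is not counted.
        action();

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

Each solution would be wrapped in a lambda and passed to Measure with 1000000 as the iteration count. With only three people in the test data, the numbers mostly reflect per-call overhead, so the ranking may differ on a realistically sized list.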