How do I find duplicates in a list using LINQ?

Question (votes: 0, answers: 3)

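I'm importing a list of people from a CSV that contains duplicate people with different IDs (in other words, the same person was entered twice in the system that generated the CSV). After importing the list and mapping it to internal objects, I now need to identify the duplicates in a List<Person>.

I consider two people a match if their IDs differ and at least one of the following is the same:

  1. SSN and last name

OR

  2. First name, last name, and birth date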

In SQL, I would do the following:

select p1.id, p2.id
 from persons as p1
 join persons as p2
   on p1.id != p2.id
 where (p1.SSN is not null AND p1.SSN = p2.SSN AND p1.Lastname = p2.Lastname)
    or (p1.Firstname = p2.Firstname AND p1.Lastname = p2.Lastname AND p1.Birthdate is not null AND p1.Birthdate = p2.Birthdate)

How can I do this with a List<Person> and LINQ?

c# linq
3 Answers
-1
votes

You can use LINQ query syntax, which closely resembles the SQL you wrote:

var duplicates =
    from p1 in persons
    from p2 in persons
    where p1.Id != p2.Id && (
        (p1.SSN == p2.SSN && p1.LastName == p2.LastName) ||
        (p1.FirstName == p2.FirstName && p1.LastName == p2.LastName && p1.BirthDate == p2.BirthDate))
    select new { Person1 = p1, Person2 = p2 };
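Note that the query above drops the null guards from the question's SQL and returns every match twice, once as (p1, p2) and once as (p2, p1). The following is a minimal sketch, not the answerer's code, that keeps the null checks and compares with `p1.Id < p2.Id` so each pair appears once. The `Person` class, its property types, and the `DuplicateQuery.FindPairs` name are assumptions made for illustration; it also assumes IDs are integers.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model: property names follow the query above; the types are assumptions.
public class Person
{
    public int Id { get; set; }
    public string? SSN { get; set; }
    public string? FirstName { get; set; }
    public string? LastName { get; set; }
    public DateTime? BirthDate { get; set; }
}

public static class DuplicateQuery
{
    // Returns each matching pair once, ordered so that Id1 < Id2.
    public static List<(int Id1, int Id2)> FindPairs(IEnumerable<Person> persons)
    {
        var list = persons.ToList(); // materialize once; the query enumerates it repeatedly
        var query =
            from p1 in list
            from p2 in list
            where p1.Id < p2.Id && (
                (p1.SSN != null && p1.SSN == p2.SSN && p1.LastName == p2.LastName) ||
                (p1.BirthDate != null && p1.BirthDate == p2.BirthDate &&
                 p1.FirstName == p2.FirstName && p1.LastName == p2.LastName))
            select (Id1: p1.Id, Id2: p2.Id);
        return query.ToList();
    }
}
```

Calling `DuplicateQuery.FindPairs(persons)` still compares every person with every other person, so the cost stays quadratic in the list size; only the mirrored results go away.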

-1
votes

An efficient way to do this is LINQ's .GroupBy, shown here for the first name, last name, and birth date rule:

var duplicatedNamesAndBirthdate = persons
    .GroupBy(p => new { p.FirstName, p.LastName, p.BirthDate })
    .Where(g => g.Count() > 1)
    .ToDictionary(
        g => g.Key,        
        g => g.ToArray()
    );

foreach (var pair in duplicatedNamesAndBirthdate)
{
    Console.WriteLine(@$"{pair.Value.Length} people have the duplicated info: 
FirstName={pair.Key.FirstName}, 
LastName={pair.Key.LastName}, 
BirthDate={pair.Key.BirthDate}.");
}

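The dictionary's keys are anonymous objects holding the duplicated values, and its values are arrays of the Person objects that share them.

Incidentally, in SQL too, using GROUP BY instead of a self-join would be more efficient.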
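The grouping above covers only one of the question's two rules, and it also groups people whose birth date is null. As a hedged sketch of how both rules might be handled, the code below runs one GroupBy per rule and filters out null keys first. The `Person` class, its property types, and the `DuplicateGroups.Find` name are assumptions for illustration, not part of the original answer.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model: property names follow the answer above; the types are assumptions.
public class Person
{
    public int Id { get; set; }
    public string? SSN { get; set; }
    public string? FirstName { get; set; }
    public string? LastName { get; set; }
    public DateTime? BirthDate { get; set; }
}

public static class DuplicateGroups
{
    // Returns one list per group of people who share a key under either rule.
    public static List<List<Person>> Find(IEnumerable<Person> persons)
    {
        var list = persons.ToList();

        // Rule 1: same SSN and last name (people without an SSN are skipped).
        var bySsn = list
            .Where(p => p.SSN != null)
            .GroupBy(p => new { p.SSN, p.LastName })
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList());

        // Rule 2: same first name, last name, and birth date (null birth dates are skipped).
        var byNameAndBirthDate = list
            .Where(p => p.BirthDate != null)
            .GroupBy(p => new { p.FirstName, p.LastName, p.BirthDate })
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList());

        return bySsn.Concat(byNameAndBirthDate).ToList();
    }
}
```

People who match under both rules appear in two groups, so a caller that needs each pair only once would still have to merge the results.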


-1
votes

Using a LINQ comprehension query

There may be a better solution using LINQ lambdas, but you can use this:

var query = ( from p1 in persons
              from p2 in persons
              where p1.id != p2.id
                  && ( ( p1.SSN != null && p1.SSN == p2.SSN && p1.Lastname == p2.Lastname )
                    || ( p1.Firstname == p2.Firstname && p1.Lastname == p2.Lastname
                      && p1.Birthdate != null && p1.Birthdate == p2.Birthdate ) )
              select new { person1_ID = p1.id, person2_ID = p2.id }
            ).ToList();

// Remove mirrored results: of (a, b) and (b, a), keep only one
foreach ( var item in query.ToList() )
  if ( query.Contains(new { person1_ID = item.person2_ID, person2_ID = item.person1_ID }) )
    query.Remove(item);

I haven't checked your condition logic; I simply reused the logic you wrote.

A double loop over persons is fastest:

var duplicates = new List<Tuple<int, int>>();

foreach ( var p1 in persons )
  foreach ( var p2 in persons )
    if ( p1.id != p2.id
      && ( ( p1.SSN != null && p1.SSN == p2.SSN && p1.Lastname == p2.Lastname )
        || ( p1.Firstname == p2.Firstname && p1.Lastname == p2.Lastname
          && p1.Birthdate != null && p1.Birthdate == p2.Birthdate ) ) )
      if ( !duplicates.Contains(new Tuple<int, int>(p2.id, p1.id)) )
        duplicates.Add(new Tuple<int, int>(p1.id, p2.id));
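The `duplicates.Contains` check above scans the result list on every match. One possible variation, sketched here under the assumption that the list can be indexed, starts the inner loop at `i + 1` so that each unordered pair is visited exactly once and no lookup is needed. The `Person` class mirrors the test data used in this answer; the `PairFinder.FindPairs` name is hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Same shape as the Person class in this answer's test data.
public class Person
{
    public int id;
    public string? SSN;
    public string? Firstname;
    public string? Lastname;
    public DateTime? Birthdate;
}

public static class PairFinder
{
    // Visits each unordered pair (i, j) with i < j once, so no mirrored results are produced.
    public static List<(int, int)> FindPairs(IReadOnlyList<Person> persons)
    {
        var duplicates = new List<(int, int)>();
        for (int i = 0; i < persons.Count; i++)
        {
            var p1 = persons[i];
            for (int j = i + 1; j < persons.Count; j++)
            {
                var p2 = persons[j];
                if (p1.id != p2.id
                    && ((p1.SSN != null && p1.SSN == p2.SSN && p1.Lastname == p2.Lastname)
                        || (p1.Firstname == p2.Firstname && p1.Lastname == p2.Lastname
                            && p1.Birthdate != null && p1.Birthdate == p2.Birthdate)))
                {
                    duplicates.Add((p1.id, p2.id));
                }
            }
        }
        return duplicates;
    }
}
```

This works because the match condition is symmetric: if p1 matches p2, then p2 matches p1, so checking only one ordering loses nothing.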

Test data

public class Person
{
  public int id;
  public string SSN;
  public string Firstname;
  public string Lastname;
  public DateTime? Birthdate;
}

var persons = new List<Person>();

persons.Add(new Person { id = 1, Firstname = "a", Lastname = "a", Birthdate = null, SSN = "1" });
persons.Add(new Person { id = 2, Firstname = "b", Lastname = "b", Birthdate = null, SSN = "1" });
persons.Add(new Person { id = 3, Firstname = "a", Lastname = "a", Birthdate = null, SSN = "1" });

Test LINQ

foreach ( var item in query )
  Console.WriteLine($"{item.person1_ID} <=> {item.person2_ID}");

Test loop

foreach ( var item in duplicates )
  Console.WriteLine($"{item.Item1} <=> {item.Item2}");

Output

3 <=> 1  // Linq

1 <=> 3  // Loop

Benchmark

Running each solution 1,000,000 times gives:

1646ms  // LINQ with duplicate removal
1323ms  // LINQ without duplicate removal

359ms   // Loop
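
The answer does not show how these timings were taken. A minimal sketch of one way to time a solution over many iterations, assuming `System.Diagnostics.Stopwatch` and a hypothetical `Bench.Measure` helper, could look like this:

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Runs the action `iterations` times and returns the total elapsed milliseconds.
    public static long Measure(Action action, int iterations)
    {
        action(); // warm-up call so JIT compilation is not included in the timing
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

For example, wrapping the loop solution in a lambda and passing it as `Bench.Measure(() => { /* loop solution here */ }, 1000000)` returns one total in milliseconds. Results depend on the machine and runtime, so only comparisons made on the same setup are meaningful.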