C# - Problem with Task results

Question. Votes: -1, Answers: 2

I have to find duplicate addresses in a list of 758,000 addresses by string comparison. Here is what I have so far:

public int StartDuplicateFinder(List<Address> addresses, List<Address> checklist)
{
    int found = 0;

    foreach(Address addr1 in addresses)
    {
        List<Address> addresses2 = checklist.FindAll(
        delegate (Address addr2)
        {
            return addr2.AddressString == addr1.AddressString && addr2.Duplicate == "" 
                && addr2.AddrIndex > addr1.AddrIndex;
        }
        );

        foreach(Address addr2 in addresses2)
        {
            addr2.Duplicate = "1";
            found++;
        }
    }
    return found;
}

This takes about 7 hours (far too long) and finds about 93,000 duplicates!

To speed this up, I split the checklist into 4 parts (200k, 200k, 200k and 158k), kept as separate List<Address> instances in checklists, and used Task like this:

public class Worker
{
    private List<Address> addresses = null;
    private List<Address> checklist = null;
    private int found = 0;

    public Worker(List<Address> _addresses, List<Address> _checklist)
    {
        addresses = _addresses; //always 758.000 addresses

        //with 4 Tasks: Task 1, 2 and 3 = 200.000 addresses, Task 4 = 158.000 addresses
        //with 6 Tasks: Task 1, 2, 3, 4 and 5 with 150.000 addresses, Task 6 with 8.000 addresses
        checklist = _checklist;
    }

    public int StartDuplicateFinder()
    {
        foreach(Address addr1 in addresses)
        {
            List<Address> addresses2 = checklist.FindAll(
            delegate (Address addr2)
            {
                return addr2.AddressString == addr1.AddressString && addr2.Duplicate == "" 
                    && addr2.AddrIndex > addr1.AddrIndex;
            }
            );

            foreach(Address addr2 in addresses2)
            {
                addr2.Duplicate = "1";
                found++;
            }
        }
        return found;
    }

    public int Found {get {return found;}}
}

private async void StartTasks(List<Task> Tasklist)
{
    foreach (Task t in Tasklist)
    {
        t.Start();
    }
    await Task.WhenAll(Tasklist.ToArray());
}

private void DoSomething()
{
    List<Task> Tasklist = new List<Task>();

    foreach(List<Address> checklist in checklists)
    {
        Worker w = new Worker(addresses, checklist);
        Tasklist.Add(new Task(() => w.StartDuplicateFinder()));
    }
    StartTasks(Tasklist);

    //wait until the tasks are finished
    //do other stuff
    ...
}

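Now it runs in only 35 minutes! But when I look at the duplicates found, there is a large discrepancy: only about 27,000 duplicates are found.

I tried it several times and got a different result each time.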

4 Tasks, first run:     4 Tasks, second run:
Task# > duplicates      Task# > duplicates
1     >   749           1     >   689
2     >  2450           2     >  2391
3     > 10304           3     > 10073
4     > 14462           4     > 14282
Sum   > 27965           Sum   > 27435

6 Tasks, first run:     6 Tasks, second run:
Task# > duplicates      Task# > duplicates
1     >    16           1     >    24
2     >    56           2     >    55
3     >   202           3     >   236
4     >   679           4     >   634
5     >   852           5     >   800
6     >  2985           6     >  2981
Sum   >  4790           Sum   >  4730

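It is the same list of 758,000 addresses every time.

I tried it with Task, Thread and BackgroundWorker, but I always get different results! If I run this in 1 Task, the result is always 92,377 duplicates (which I believe is correct).

Can anyone help me solve this problem?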

c# winforms multitasking
2 Answers
3
votes

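There is not enough code in the question to do more than speculate. If you really want to troubleshoot this, you need a complete, self-contained repro. However, the typical cause of this behavior is that you are modifying shared data from multiple threads. As a rule of thumb, parallel workers should operate on read-only data and return their results to be accumulated by the main thread.

But consider using a more efficient data structure and algorithm instead of parallelism. That is, instead of a nested-loop join, build a hash table (Dictionary<TKey,TValue> or Lookup<TKey,TElement>) to make it faster on a single thread, at least as a first step. For example, something like: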
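To illustrate the rule of thumb above, here is a minimal sketch of workers that only read shared data and hand back a count for the caller to add up. The Address class is a hypothetical stand-in for the one in the question, and ReadOnlyWorkers, CountDuplicates and workerCount are names invented for this example. It assumes AddrIndex is unique, AddressString is never null, and that a "duplicate" means any address whose string already occurs at a lower AddrIndex, which is how I read the original loop.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical minimal stand-in for the Address class from the question.
public class Address
{
    public string AddressString { get; set; }
    public int AddrIndex { get; set; }
}

public static class ReadOnlyWorkers
{
    public static int CountDuplicates(IReadOnlyList<Address> addresses, int workerCount)
    {
        if (workerCount < 1) throw new ArgumentOutOfRangeException(nameof(workerCount));

        // Built once on the calling thread, then only read by the workers:
        // the lowest AddrIndex seen for each address string.
        var firstIndex = new Dictionary<string, int>();
        foreach (Address a in addresses)
        {
            if (!firstIndex.TryGetValue(a.AddressString, out int seen) || a.AddrIndex < seen)
                firstIndex[a.AddressString] = a.AddrIndex;
        }

        int chunkSize = (addresses.Count + workerCount - 1) / workerCount;
        var tasks = new List<Task<int>>();

        for (int start = 0; start < addresses.Count; start += chunkSize)
        {
            int from = start; // copy, because the loop variable is shared by all closures
            int to = Math.Min(start + chunkSize, addresses.Count);

            tasks.Add(Task.Run(() =>
            {
                int local = 0; // private to this worker, nothing shared is written
                for (int i = from; i < to; i++)
                {
                    Address a = addresses[i];
                    if (a.AddrIndex > firstIndex[a.AddressString]) local++;
                }
                return local;
            }));
        }

        // Block until every worker has finished, then accumulate on the caller.
        return Task.WhenAll(tasks).Result.Sum();
    }
}
```

Because no worker writes to anything another worker reads, the total should not depend on how many chunks are used or on scheduling.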

public int StartDuplicateFinder(List<Address> addresses, List<Address> checklist)
{
    int found = 0;

    var checklistByAddressString = checklist.ToLookup(a => a.AddressString, a => a);

    foreach (Address addr1 in addresses)
    {
        var addressMatches = checklistByAddressString[addr1.AddressString];
        var addresses2 = addressMatches.Where(addr2 => addr2.Duplicate == ""
                && addr2.AddrIndex > addr1.AddrIndex);

        foreach (Address addr2 in addresses2)
        {
            addr2.Duplicate = "1";
            found++;
        }
    }
    return found;
}

1
vote

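The problem is that you are modifying items in the list while you loop over it, which creates a race condition.

Your filter has, as part of its condition: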

addr2.Duplicate == ""

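and later you change the item:

addr2.Duplicate = "1"

It might be better to chunk addresses and run each chunk against the full checklist.

Alternatively, just use a LINQ query over the two lists on a single thread. You will most likely get results faster than by manually iterating over the collections.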
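As a rough sketch of the single-threaded LINQ route, the following groups addresses by their string and treats every item after the lowest AddrIndex in each group as a duplicate. The Address class is again a hypothetical stand-in, LinqDuplicateFinder and MarkDuplicates are names made up for illustration, and the sketch assumes both lists in the question hold the same items, so a single input sequence is enough.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal stand-in for the Address class from the question.
public class Address
{
    public string AddressString { get; set; }
    public int AddrIndex { get; set; }
    public string Duplicate { get; set; } = "";
}

public static class LinqDuplicateFinder
{
    public static int MarkDuplicates(IEnumerable<Address> addresses)
    {
        List<Address> duplicates = addresses
            .GroupBy(a => a.AddressString)                          // one group per distinct string
            .SelectMany(g => g.OrderBy(a => a.AddrIndex).Skip(1))   // keep the first, flag the rest
            .ToList();                                              // materialize before mutating

        foreach (Address a in duplicates)
            a.Duplicate = "1";

        return duplicates.Count;
    }
}
```

GroupBy hashes each string once, so the work grows roughly linearly with the number of addresses, rather than quadratically as with the nested FindAll loop.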
