从 HashSet 中选取“任何”项目非常慢 - 如何快速做到这一点？

Question

我目前正在使用贪婪算法做很多工作 - 它们不关心索引等，它们只适用于组/集合。但我发现 85% 的执行时间都花在尝试从 HashSet 中选择一个项目上。

根据 MSDN 文档：

HashSet 类提供高性能的集合操作。一套是一个不包含重复元素的集合，并且其元素没有特定的顺序。

...除了“获取值”之外的所有事情似乎都是如此。

我已经尝试过：

ElementAt(0) - 非常慢，据说是因为 HashSet 生成临时排序，对所有内容进行排序，然后返回最先的内容
首先 - 非常慢（大概它在做同样的事情）

我的期望：

AnyItem() - 从 HashSet 返回一个项目，没有任何保证。可能是第一个，可能是最后一个，可能是随机的（但不要指望这一点）。

...但我似乎想不出办法。

EDIT2：稍微更详细的源代码，以明确为什么 HashSet 中这个缺失功能的一些简单解决方法没有帮助，并希望更好地展示为什么 HashSet 在所有其他方面都是正确的类：

HashSet<MyClass> candidates;
HashSet<MyClass> unvisited;

... // fill unvisited with typically: 100,000 to 10 million items
... // fill candidates with at least 1 item, potentially 100,000's of items

while( candidates.Count > 0 && unvisited.Count > 0 )
{
  var anyItem = candidates.First();

  while( ! CanProcess( anyItem ) ) // CanProcess probable checks some set intersections
  {
     candidates.Remove( anyItem );
     if( candidates.Count > 0 )
        anyItem = candidates.First();
     else
     {
        anyItem = null;
        break;
     }
  }

  if( anyItem == null ) // we've run out of candidates
     break;

  // For the algorithm: "processing" anyItem has a side-effect of 
  // transferring 0 or more items from "unvisited" into "candidates"
  var extraCandidates = Process( anyItem, unvisited );
  // ... Process probably does some set intersections
  
  ... // add all the extraCandidates to candidates
  ... // remove all the extraCandidates from unvisited
  
  candidates.Remove( anyItem );
}

即：典型的贪心算法有几组：一组“下一次迭代的起点”，以及一组或多组（这里我只显示了一组）“尚未处理的数据，并且以某种方式连接到/可从起点到达”。

...一切都很快，唯一慢的是“第一个”调用 - 我没有理由接受第一个，我可以接受任何一个，但我需要接受一些东西！

Answer 1

似乎

HashSet

类没有针对其第一项被重复删除的场景进行优化。每次删除后，该类内部保留的空间不会减少，而是将相应的槽标记为空。该类的枚举器枚举所有内部槽，并在找到非空槽时生成一个值。当

HashSet

的内部空间变得稀疏时，这可能会变得极其低效。例如，曾经容纳 1,000,000 个元素并已减少为单个元素的

HashSet

在生成存储在其单个非空槽中的元素之前必须枚举 1,000,000 个槽：

var set = new HashSet<int>(Enumerable.Range(1, 1_000_000));
set.ExceptWith(Enumerable.Range(1, 999_999));
var item = set.First(); // Very slow

这是一个不容易解决的问题。一种解决方案是在每批删除之后调用

TrimExcess

方法。此方法通过分配新的槽数组、将项目从现有数组复制到新数组，最后丢弃旧数组，最大限度地减少类内部保留的空间。这是一项昂贵的操作，因此过于频繁地调用

TrimExcess

可能会成为应用程序的新瓶颈。

另一种解决方案可能是使用不受此问题困扰的第三方实现。例如， Rock.Collections 库包含一个

OrderedHashSet

类，该类按添加顺序保留项目。它通过以链表方式连接内部槽来实现这一点。类不仅可以按正常顺序枚举，也可以按相反顺序枚举。我没有测试它，但很可能调用

First

应该是 O(1) 操作。

下面是一个解决方案，它使用反射来欺骗内置枚举器从随机索引而不是索引 0 开始枚举插槽。它提供了可接受的性能，但它遇到了已知的反射问题（开销、前向兼容性等）。静态

GetRandom

方法位于通用静态类内部，以便为每种类型

FieldInfo

单独缓存

信息。

public static class HashSetRandomizer<T>
{
    private static FieldInfo _lastIndexField;
    private static FieldInfo _indexField;
    private static ThreadLocal<Random> _random;

    static HashSetRandomizer()
    {
        const BindingFlags FLAGS = BindingFlags.NonPublic | BindingFlags.Instance;
        _lastIndexField = typeof(HashSet<T>).GetField("m_lastIndex", FLAGS) // Framework
            ?? typeof(HashSet<T>).GetField("_lastIndex", FLAGS); // Core
        _indexField = typeof(HashSet<T>.Enumerator).GetField("index", FLAGS) // Framework
            ?? typeof(HashSet<T>.Enumerator).GetField("_index", FLAGS); // Core
        _random = new ThreadLocal<Random>(() => new Random());
    }

    public static T GetRandom(HashSet<T> source, Random random = null)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        random = random ?? _random.Value;
        if (_lastIndexField == null)
            throw new NotSupportedException("FieldInfo lastIndex not found.");
        if (_indexField == null)
            throw new NotSupportedException("FieldInfo index not found.");
        if (source.Count > 0)
        {
            int lastIndex = (int)_lastIndexField.GetValue(source);
            if (lastIndex > 0)
            {
                var randomIndex = random.Next(0, lastIndex);
                using (IEnumerator<T> enumerator = source.GetEnumerator())
                {
                    _indexField.SetValue(enumerator, randomIndex);
                    if (enumerator.MoveNext()) return enumerator.Current;
                }
            }
            foreach (var item in source) return item; // Fallback
        }
        throw new InvalidOperationException("The source sequence is empty.");
    }
}

使用示例。项目会从

HashSet

中随机移除，直到集合为空。

var set = new HashSet<int>(Enumerable.Range(1, 1_000_000));
while (set.Count > 0)
{
    var item = HashSetRandomizer<int>.GetRandom(set); // Fast
    set.Remove(item);
}

即使采用这种方法，删除最后几个项目仍然很慢。

Answer 2

始终使用

First()

需要始终构建内部

Enumerator

结构。但由于获取哪个元素并不重要，因此您只需检索

IEnumerator

对象一次，然后继续从中读取数据。所以这基本上是一个正常的

foreach

循环，你必须处理这些条目。

为了防止出现任何“集合已修改”异常，在迭代完成之前，不得从 HashSet 中删除后续条目。因此，您可以保存已进行的条目并在之后将其删除。源代码可能如下所示：

HashSet

当您迭代整个 HashSet 并且不在每个条目后运行“元数据”检查时，您可能需要调整外部

HashSet<MyClass> hs /// approx 500,000 items

while(/* metadata based on what's been processed*/ ) // might be adjusted now
{    
    Set<MyClass> toDelete = new HashSet<MyClass>();
    while (MyClass entry in hs) // get Enumerator only once, then iterate normally
    {
        if(ShouldProcess(entry))
        {
            Process(entry);
            toDelete.Remove(entry);
        }
    }
    // finally delete them
    foreach (MyClass entry in toDelete)
    {
        hs.Remove(entry);
    }
}

循环，因为事实上，在一次外部

while

循环迭代中，整个 HashSet 都会被迭代（并且条目不会立即删除）。

从 HashSet 中选取“任何”项目非常慢 - 如何快速做到这一点？

问题描述投票：0回答：2

2个回答

最新问题

从 HashSet 中选取“任何”项目非常慢 - 如何快速做到这一点？

问题描述 投票：0回答：2

2个回答

最新问题

问题描述投票：0回答：2