.NET Core 2.0 Regex Timeout Deadlock

Question — votes: 11, answers: 1

I have a .NET Core 2.0 application in which I iterate over many files of varying sizes (600,000 files, 220 GB in total).

I enumerate them with:

new DirectoryInfo(TargetPath)
    .EnumerateFiles("*.*", SearchOption.AllDirectories)
    .GetEnumerator()

and iterate over them with:

Parallel.ForEach(contentList.GetConsumingEnumerable(),
    new ParallelOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount * 2
    },
    file => ...

Inside that loop, I have a list of regular expressions that I use to scan each file:

Parallel.ForEach(_Rules,
    new ParallelOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount * 2
    },
    rule => ... 

Finally, I obtain the matches using an instance of the Regex class:

RegEx = new Regex(
    Pattern.ToLowerInvariant(),
    RegexOptions.Multiline | RegexOptions.Compiled,
    TimeSpan.FromSeconds(_MaxSearchTime))

This instance is shared across all files, so each pattern is compiled only once. There are 175 patterns applied to each file.

At a random(ish) point, the application deadlocks and becomes completely unresponsive. No try/catch prevents this from happening. If I take exactly the same code and compile it for .NET Framework 4.6, it works without any problems.

I have tried many things. My current test, which seems to work (though I remain very wary!), is to call the STATIC Regex.Matches method on each call instead of using an INSTANCE. I don't know how big a performance hit I am taking, but at least I am not hitting the deadlock.

I could use some insight, or at least let this serve as a cautionary tale.
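As an aside on the "no try/catch prevents this" point: when a regex match timeout fires, it surfaces as RegexMatchTimeoutException, which can be caught specifically instead of being swallowed by a bare catch. A minimal sketch, with an illustrative pattern and input deliberately chosen to force catastrophic backtracking (not taken from the 175 real patterns):

```csharp
using System;
using System.Text.RegularExpressions;

class TimeoutDemo
{
    static void Main()
    {
        // "(a+)+$" backtracks exponentially on a long run of 'a'
        // followed by a character that prevents a match.
        var evil = new Regex("(a+)+$", RegexOptions.None,
                             TimeSpan.FromMilliseconds(100));
        string input = new string('a', 40) + "b";

        bool timedOut = false;
        try
        {
            evil.IsMatch(input);
        }
        catch (RegexMatchTimeoutException ex)
        {
            // The exception carries the pattern and the timeout that fired.
            Console.WriteLine($"Timed out after {ex.MatchTimeout}: {ex.Pattern}");
            timedOut = true;
        }
        Console.WriteLine(timedOut);
    }
}
```

Catching the specific exception type makes it possible to log which pattern and file caused the timeout instead of losing that information in a general catch.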

Update: I build the file list like this:

private void GetFiles(string TargetPath, BlockingCollection<FileInfo> ContentCollector)
    {
        IEnumerator<FileInfo> fileEnum = new DirectoryInfo(TargetPath)
            .EnumerateFiles("*.*", SearchOption.AllDirectories)
            .GetEnumerator();
        while (fileEnum.MoveNext())
        {
            try
            {
                FileInfo file = fileEnum.Current;
                //Skip long file names to mimic .NET Framework deficiencies
                if (file.FullName.Length > 256) continue;
                ContentCollector.Add(file);
            }
            catch { }
        }
        ContentCollector.CompleteAdding();
    }

In my Rule class, here are the relevant methods:

_RegEx = new Regex(
    Pattern.ToLowerInvariant(),
    RegexOptions.Multiline | RegexOptions.Compiled,
    TimeSpan.FromSeconds(_MaxSearchTime));
...
    public MatchCollection Matches(string Input)
    {
        try { return _RegEx.Matches(Input); }
        catch { return null; }
    }

    public MatchCollection Matches2(string Input)
    {
        try
        {
            return Regex.Matches(
                Input,
                Pattern.ToLowerInvariant(),
                RegexOptions.Multiline,
                TimeSpan.FromSeconds(_MaxSearchTime));
        }
        catch { return null; }
    }

And this is the matching code:

    public List<SearchResult> GetMatches(string TargetPath)
    {
        //Set up the concurrent containers
        ConcurrentBag<SearchResult> results = new ConcurrentBag<SearchResult>();
        BlockingCollection<FileInfo> contentList = new BlockingCollection<FileInfo>();

        //Start getting the file list
        Task collector = Task.Run(() => { GetFiles(TargetPath, contentList); });
        int cnt = 0;
        //Start processing the files.
        Task matcher = Task.Run(() =>
        {
            //Process each file making it as parallel as possible                
            Parallel.ForEach(contentList.GetConsumingEnumerable(), new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 2 }, file =>
            {
                //Read in the whole file and lowercase it once, so the
                //Regex engine does not have to do it for each of the
                //175 patterns!
                string inputString = File.ReadAllText(file.FullName);
                string inputStringLC = inputString.ToLowerInvariant();

                //Run through all the patterns as parallel as possible
                Parallel.ForEach(_Rules, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 2 }, rule =>
                {
                    MatchCollection matches = null;
                    int matchCount = 0;
                    Stopwatch ruleTimer = Stopwatch.StartNew();

                    //Run the match for the rule and then get our count (does the actual match iteration)
                    try
                    {
                        //This does not work - Causes Deadlocks:
                        //matches = rule.Matches(inputStringLC);

                        //This works - No Deadlocks;
                        matches = rule.Matches2(inputStringLC);

                        //Process the regex by calling .Count()
                        if (matches == null) matchCount = 0;
                        else matchCount = matches.Count;
                    }

                    //Catch timeouts (MatchCollection is lazy, so the
                    //timeout can also surface from matches.Count)
                    catch (Exception)
                    {
                        //Log the error
                        string timeoutMessage = String.Format("****** Regex Timeout: {0} ===> {1} ===> {2}", ruleTimer.Elapsed, rule.Pattern, file.FullName);
                        Console.WriteLine(timeoutMessage);
                        matchCount = 0;
                    }
                    ruleTimer.Stop();

                    if (matchCount > 0)
                    {
                        Interlocked.Increment(ref cnt); //thread-safe: this runs inside Parallel.ForEach
                        //Iterate the matches and generate our match records
                        foreach (Match match in matches)
                        {
                            //Fill my result object
                            //...

                            //Add the result to the collection
                            results.Add(result);
                        }
                    }
                });
            });
        });

        //Wait until all are done.
        Task.WaitAll(collector, matcher);

        Console.WriteLine("Found {0:n0} files with {1:n0} matches", cnt, results.Count);


        return results.ToList();
    }

Update 2: I ran a test with no deadlock, but as it neared the end it seemed to stall, although I could still attach to the process with VS. Then I realized that I had not set a timeout in that test, while I did in the code I posted (rule.Matches and rule.Matches2). With the timeout, it deadlocks. Without the timeout, it does not. Both still work on .NET Framework 4.6. I need the regex timeout because there are some large files on which some of the patterns stall.

Update 3: I have been playing with the timeout value, and it seems to be some combination of running threads, timeout exceptions, and the timeout value that causes the Regex engine to deadlock. I cannot pin it down exactly, but a timeout >= 5 minutes seems to help. As a temporary fix I can set the value to 10 minutes, but that is not a permanent fix!

c# .net-core asp.net-core-2.0
1 Answer

0 votes

If I had to guess, I would blame the regular expressions:

  • RegexOptions.Compiled is not implemented in .NET Core 2.0 (source)
  • some of your 175 patterns may be slightly evil

This could cause a significant performance difference between .NET Framework 4.6 and .NET Core 2.0, which in turn may make the application unresponsive.
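One way to hunt for such "slightly evil" patterns is to time each one against a representative input with a short timeout and log whichever ones hit it. A rough sketch, where the pattern list and input are placeholders rather than the asker's real data:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.RegularExpressions;

class PatternProfiler
{
    // Returns the patterns that exceeded the timeout on the given input.
    static List<string> FindSlowPatterns(
        IEnumerable<string> patterns, string input, TimeSpan timeout)
    {
        var slow = new List<string>();
        foreach (string pattern in patterns)
        {
            var sw = Stopwatch.StartNew();
            try
            {
                // .Count forces evaluation of the lazy MatchCollection.
                int count = Regex.Matches(input, pattern,
                                          RegexOptions.Multiline, timeout).Count;
            }
            catch (RegexMatchTimeoutException)
            {
                slow.Add(pattern);
            }
            Console.WriteLine($"{sw.Elapsed}  {pattern}");
        }
        return slow;
    }

    static void Main()
    {
        string input = new string('a', 35) + "b";
        var patterns = new[] { "a+b", "(a+)+$" }; // the second backtracks badly
        var slow = FindSlowPatterns(patterns, input,
                                    TimeSpan.FromMilliseconds(200));
        Console.WriteLine(string.Join(", ", slow));
    }
}
```

Running this kind of profile over the real pattern set against a few of the problematic large files would narrow the 175 patterns down to the handful worth rewriting.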
