[使用正则表达式[重复]时程序挂起

问题描述 投票:4回答:4

我的程序挂在某个点上的字符串可能是由于正则表达式中的无限循环。

我的字符串值为

String input="The contents of this Office Memorandum may also be brought to the notice";

我的代码是

String pat = "(the|The)[ ]?((([A-Za-z0-9])+([ ]| , )?([A-Za-z0-9]))+ [ ]?)+([<][ ]?[A-Za-z0-9- ]+ [,]?[A-Za-z0-9- ]+[ ]?[>])*[ ]?(Act|Rules|Rule|Bill|Regulations)[ ]?[,]?[ ]?([0-9]+)?";
Pattern p = Pattern.compile(pat);
Matcher m = p.matcher(input);

while (m.find()) {
      String sMatch = m.group();
      String temp1=sMatch;
      String temp2=sMatch;       
}

它会进入while循环,但不会退出循环。我认为正则表达式有一些问题。我不明白-这是什么问题?有什么可以解决的。当调试器进入while(m.find())时,它不在外面。

java regex
4个回答
4
投票

您遇到了灾难性的回溯,当找到一个[时发生,但是必须在每个可能的长度组合中检查两个可变长度的不匹配替代项,以证明不匹配。从数学上讲,组合的数量随输入的长度呈指数增长。

这是正则表达式的问题部分:

(([A-Za-z0-9])+([ ]| , )?([A-Za-z0-9]))+

它是字母/数字的一个或多个,可选的空格,然后是字母/数字,以及一个或多个。您可以看到,当输入长度很大时,每个“一个或多个”的数量组合很大。

3
投票
据此:Endless Loop matcher.find()您的正则表达式可能太复杂,并导致“灾难性的回溯”。您可能需要将您的正则表达式简化为仍然可以使用的更简单的正则表达式。

1
投票
正如其他人所说的那样,这种表达遭受灾难性的回溯。但是您可以使用以下代码来解决您的问题,即使它有点难看。

String input = "The contents of this Office Memorandum may also be brought to the notice Act"; String pat = "((the|The) ?((\\w|\\d)+[ ,]?(\\w|\\d)+ ?)+([<] ?[A-Za-z0-9 ]+ ,?[A-Za-z0-9 ]+[ ]?[>])* ?)"; Pattern p = Pattern.compile(pat); Matcher m = p.matcher(input); while (m.find()) { int end = m.end(); String sMatch = m.group(); System.out.println(sMatch); if (end < input.length()) { String next = input.substring(end); String pat1 = "(Act|Rules|Rule|Bill|Regulations)[ ]?[,]?[ ]?([0-9]+)?"; Pattern p1 = Pattern.compile(pat1); Matcher m1 = p1.matcher(next); if (m1.find()) { System.out.println("found"); } } } }


1
投票
这里是一个可能有帮助的Java示例(您应该可以对此进行调整以满足您的需求):

import java.util.regex.Pattern; import java.util.regex.Matcher; class RegexTest { public static void main(String[] args) { String input="The Persons with Disabilities (Equal Opportunities, Protection of Rights and Full Participation) Act, 1995 states that everone must recognise the Corn Act and the Hoopety Fling Regulations 2006 or face death by oranges."; String pat = "(?i)(the\\b)(?> +)((?>(?>(?>\\w|\\)|\\()+(?:,| )*))+?)\\b(Act|Rules|Rule|Bill|Regulations)\\b(?: |,)*(?>(\\d{4})(?:\\D|$))?"; Pattern p = Pattern.compile(pat); Matcher m = p.matcher(input); while (m.find()) { System.out.println(m.group(0).trim().replace(' ', '-')); } } } // Will Output: // The-Persons-with-Disabilities-(Equal-Opportunities,-Protection-of-Rights-and-Full-Participation)-Act,-1995 // the-Corn-Act // the-Hoopety-Fling-Regulations-2006

正如其他人所说的那样,这种表达遭受了灾难性的回溯more info on catastrophic backtracking

我认为表达式应该可以满足您的需求:

(?i)(the\b)(?> +)((?>(?>\w+(?:,| )*))+?)\b(Act|Rules|Rule|Bill|Regulations)\b(?: |,)*(?>(\d{4})(?:\D|$))?

注意原子基团(?>)的使用将有效地“忘记”回溯的可能性。让我知道您是否有任何问题。

更新为包括括号:

(?i)(the\b)(?> +)((?>(?>(?>\w|\)|\()+(?:,| )*))+?)\b(Act|Rules|Rule|Bill|Regulations)\b(?: |,)*(?>(\d{4})(?:\D|$))?

示例字符串:

'The Persons with Disabilities (Equal Opportunities, Protection of Rights and Full Participation) Act, 1995 states that everone must recognise the Corn Act and the Hoopety Fling Regulations 2006 or face death by oranges.'

将产生结果:

[Match number 1] Matched: 'The Persons with Disabilities (Equal Opportunities, Protection of Rights and Full Participation) Act, 1995 ' at character 1 [Capture Group 1] 'The' found at character 1 [Capture Group 2] 'Persons with Disabilities (Equal Opportunities, Protection of Rights and Full Participation) ' found at character 5 [Capture Group 3] 'Act' found at character 98 [Capture Group 4] '1995' found at character 103 [Match number 2] Matched: 'the Corn Act ' at character 143 [Capture Group 1] 'the' found at character 143 [Capture Group 2] 'Corn ' found at character 147 [Capture Group 3] 'Act' found at character 152 [Capture Group 4] '' found at character 1 [Match number 3] Matched: 'the Hoopety Fling Regulations 2006 ' at character 160 [Capture Group 1] 'the' found at character 160 [Capture Group 2] 'Hoopety Fling ' found at character 164 [Capture Group 3] 'Regulations' found at character 178 [Capture Group 4] '2006' found at character 190

© www.soinside.com 2019 - 2024. All rights reserved.