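I have a regular expression that matches a particular pattern in my data, and I then process the matches with string operations. The patterns consist of parentheses and periods: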
((((((.))))))(...(((((..).).)))).
My current regex:
(?:\([\(.]+\()(?:[^()]*\.+[^()]*)\)[\).]+\)
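finds the longest possible match, which is then checked against a minimum number of (matching) opening and closing brackets and a maximum number of dots.
However, I have run into a problem: sometimes the longest match meets the minimum bracket count but has too many dots, so the pattern is not extracted, even though a stretch inside the pattern does meet the criteria.
I don't know how to code for this! Is this a nested-pattern problem, or can it be handled as ordinary string processing?
This is the part of my code that deals with it.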
import re

def function_for_processing(file_name, dots, brackets):
    with open(file_name, 'r') as file:
        for line in file:
            if line.startswith('(') or line.startswith('.') or line.startswith(')'):
                # Operate only on lines with bracket/dot annotation
                matches = re.finditer(r'(?:\([\(.]+\()(?:[^()]*\.+[^()]*)\)[\).]+\)', line)
                for match in matches:
                    # Dots of the innermost loop (at least two dots enclosed by one bracket pair)
                    loop_dots = re.findall(r'(?<=\()([^()]*\.{2}[^()]*)(?=\))', match.group(0))
                    # The same match with those loop dots stripped out
                    internal_dots_removed = re.sub(r'(?<=\()([^()]*\.{2}[^()]*)(?=\))', '', match.group(0))
                    loop_count = sum([char == '.' for char in str(loop_dots)])
                    dot_count = internal_dots_removed.count(".")
                    opening_count = internal_dots_removed.count("(")
                    closing_count = internal_dots_removed.count(")")
                    if dot_count <= dots and opening_count == closing_count >= brackets and loop_count >= 3:
                        pass  # Further processing continues...
An example:
If dots = 20 and brackets = 19,
this data point:
(((((((((....(((.((..(((((.((((((((.(.(((((((((((((((.((((((..((......)))))))).))))))))))).)))).).))))....)))).)).)))....)))))...))).))))))
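which has 51 opening brackets, 51 closing brackets and 31 dots (excluding the internal dots, which do not count toward the maximum), is ignored, yet it contains an ideal match: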
(((((.((((((((.(.(((((((((((((((.((((((..((......)))))))).))))))))))).)))).).))))....)))).)).)))
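This stretch has at least the minimum number of brackets on each side of the internal dots, in equal numbers, stays below the maximum dot count, and starts and ends with a bracket.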
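If I understand your question correctly, you want to modify your regex pattern to find the longest stretch of the desired pattern in a line, but you also want to check whether shorter substrings within that stretch satisfy certain conditions.
If that is the case, you may find it easier to break this into multiple steps rather than trying to do it all with a single regular expression.
Here is one approach you could take: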
import re
from typing import Tuple


def find_longest_pattern(line: str, max_dots: int) -> Tuple[str, int]:
    """
    Finds the longest pattern in the given line that satisfies the dot count condition.

    Args:
        line (str): The input line containing bracket/dot annotations.
        max_dots (int): The maximum allowed dot count.

    Returns:
        Tuple[str, int]: A tuple containing the longest matching pattern and its dot count.
    """
    # Find all (non-overlapping) occurrences of the pattern
    matches = re.finditer(r'(?:\([\(.]+\()(?:[^()]*\.+[^()]*)\)[\).]+\)', line)
    # Initialize variables to track the longest match and its dot count
    longest_match = ""
    longest_dot_count = 0
    # Iterate over matches
    for match in matches:
        # Loop dots are computed as in the question's code; they are not used in the checks below
        loop_dots = re.findall(r'(?<=\()([^()]*\.{2}[^()]*)(?=\))', match.group(0))
        internal_dots_removed = re.sub(r'(?<=\()([^()]*\.{2}[^()]*)(?=\))', '', match.group(0))
        loop_count = sum(char == '.' for char in str(loop_dots))
        dot_count = internal_dots_removed.count(".")
        # Check if the current match is longer and satisfies conditions
        if len(match.group(0)) > len(longest_match) and dot_count <= max_dots:
            longest_match = match.group(0)
            longest_dot_count = dot_count
    return longest_match, longest_dot_count


def process_file(file_name: str, max_dots: int, min_brackets: int) -> None:
    """
    Processes the given file, finding and processing patterns that meet specified conditions.

    Args:
        file_name (str): The name of the file to be processed.
        max_dots (int): The maximum allowed dot count for a valid pattern.
        min_brackets (int): The minimum required number of opening and closing brackets for a valid pattern.
    """
    with open(file_name, 'r') as file:
        for line in file:
            if line.startswith('(') or line.startswith('.') or line.startswith(')'):
                # Operate only on lines with bracket/dot annotation
                longest_match, dot_count = find_longest_pattern(line, max_dots)
                opening_count = longest_match.count("(")
                closing_count = longest_match.count(")")
                # Check if the conditions are met
                if dot_count <= max_dots and opening_count == closing_count >= min_brackets:
                    # Further processing continues...
                    print("Found:", longest_match)


# Example usage
file_name = "your_file.txt"
max_dots = 20
min_brackets = 19
process_file(file_name, max_dots, min_brackets)
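This way, the find_longest_pattern function focuses on finding the longest match, and the conditions on shorter substrings within that match are checked separately. This should give you better control over the matching logic and conditions. Adjust the conditions in the if statement to suit your specific requirements.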
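If the longest regex match is still rejected for having too many dots, one alternative worth trying is to leave the regex out of the search itself. This is a different technique from the code above and only a rough sketch: pair the brackets with a stack, treat every matched pair as a candidate stretch (each one is balanced and starts and ends with a bracket), and keep the longest candidate that passes the checks. The names pair_brackets and longest_valid_stretch are hypothetical, min_loop=3 mirrors the loop_count >= 3 test in the question, and the sketch assumes a "loop" is any run of dots directly enclosed by ( and ), which is slightly looser than the \.{2} rule in the question.

```python
import re
from typing import List, Optional, Tuple


def pair_brackets(s: str) -> List[Tuple[int, int]]:
    """Return (open_index, close_index) for every matched bracket pair."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            # Unmatched closing brackets (empty stack) are simply skipped
            pairs.append((stack.pop(), i))
    return pairs


def longest_valid_stretch(s: str, max_dots: int, min_brackets: int,
                          min_loop: int = 3) -> Optional[str]:
    """Longest balanced stretch meeting the dot/bracket limits, or None."""
    best: Optional[str] = None
    for start, end in pair_brackets(s):
        candidate = s[start:end + 1]
        # A span between matched brackets is balanced, so "(" count == ")" count
        n_pairs = candidate.count("(")
        # Assumption: loop dots are dot runs directly enclosed by "(" and ")"
        loop_dots = sum(len(run) for run in re.findall(r"(?<=\()\.+(?=\))", candidate))
        stem_dots = candidate.count(".") - loop_dots
        if n_pairs >= min_brackets and stem_dots <= max_dots and loop_dots >= min_loop:
            if best is None or len(candidate) > len(best):
                best = candidate
    return best


# The outer stretch has 2 non-loop dots, so with max_dots=1 the inner one wins
print(longest_valid_stretch("(.((...)).)", max_dots=1, min_brackets=2))  # -> ((...))
```

This checks every matched pair, so it does quadratic work on long lines in the worst case, which should be acceptable for annotation lines of a few hundred characters. Swap in your own loop rule if runs of a single dot should not count as a loop.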