Finding specific regions within a matched regex pattern?


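I have a regex that matches a specific pattern in my data, which I then process with string operations. The patterns are made up of round brackets and dots: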

((((((.))))))(...(((((..).).)))).

My current regex:

(?:\([\(.]+\()(?:[^()]*\.+[^()]*)\)[\).]+\)

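finds the longest possible match, which is then checked against a minimum number of (matching) opening and closing brackets and a maximum number of dots.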

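However, I've run into a problem: sometimes the longest match meets the minimum bracket count but has too many dots, so the pattern isn't extracted, even though there is a stretch inside it that does meet the criteria.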

I have no idea how to code for this! Is this a nested pattern problem, or can it be handled as a regular string?

Here is the part of my code that deals with this.

import re

def function_for_processing(file_name, dots, brackets):

    with open(file_name, 'r') as file:
        for line in file:
            if line.startswith('(') or line.startswith('.') or line.startswith(')'):
                # Operate only on lines with bracket/dot annotation
                matches = re.finditer(r'(?:\([\(.]+\()(?:[^()]*\.+[^()]*)\)[\).]+\)', line)
                for match in matches:
                    loop_dots = re.findall(r'(?<=\()([^()]*\.{2}[^()]*)(?=\))', match.group(0))
                    internal_dots_removed = re.sub(r'(?<=\()([^()]*\.{2}[^()]*)(?=\))', '', match.group(0))

                    loop_count = sum([char == '.' for char in str(loop_dots)])

                    dot_count = internal_dots_removed.count(".")
                    opening_count = internal_dots_removed.count("(")
                    closing_count = internal_dots_removed.count(")")
                    if dot_count <= dots and opening_count == closing_count >= brackets and loop_count >= 3:
                        # Further processing continues...
                        pass

An example:

With dots = 20 and brackets = 19

this data point:

(((((((((....(((.((..(((((.((((((((.(.(((((((((((((((.((((((..((......)))))))).))))))))))).)))).).))))....)))).)).)))....)))))...))).))))))

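which has 51 opening brackets, 51 closing brackets and 31 dots (not counting the internal dots, which don't count toward the maximum), is ignored, but it contains an ideal match: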

(((((.((((((((.(.(((((((((((((((.((((((..((......)))))))).))))))))))).)))).).))))....)))).)).)))

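which has at least the minimum number of brackets on each side of the internal dots, in equal numbers, stays below the maximum number of dots, and starts and ends with a bracket.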

python regex string pattern-matching
1 Answer

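If I understand your question correctly, you want to modify your regex to find the longest stretch of the desired pattern in a line, but you also want to check whether shorter substrings within that stretch meet certain conditions.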

If that's the case, you may find it easier to break this down into several steps rather than trying to do it all with a single regex.
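One reason a single pass may not be enough: `re.finditer` reports non-overlapping matches, so a stretch nested inside a longer match is never returned on its own. The small check below illustrates this with the regex from the question; the toy `line` and `inner` strings are made up for the demonstration and are not taken from your data.

```python
import re

# The pattern from the question, unchanged
PATTERN = re.compile(r'(?:\([\(.]+\()(?:[^()]*\.+[^()]*)\)[\).]+\)')

# Hypothetical toy line: a three-pair stem wrapped in one more pair with bulges
line = "(.(((...))).)"
inner = "(((...)))"

# finditer consumes the whole line as one greedy match
matches = [m.group(0) for m in PATTERN.finditer(line)]
print(matches)

# The inner stretch would be a valid match on its own ...
print(PATTERN.fullmatch(inner) is not None)
# ... but it is not among the matches reported for the whole line
print(inner in matches)
```

So if the outer match fails your dot limit, the inner stretch has to be found by some other step.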

You could take an approach like this:

import re
from typing import Tuple

def find_longest_pattern(line: str, max_dots: int) -> Tuple[str, int]:
    """
    Finds the longest pattern in the given line that satisfies the dot count condition.

    Args:
        line (str): The input line containing bracket/dot annotations.
        max_dots (int): The maximum allowed dot count.

    Returns:
        Tuple[str, int]: A tuple containing the longest matching pattern and its dot count.
    """
    # Find all occurrences of the pattern
    matches = re.finditer(r'(?:\([\(.]+\()(?:[^()]*\.+[^()]*)\)[\).]+\)', line)
    
    # Initialize variables to track the longest match and its dot count
    longest_match = ""
    longest_dot_count = 0
    
    # Iterate over matches
    for match in matches:
        # Drop the dots inside the innermost loop; they do not count toward max_dots
        internal_dots_removed = re.sub(r'(?<=\()([^()]*\.{2}[^()]*)(?=\))', '', match.group(0))
        dot_count = internal_dots_removed.count(".")
        
        # Check if the current match is longer and satisfies conditions
        if len(match.group(0)) > len(longest_match) and dot_count <= max_dots:
            longest_match = match.group(0)
            longest_dot_count = dot_count
    
    return longest_match, longest_dot_count

def process_file(file_name: str, max_dots: int, min_brackets: int) -> None:
    """
    Processes the given file, finding and processing patterns that meet specified conditions.

    Args:
        file_name (str): The name of the file to be processed.
        max_dots (int): The maximum allowed dot count for a valid pattern.
        min_brackets (int): The minimum required number of opening and closing brackets for a valid pattern.
    """
    with open(file_name, 'r') as file:
        for line in file:
            if line.startswith('(') or line.startswith('.') or line.startswith(')'):
                # Operate only on lines with bracket/dot annotation
                longest_match, dot_count = find_longest_pattern(line, max_dots)

                opening_count = longest_match.count("(")
                closing_count = longest_match.count(")")
                
                # Check if the conditions are met
                if dot_count <= max_dots and opening_count == closing_count >= min_brackets:
                    # Further processing continues...
                    print("Found:", longest_match)

# Example usage (guarded so that importing this module does not try to open the file)
if __name__ == "__main__":
    file_name = "your_file.txt"
    max_dots = 20
    min_brackets = 19
    process_file(file_name, max_dots, min_brackets)

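This way, the find_longest_pattern function concentrates on finding the longest match that stays within the dot limit, and process_file then checks the bracket conditions on that match separately. This should give you finer control over the matching logic and conditions. Adjust the conditions in the if statements to suit your specific requirements.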
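If the stretch you want sits inside a longer match, another option is to skip the regex for that step and pair the brackets with a stack, then grow outward from each innermost loop one bracket pair at a time until the dot limit would be exceeded. The sketch below is only a rough illustration under a few assumptions: a valid stretch is a single loop with `(` and dots on its left and `)` and dots on its right, the dots inside the loop are not counted, and the loop needs at least three dots (mirroring `loop_count >= 3` in the question). The names `pair_table`, `skip_dots` and `best_stretches` are hypothetical helpers, not part of any library.

```python
from typing import Dict, Iterator, Tuple


def pair_table(s: str) -> Dict[int, int]:
    """Map the index of each '(' to the index of its matching ')'."""
    stack, pairs = [], {}
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            pairs[stack.pop()] = i
    return pairs


def skip_dots(s: str, i: int, step: int) -> int:
    """Starting at index i, move in direction step past any dots."""
    while 0 <= i < len(s) and s[i] == ".":
        i += step
    return i


def best_stretches(s: str, max_dots: int, min_pairs: int,
                   min_loop: int = 3) -> Iterator[Tuple[str, int, int]]:
    """For each innermost loop, yield the longest enclosing stretch as
    (stretch, bracket_pairs, dots_outside_loop) that stays within max_dots
    and has at least min_pairs bracket pairs (assumes max_dots >= 0)."""
    pairs = pair_table(s)
    for i, j in sorted(pairs.items()):
        loop = s[i + 1:j]
        if len(loop) < min_loop or set(loop) != {"."}:
            continue  # not an innermost loop of dots
        n_pairs, n_dots, best = 1, 0, None
        while True:
            if n_pairs >= min_pairs:  # n_dots <= max_dots always holds here
                best = (s[i:j + 1], n_pairs, n_dots)
            # Look for the next bracket on each side, skipping bulge dots
            p = skip_dots(s, i - 1, -1)
            q = skip_dots(s, j + 1, +1)
            if p < 0 or pairs.get(p) != q:
                break  # no enclosing pair wraps the current stretch
            n_dots += (i - p - 1) + (q - j - 1)
            if n_dots > max_dots:
                break  # growing further would exceed the dot limit
            i, j, n_pairs = p, q, n_pairs + 1
        if best is not None:
            yield best


# Toy input (made up for illustration)
for stretch, n_pairs, n_dots in best_stretches("..((.((...)).)).."
                                               , max_dots=2, min_pairs=4):
    print(stretch, n_pairs, n_dots)  # ((.((...)).)) 4 2
```

Because the dot count can only grow as the stretch widens, the last stretch recorded before the limit is hit is the longest valid one for that loop. Calling `best_stretches(line, 20, 19)` on each annotation line would be the equivalent of the `dots = 20`, `brackets = 19` example in the question, though I have not run it against your file.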
