How can I parse and group hierarchical list items from an unindented string in Python?

Question

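Problem statement

Given an unindented string as input, perform the following steps:

1. Identify the list items at the highest level of the hierarchy in the string. These top-level items can be recognized by the following criteria:
- Numbering systems (e.g., 1., 2., 3.)
- Lettering systems (e.g., A., B., C.)
- Bullet points (e.g., -, *, •)
- Symbols (e.g., >, #, §)
2. For each top-level item identified in step 1:

a. Group it with all subsequent lower-level items until the next top-level item is encountered. Lower-level items can be recognized by the following criteria:
- Prefixes (e.g., 1.1, 1.2, 1.3)
- Bullet points (e.g., -, *, •)
- Alphanumeric sequences (e.g., a., b., c.)
- Roman numerals (e.g., i., ii., iii.)
b. Concatenate the top-level item and its associated lower-level items into a single string, preserving the formatting and delimiters exactly as they appear in the input string.
3. Return the grouped list items as a Python list in which each element is one string containing a top-level item followed by its associated lower-level items.
4. Exclude from the output any text that appears before the first top-level item or after the last top-level item. Only the content from the first top-level item through the last one (including its lower-level items) should appear in the output list.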
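To make the steps above concrete, here is a small made-up example of the input and output shape I have in mind. It is only an illustration and is not taken from the linked sample data; the names `sample_input` and `expected_output` are hypothetical, and it assumes that a trailing line without any list marker counts as text after the list.

```python
# Hypothetical input: no indentation, with text before and after the list.
sample_input = (
    "Here are the objectives:\n"
    "1. Prepare the data\n"
    "- collect the files\n"
    "- remove duplicates\n"
    "2. Train the model\n"
    "a. choose parameters\n"
    "b. run the training\n"
    "Thanks for reading."
)

# One string per top-level item, each followed by its lower-level items.
# The intro line and the trailing line are excluded.
expected_output = [
    "1. Prepare the data\n- collect the files\n- remove duplicates",
    "2. Train the model\na. choose parameters\nb. run the training",
]

# Formatting is preserved, so every element is a verbatim slice of the input.
assert all(chunk in sample_input for chunk in expected_output)
```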

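Goal

The goal is a Python function that takes an unindented string as input, identifies the top-level items and their associated lower-level items according to the criteria above, concatenates each top-level item with its lower-level items into a single string while keeping the original formatting and delimiters, and returns the grouped items as a Python list. Each element of the output list should represent one top-level item together with its lower-level items.

Request

Please explain how to write a Python function that achieves this goal. The explanation should cover the steps involved, any data structures or algorithms that are needed, and considerations for handling different scenarios and edge cases.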

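Additional details

I have tried to write a Python function for this task, but none of my attempts produce the expected output for the given inputs.
To help with testing and validating a solution, I have put together a large set of sample inputs with their expected outputs, linked below. These test cases cover a variety of scenarios and edge cases to check that the function is robust.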
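For checking a candidate function against such input/output pairs, a small harness along these lines could be used. This is only a sketch: `run_cases`, `naive_split` and the single test case are hypothetical placeholders, not part of the linked sample data.

```python
def run_cases(func, cases):
    """Run func on each (text, expected) pair and collect the mismatches."""
    failures = []
    for index, (text, expected) in enumerate(cases):
        actual = func(text)
        if actual != expected:
            failures.append((index, expected, actual))
    return failures


# Hypothetical test case: one input string and the list it should produce.
cases = [
    ("1. a\n- b\n2. c", ["1. a\n- b", "2. c"]),
]


def naive_split(text):
    # Placeholder candidate that just splits on newlines, so it fails the case.
    return text.split("\n")


for index, expected, actual in run_cases(naive_split, cases):
    print(f"case {index}: expected {expected!r}, got {actual!r}")
```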

Code attempts:

Attempt 1:

def process_list_hierarchy(text):
    # Helper function to determine the indentation level
    def get_indentation_level(line):
        return len(line) - len(line.lstrip())

    # Helper function to parse the input text into a list of lines with their hierarchy levels
    def parse_hierarchy(text):
        lines = text.split('\n')
        hierarchy = []
        for line in lines:
            if line.strip():  # Ignore empty lines
                level = get_indentation_level(line)
                hierarchy.append((level, line.strip()))
        return hierarchy

    # Helper function to build a tree structure from the hierarchy levels
    def build_tree(hierarchy):
        tree = []
        stack = [(-1, tree)]  # Start with a dummy root level
        for level, content in hierarchy:
            # Find the correct parent level
            while stack and stack[-1][0] >= level:
                stack.pop()
            # Create a new node and add it to its parent's children
            node = {'content': content, 'children': []}
            stack[-1][1].append(node)
            stack.append((level, node['children']))
        return tree

    # Helper function to combine the tree into a single list
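    def combine_tree(tree, combined_list=None, level=0):
        # Use None instead of a mutable default so results are not shared between calls
        if combined_list is None:
            combined_list = []
        for node in tree:
            combined_list.append(('  ' * level) + node['content'])
            if node['children']:
                combine_tree(node['children'], combined_list, level + 1)
        return combined_list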

    # Parse the input text into a hierarchy
    hierarchy = parse_hierarchy(text)
    # Build a tree structure from the hierarchy
    tree = build_tree(hierarchy)
    # Combine the tree into a single list while maintaining the hierarchy
    combined_list = combine_tree(tree)
    # Return the combined list as a string
    return '\n'.join(combined_list)

Attempt 2:

import re
from itertools import groupby

def organize_hierarchically(items):
    def get_level(item):
        match = re.match(r'^(\d+\.?|\-|\*)', item)
        return len(match.group()) if match else 0

    grouped_items = []
    for level, group in groupby(items, key=get_level):
        if level == 1:
            grouped_items.append('\n'.join(group))
        else:
            grouped_items[-1] += '\n' + '\n'.join(group)

    return grouped_items

Attempt 3:

from bs4 import BeautifulSoup
import nltk

def extract_sub_objectives(input_text):
    soup = BeautifulSoup(input_text, 'html.parser')
    text_content = soup.get_text()

    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text_content)

    # Initialize an empty list to store the sub-objectives
    sub_objectives = []

    # Iterate through the sentences and extract sub-objectives
    current_sub_objective = ""
    for sentence in sentences:
        if sentence.startswith(("1.", "2.", "3.", "4.")):
            if current_sub_objective:
                sub_objectives.append(current_sub_objective)
                current_sub_objective = ""
            current_sub_objective += sentence + "\n"
        elif current_sub_objective:
            current_sub_objective += sentence + "\n"

    # Append the last sub-objective, if any
    if current_sub_objective:
        sub_objectives.append(current_sub_objective)

    return sub_objectives

Attempt 4:

def extract_sub_objectives(input_text, preserve_formatting=False):
    # Modified to strip both single and double quotes
    input_text = input_text.strip('\'"')
    messages = []
    messages.append("Debug: Starting to process the input text.")
    # Debug message to show the input text after stripping quotes
    messages.append(f"Debug: Input text after stripping quotes: '{input_text}'")

    # Define possible starting characters for new sub-objectives
    start_chars = [str(i) + '.' for i in range(1, 100)]  # Now includes up to two-digit numbering
    messages.append(f"Debug: Start characters defined: {start_chars}")

    # Define a broader range of continuation characters
    continuation_chars = ['-', '*', '+', '•', '>', '→', '—']  # Expanded list
    messages.append(f"Debug: Continuation characters defined: {continuation_chars}")

    # Replace escaped newline characters with actual newline characters
    input_text = input_text.replace('\\n', '\n')
    # Split the input text into lines
    lines = input_text.split('\n')
    messages.append(f"Debug: Input text split into lines: {lines}")

    # Initialize an empty list to store the sub-objectives
    sub_objectives = []
    # Initialize an empty string to store the current sub-objective
    current_sub_objective = ''
    # Initialize a counter for the number of continuations in the current sub-objective
    continuation_count = 0

    # Function to determine if a line is a new sub-objective
    def is_new_sub_objective(line):
        # Strip away leading quotation marks and whitespace
        line = line.strip('\'"').strip()
        return any(line.startswith(start_char) for start_char in start_chars)

    # Function to determine if a line is a continuation
    def is_continuation(line, prev_line):
        if not prev_line:
            return False
        # Check if the line starts with an alphanumeric followed by a period or parenthesis
        if len(line) > 1 and line[0].isalnum() and (line[1] == '.' or line[1] == ')'):
            # Check if it follows the sequence of the previous line
            if line[0].isdigit() and prev_line[0].isdigit() and int(line[0]) == int(prev_line[0]) + 1:
                return False
            elif line[0].isalpha() and prev_line[0].isalpha() and ord(line[0].lower()) == ord(prev_line[0].lower()) + 1:
                return False
            else:
                return True
        # Add a condition to check for lower-case letters followed by a full stop
        # (the length check avoids an IndexError on empty or one-character lines)
        if len(line) > 1 and line[0].islower() and line[1] == '.':
            return True
        return any(line.startswith(continuation_char) for continuation_char in continuation_chars)

    # Iterate over each line
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i > 0 else ''
        # Check if the line is a new sub-objective
        if is_new_sub_objective(line):
            messages.append(f"Debug: Found a new sub-objective at line {i + 1}: '{line}'")
            # If we have a current sub-objective, check the continuation count
            if current_sub_objective:
                if continuation_count < 2:
                    messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
                    for message in messages:
                        print(message)
                    return None
                # Check the preserve_formatting parameter before adding
                sub_objectives.append(
                    current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
                messages.append(f"Debug: Added a sub-objective to the list. Current count: {len(sub_objectives)}.")
            # Reset the current sub-objective to the new one and reset the continuation count
            current_sub_objective = line
            continuation_count = 0
        # Check if the line is a continuation
        elif is_continuation(line, prev_line):
            messages.append(f"Debug: Line {i + 1} is a continuation of the previous line: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()
            # Increment the continuation count
            continuation_count += 1
        # Handle lines that are part of the current sub-objective but don't start with a continuation character
        elif current_sub_objective:
            messages.append(f"Debug: Line {i + 1} is part of the current sub-objective: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()

    # If we have a current sub-objective, check the continuation count before adding it to the list
    if current_sub_objective:
        if continuation_count < 2:
            messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
            for message in messages:
                print(message)
            return None
        # Check the preserve_formatting parameter before adding
        sub_objectives.append(current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
        messages.append(f"Debug: Added the final sub-objective to the list. Final count: {len(sub_objectives)}.")

    # Print the debug messages if no sub-objectives are found
    if not sub_objectives:
        for message in messages:
            print(message)

    return sub_objectives

Sample data (inputs and corresponding expected outputs):

https://pastebin.com/s8nWktbZ

Answer 1

As I understand the problem, this should work:

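def parse_list(text):
    def helper(lines, start, level):
        # Parse consecutive lines indented by exactly `level` spaces, starting
        # at index `start`. Returns the parsed nodes and the index of the
        # first line that was not consumed.
        result = []
        i = start
        while i < len(lines):
            line = lines[i]
            if len(line) - len(line.lstrip(' ')) != level:
                break
            # Deeper-indented lines that follow become this item's children
            children, i = helper(lines, i + 1, level + 1)
            if '.' in line:
                key, value = line.split('.', 1)
                result.append({key.strip(): value.strip(), 'children': children})
            else:
                result.append({'item': line.strip(), 'children': children})
        return result, i

    # Keep leading whitespace: it is the only signal of nesting depth here
    lines = [line.rstrip() for line in text.split('\n') if line.strip()]
    parsed, _ = helper(lines, 0, 0)
    return parsed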

# Example usage:
unindented_string = """
Item 1
 Subitem 1.1
 Subitem 1.2
Item 2
 Subitem 2.1
 Subitem 2.2
  Subsubitem 2.2.1
  Subsubitem 2.2.2
Item 3
 Subitem 3.1
 Subitem 3.2
"""

parsed_list = parse_list(unindented_string)
print(parsed_list)
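
If the input really has no indentation to rely on, an alternative is to look at the list markers themselves. The following is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the first marker found in the text is the top-level style, that every later line with the same style starts a new group, and that lines without any marker (intro text, trailing text, blank lines) are dropped. The names `_STYLES`, `marker_style` and `group_top_level` are made up for this example.

```python
import re

# Hypothetical marker styles, tried in order. Lowercase Roman numerals
# (i., ii.) fall under "lower", so they are not told apart from a., b.
_STYLES = [
    ("dotted", r"\d+(?:\.\d+)+\.?"),   # 1.1  1.2.3
    ("number", r"\d+[.)]"),            # 1.   2)
    ("upper", r"[A-Z]{1,4}[.)]"),      # A.   II.
    ("lower", r"[a-z]{1,4}[.)]"),      # a.   ii.
    ("symbol", r"[-*•>#§+]"),          # bullets and symbols
]
_MARKER = re.compile(
    r"^\s*(?:"
    + "|".join(f"(?P<{name}>{pat})" for name, pat in _STYLES)
    + r")(?:\s+|$)"
)


def marker_style(line):
    """Return a label for the line's list marker, or None if it has none."""
    match = _MARKER.match(line)
    if match is None:
        return None
    if match.lastgroup == "symbol":
        return match.group("symbol")  # each symbol counts as its own style
    return match.lastgroup


def group_top_level(text):
    """Group each top-level item with the lower-level items that follow it."""
    groups = []
    top_style = None
    for line in text.split("\n"):
        style = marker_style(line)
        if top_style is None:
            if style is None:
                continue          # text before the first list item is dropped
            top_style = style     # assumption: the first marker is top level
        if style == top_style:
            groups.append(line)   # a new top-level item starts a new group
        elif style is not None:
            groups[-1] += "\n" + line  # lower-level item joins current group
        # lines without a marker are skipped (assumption, see above)
    return groups


text = "Intro text\n1. First\na. alpha\nb. beta\n2. Second\n- dash\nClosing remark"
print(group_top_level(text))
# ['1. First\na. alpha\nb. beta', '2. Second\n- dash']
```

A known limitation of this sketch: when the top-level and lower-level items use the same marker (for example, both use `-`), they cannot be told apart without indentation, so every such line starts its own group.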
