Extracting comic book series and issue bundles from a text string in Python

Problem description

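I have a string of text that lists comic book series and their issues, all on one line in one long block.

Examples:

Example 1

"Batman #323, 325, 335, 340, 368-369, 397-400, Amazing Spider-Man #13-17"

Example 2

"Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, 185, 213, 245, 326"

I want to point out that "#nn" should be kept as an issue of the series. If it makes things easier, I can replace "#nn" with "#00".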

I have been trying to use regular expressions (regex) in Python. For example, I tried

r"([a-zA-Z\s\'-]+) #(\d+|\d+-\d+|\w+)" 

The code I wrote is below

import re

def separate_comic_books(comic_books_str):
    series_issue_dict = {}

    # Define a regular expression pattern to extract series and issue information
    pattern = re.compile(r'([a-zA-Z\s\'-]+) #(\d+|\d+-\d+|\w+)')

    # Split the string into individual comic book entries
    comic_books_list = re.split(',\s*', comic_books_str)

    # Iterate through the list of comic books
    for comic_book in comic_books_list:
        matches = pattern.findall(comic_book)
        print(matches)

        for match in matches:
            series = match[0].strip()
            issues = match[1].strip()

            # Split the issues if it's a range
            issues_list = [str(i) for i in range(int(issues.split('-')[0]), int(issues.split('-')[-1]) + 1)]

            # Add the comic book to the dictionary based on series
            if series in series_issue_dict:
                series_issue_dict[series].extend(issues_list)
            else:
                series_issue_dict[series] = issues_list

    # Create the final formatted string
    formatted_comic_books = []
    for series, issues in series_issue_dict.items():
        formatted_issues = ', '.join([f"{series} #{issue}" for issue in sorted(issues)])
        formatted_comic_books.append(formatted_issues)

    return ', '.join(formatted_comic_books)

# Provided string of comic books
comic_books_str = "Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, 185, 213, 245, 326"

result = separate_comic_books(comic_books_str)
print(result)

However, I get the following results

Example 1

"Batman #323, Amazing Spider-Man #13"

Example 2

ValueError: invalid literal for int() with base 10: 'nn'
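
Both symptoms can be reproduced in isolation. The snippet below is only a diagnostic sketch that reuses the pattern from the question; the names `original` and `reordered` are illustrative and not part of the posted code. It suggests three causes: regex alternatives are tried left to right, so `\d+` wins before `\d+-\d+` can capture a full range; entries such as "325" have no "#" after the comma split and never match; and "nn" is matched by `\w+` but cannot be passed to `int()`.

```python
import re

# The pattern from the question: \d+ is tried before \d+-\d+,
# so a range like "13-17" is cut off after the first number.
original = re.compile(r"([a-zA-Z\s\'-]+) #(\d+|\d+-\d+|\w+)")
print(original.findall("Amazing Spider-Man #13-17"))

# Hypothetical variant with the range alternative first: the whole range is captured.
reordered = re.compile(r"([a-zA-Z\s\'-]+) #(\d+-\d+|\d+|\w+)")
print(reordered.findall("Amazing Spider-Man #13-17"))

# After splitting on commas, a bare issue has no title and no '#', so nothing matches.
print(original.findall("325"))

# "nn" is captured by \w+, but it is not numeric, so int("nn") raises ValueError.
print("nn".isdigit())
```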

However, I would like to get the following results

Example 1

Batman #323, Batman #325, Batman #335, Batman #340, Batman #368, Batman #369, Batman #397, Batman #398, Batman #399, Batman #400, Amazing Spider-Man #13, Amazing Spider-Man #14, Amazing Spider-Man #15, Amazing Spider-Man #16, Amazing Spider-Man #17

Example 2

Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, Amazing Spider-Man 185, Amazing Spider-Man 213, Amazing Spider-Man 245, Amazing Spider-Man 326

Is there a way to write Python code that does this?

Thanks a lot!!

python regex python-re
1 Answer

I'm not using regex here, just a simple split(). The whole class thing may be overkill, but hey, I'm practicing too!

from collections import defaultdict
from itertools import chain


class Comics:

    def __init__(self, comic_list: str):  # a comma-separated string, not a list
        self.comic_list = comic_list
        self.comic_dictionary = defaultdict(list)
        self._generate_comic_dictionary()

    def _generate_comic_dictionary(self):
        current_title = ''
        for comic in self.comic_list.split(','):
            comic = comic.strip()
            if '#' in comic:
                title, issue = comic.split('#')
                current_title = title
            else:
                issue = comic
            self._add_comics_to_dictionary(current_title, issue)

    def _add_comics_to_dictionary(self, title: str, issue: str):
        title = title.strip()
        issue = issue.strip()
        if '-' in issue:
            start, end = issue.split('-')
            self.comic_dictionary[title].extend([str(i) for i in range(int(start), int(end) + 1)])
        else:
            self.comic_dictionary[title].append(issue)

    def get_comics_dictionary(self):
        return self.comic_dictionary

    def __str__(self):
        return_list = [[title + ' #' + issue for issue in issues] for title, issues in self.comic_dictionary.items()]
        return ', '.join(list(chain(*return_list)))

Output

comics_list = 'Batman #nn, 325, 335, 340, 368-369, 397-400, Amazing Spider-Man #13-17'
comics = Comics(comics_list)
print(comics)
Batman #nn, Batman #325, Batman #335, Batman #340, Batman #368, Batman #369, Batman #397, Batman #398, Batman #399, Batman #400, Amazing Spider-Man #13, Amazing Spider-Man #14, Amazing Spider-Man #15, Amazing Spider-Man #16, Amazing Spider-Man #17
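
The class above relies on every title being followed by "#", so an entry such as "Amazing Spider-Man 174" from Example 2 is not split into a title and an issue. One possible way to cover both examples is sketched below. It is an assumption-laden sketch, not the answer's method: the names `ENTRY` and `expand_comics` are hypothetical, entries that do not look like issues are silently skipped, and a bare issue inherits both the title and the "#" style of the most recent titled entry. Input order is preserved, with no grouping or sorting.

```python
import re

# Optional title, optional '#', then a number, a range, or a '#'-prefixed label such as "nn".
ENTRY = re.compile(r"^(?:(?P<title>.+?)\s+)?(?P<marker>#?)(?P<issue>\d+(?:-\d+)?|(?<=#)\w+)$")


def expand_comics(text: str) -> str:
    """Expand a comma-separated comic list into one item per issue."""
    expanded = []
    title, marker = "", ""
    for entry in text.split(","):
        match = ENTRY.match(entry.strip())
        if match is None:
            continue  # assumption: skip entries that do not look like issues
        if match["title"]:
            # A new title switches series and remembers whether it used '#'.
            title, marker = match["title"], match["marker"]
        issue = match["issue"]
        if "-" in issue:
            # Only numeric ranges can contain '-', so int() is safe here.
            start, end = map(int, issue.split("-"))
            issues = [str(n) for n in range(start, end + 1)]
        else:
            issues = [issue]  # plain numbers and labels like "nn" pass through unchanged
        expanded.extend(f"{title} {marker}{n}".strip() for n in issues)
    return ", ".join(expanded)


print(expand_comics("Batman #323, 325, 335, 340, 368-369, 397-400, Amazing Spider-Man #13-17"))
print(expand_comics("Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, 185, 213, 245, 326"))
```

Keeping "nn" as a string avoids the int() conversion entirely, so there is no need to replace it with "#00" first.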
