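I have a text string that contains a list of comic book series, with every entry run together in one long block.
Examples:
Example 1
"Batman #323, 325, 335, 340, 368-369, 397-400, Amazing Spider-Man #13-17"
Example 2
"Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, 185, 213, 245, 326"
Note that "#nn" should be kept as an issue of its series. If it makes things easier, I can replace "#nn" with "#00".
I have been trying to do this in Python with regular expressions (regex). For example, I tried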
r"([a-zA-Z\s\'-]+) #(\d+|\d+-\d+|\w+)"
The code I wrote is below
import re

def separate_comic_books(comic_books_str):
    series_issue_dict = {}
    # Define a regular expression pattern to extract series and issue information
    pattern = re.compile(r'([a-zA-Z\s\'-]+) #(\d+|\d+-\d+|\w+)')
    # Split the string into individual comic book entries
    comic_books_list = re.split(',\s*', comic_books_str)
    # Iterate through the list of comic books
    for comic_book in comic_books_list:
        matches = pattern.findall(comic_book)
        print(matches)
        for match in matches:
            series = match[0].strip()
            issues = match[1].strip()
            # Split the issues if it's a range
            issues_list = [str(i) for i in range(int(issues.split('-')[0]), int(issues.split('-')[-1]) + 1)]
            # Add the comic book to the dictionary based on series
            if series in series_issue_dict:
                series_issue_dict[series].extend(issues_list)
            else:
                series_issue_dict[series] = issues_list
    # Create the final formatted string
    formatted_comic_books = []
    for series, issues in series_issue_dict.items():
        formatted_issues = ', '.join([f"{series} #{issue}" for issue in sorted(issues)])
        formatted_comic_books.append(formatted_issues)
    return ', '.join(formatted_comic_books)

# Provided string of comic books
comic_books_str = "Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, 185, 213, 245, 326"
result = separate_comic_books(comic_books_str)
print(result)
However, I get the following results
Example 1
"Batman #323, Amazing Spider-Man #13"
Example 2
ValueError: invalid literal for int() with base 10: 'nn'
What I want instead is the following
Example 1
Batman #323, Batman #325, Batman #335, Batman #340, Batman #368, Batman #369, Batman #397, Batman #398, Batman #399, Batman #400, Amazing Spider-Man #13, Amazing Spider-Man #14, Amazing Spider-Man #15, Amazing Spider-Man #16, Amazing Spider-Man #17
Example 2
Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, Amazing Spider-Man 185, Amazing Spider-Man 213, Amazing Spider-Man 245, Amazing Spider-Man 326
Is there a way to write Python code that does this?
Thanks a lot!!
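Both failures reported in the question can be reproduced in isolation, which may help when debugging the pattern. The sketch below reuses the pattern from the question unchanged; the variable names (`bare`, `cut`, `kept`, `nn_error`) and the reordered pattern are illustrative assumptions, not part of the original code.

```python
import re

# Pattern copied from the question.
pattern = re.compile(r"([a-zA-Z\s\'-]+) #(\d+|\d+-\d+|\w+)")

# 1. After splitting on commas, a bare issue such as "325" contains no "#",
#    so the pattern finds nothing and the issue is silently dropped.
bare = pattern.findall("325")

# 2. Alternatives are tried left to right, so \d+ succeeds before \d+-\d+
#    gets a chance and the range "13-17" is cut down to "13".
cut = pattern.findall("Amazing Spider-Man #13-17")

# Putting the range alternative first keeps the whole range (illustrative fix).
reordered = re.compile(r"([a-zA-Z\s\'-]+) #(\d+-\d+|\d+|\w+)")
kept = reordered.findall("Amazing Spider-Man #13-17")

# 3. "nn" is matched by \w+ and later passed to int(), which raises ValueError.
try:
    int("nn")
    nn_error = None
except ValueError as exc:
    nn_error = str(exc)

print(bare)
print(cut)
print(kept)
print(nn_error)
```

Reordering the alternatives only addresses the truncated range. Issues without a "#" and the non-numeric "nn" still need separate handling.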
I didn't use regular expressions here, just a simple split(). The whole class thing may be overkill, but hey, I'm practicing too!
from collections import defaultdict
from itertools import chain


class Comics:
    def __init__(self, comic_list: str):
        # comic_list is the raw comma-separated string, not a Python list
        self.comic_list = comic_list
        self.comic_dictionary = defaultdict(list)
        self._generate_comic_dictionary()

    def _generate_comic_dictionary(self):
        current_title = ''
        for comic in self.comic_list.split(','):
            comic = comic.strip()
            if '#' in comic:
                # A new title starts here; remember it for the bare issues that follow
                title, issue = comic.split('#')
                current_title = title
            else:
                issue = comic
            self._add_comics_to_dictionary(current_title, issue)

    def _add_comics_to_dictionary(self, title: str, issue: str):
        title = title.strip()
        issue = issue.strip()
        if '-' in issue:
            # Expand a range such as "397-400" into its individual issues
            start, end = issue.split('-')
            self.comic_dictionary[title].extend([str(i) for i in range(int(start), int(end) + 1)])
        else:
            self.comic_dictionary[title].append(issue)

    def get_comics_dictionary(self):
        return self.comic_dictionary

    def __str__(self):
        return_list = [[title + ' #' + issue for issue in issues] for title, issues in self.comic_dictionary.items()]
        return ', '.join(list(chain(*return_list)))
Output
comics_list = 'Batman #nn, 325, 335, 340, 368-369, 397-400, Amazing Spider-Man #13-17'
comics = Comics(comics_list)
print(comics)
Batman #nn, Batman #325, Batman #335, Batman #340, Batman #368, Batman #369, Batman #397, Batman #398, Batman #399, Batman #400, Amazing Spider-Man #13, Amazing Spider-Man #14, Amazing Spider-Man #15, Amazing Spider-Man #16, Amazing Spider-Man #17
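The class above expects every title to be followed by "#", so an entry like "Amazing Spider-Man 174" from Example 2 is treated as an issue and fails on the hyphen in "Spider-Man". A minimal function-based sketch of the same split() idea that also covers that case is shown below. The names `expand_issue` and `expand_comics` are hypothetical, and it assumes that an entry containing spaces but no "#" has the form "<title> <issue>". It keeps the entries in input order instead of grouping them by title.

```python
def expand_issue(issue: str) -> list[str]:
    """Expand '397-400' into ['397', ..., '400']; return anything else (e.g. 'nn') unchanged."""
    start, dash, end = issue.partition('-')
    if dash and start.isdigit() and end.isdigit():
        return [str(n) for n in range(int(start), int(end) + 1)]
    return [issue]


def expand_comics(text: str) -> str:
    # prefix is the current title plus its separator,
    # e.g. 'Batman #' or 'Amazing Spider-Man '
    prefix = ''
    result = []
    for token in text.split(','):
        token = token.strip()
        if not token:
            continue
        if '#' in token:
            # "<title> #<issue>": start a new title that uses '#'
            title, _, issue = token.rpartition('#')
            prefix = title.strip() + ' #'
        elif ' ' in token:
            # Assumption: spaces but no '#' means "<title> <issue>"
            title, _, issue = token.rpartition(' ')
            prefix = title.strip() + ' '
        else:
            # Bare issue or range: reuse the most recent title
            issue = token
        result.extend(prefix + n for n in expand_issue(issue.strip()))
    return ', '.join(result)


example_1 = 'Batman #323, 325, 335, 340, 368-369, 397-400, Amazing Spider-Man #13-17'
example_2 = ('Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, '
             'Amazing Spider-Man 174, 185, 213, 245, 326')
print(expand_comics(example_1))
print(expand_comics(example_2))
```

Because "nn" is never passed to int(), it can stay as it is and does not need to be replaced with "00".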