I am reading a large text file in Python that looks like the following (it contains many Code and Description blocks).
Over-ride Flag for Site/Laterality/Morphology (Interfield Edit 42)
This field is used to identify whether a case was reviewed and coding confirmed
for paired-organ primary
site cases with an in situ behavior and the laterality is not coded right,
left, or one side involved, right or left
origin not specified.
Code Description
Blank Not reviewed, or reviewed and corrected
1 Reviewed and confirmed as reported: A patient had behavior
code of in situ and laterality is not
stated as right: origin of primary; left: origin of primary; or only one side
involved, right or left
origin not specified
This field is used to identify whether a case was reviewed and coding confirmed
for cases with a non-
specific laterality code.
Code Description
Blank1 Not reviewed
11 A patient had laterality
coded non-specifically and
extension coded specifically
This field, new for 2018, indicates whether a case was reviewed and coding
............
From the free text above, I only need to store the code and description values in two lists, as shown below.
code = ["Blank", "1", "Blank1", "11"]
des = ["Not reviewed, or reviewed and corrected", "Reviewed and confirmed as reported: A patient had behavior code of in situ and laterality is not stated as right: origin of primary; left: origin of primary; or only one side involved, right or left origin not specified", "Not reviewed", "A patient had laterality coded non-specifically and extension coded specifically"]
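How can I do this in Python?
Note: a Code can be the keyword "Blank" (or "Blank1") or a numeric value. Sometimes a Description is split across several lines. In the example above, each Code and Description block contains two codes and two descriptions. However, a Code and Description block can contain one or more codes and descriptions.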
Try the following regex approach:
import re
# inp is assumed to hold the entire file contents as a single string
blanks = re.findall(r'\bCode\b.*?\bDescription\s*?(\S+)\s+.*?\r?\n(\d+)\s+.*?(?=\r?\n\r?\n)', inp, flags=re.DOTALL)
print(blanks)
reviews = re.findall(r'\bCode\b.*?\bDescription\s*?\S+\s+(.*?)\r?\n\d+\s+(.*?)(?=\r?\n\r?\n)', inp, flags=re.DOTALL)
print(reviews)
This prints:
[('Blank', '1'), ('Blank1', '11')]
[('Not reviewed, or reviewed and corrected\n', 'Reviewed and confirmed as reported: A patient had behavior \ncode of in situ and laterality is not\nstated as right: origin of primary; left: origin of primary; or only one side \ninvolved, right or left\norigin not specified'), ('Not reviewed\n', 'A patient had laterality \ncoded non-specifically and\nextension coded specifically')]
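The idea here is to match and capture only the individual parts we need from each Code ... Description ... Blank section of the input text. Note that this answer assumes you have already read the text into a Python string variable.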
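The patterns above capture exactly two entries per block, and the second code must be numeric. As a rough sketch of how the same idea might extend to blocks with any number of entries, the following uses two patterns: one to isolate each block, one to split a block into entries. It assumes that every code is Blank, Blank followed by digits, or a number, and that no continuation line starts with such a token followed by whitespace. The names BLOCK, ENTRY, parse_blocks and the sample text are made up for illustration.

```python
import re

# Hypothetical sample that mimics the layout described in the question.
sample = (
    "Intro text for the first field.\n"
    "\n"
    "Code   Description\n"
    "Blank   Not reviewed, or reviewed and corrected\n"
    "1   Reviewed and confirmed as reported: A patient had behavior \n"
    "code of in situ and laterality is not\n"
    "stated as right: origin of primary; left: origin of primary; or only one side \n"
    "involved, right or left\n"
    "origin not specified\n"
    "\n"
    "Intro text for the second field.\n"
    "\n"
    "Code   Description\n"
    "Blank1   Not reviewed\n"
    "11   A patient had laterality \n"
    "coded non-specifically and\n"
    "extension coded specifically\n"
    "\n"
)

# Body of a block: everything after the header line, up to the next blank line or end of text.
BLOCK = re.compile(
    r"^Code[ \t]+Description[ \t]*\r?\n(.*?)(?=\r?\n[ \t]*\r?\n|\Z)",
    re.DOTALL | re.MULTILINE,
)
# One entry: a code at the start of a line, then text up to the next code line or end of body.
ENTRY = re.compile(
    r"^(Blank\d*|\d+)[ \t]+(.*?)(?=^(?:Blank\d*|\d+)[ \t]+|\Z)",
    re.DOTALL | re.MULTILINE,
)

def parse_blocks(text):
    """Return (codes, descriptions) for every Code/Description block in text."""
    codes, descriptions = [], []
    for body in BLOCK.findall(text):
        for code, desc in ENTRY.findall(body):
            codes.append(code)
            # Collapse line breaks and repeated spaces inside a description.
            descriptions.append(" ".join(desc.split()))
    return codes, descriptions

codes, descriptions = parse_blocks(sample)
print(codes)
print(descriptions)
```

If a description line in the real file can begin with a number, the ENTRY pattern would treat it as a new code, so a column-based split may be the safer choice there.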
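We can solve this with an algorithm/state machine. The following code opens a file named "datafile.txt" in the same directory as the Python script, parses it, and prints the results. The key to the algorithm is the assumption that blank lines appear only between fields, and that on any line that starts a description we want to record, the code is separated from its description by three or more spaces. From what I can tell from your file snippet, these assumptions always hold.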
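index = -1                 # position of the entry currently being built
record = False             # True once a "Code   Description" header has been seen
description_block = False  # True while reading continuation lines of a description
codes = []
descriptions = []

with open("datafile.txt", "r") as file:
    for line in file:
        # Split on runs of three spaces; a blank line becomes [""]
        line = [portion.strip() for portion in line.split("   ") if portion != ""]
        if record:
            if len(line) == 2:
                # "code   description": start a new entry
                index += 1
                codes.append(line[0])
                descriptions.append(line[1])
            else:
                if line[0]:
                    description_block = True
                if description_block:
                    if not line[0]:
                        # Blank line: the block is finished
                        description_block = False
                        record = False
                        continue
                    else:
                        # Continuation of the current description
                        descriptions[index] += " " + line[0]
        if line[0] == "Code":
            record = True

print("codes:", codes)
print("descriptions:", descriptions)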
Result:
codes: ['Blank', '1', 'Blank1', '11']
descriptions: ['Not reviewed, or reviewed and corrected', 'Reviewed and confirmed as reported: A patient had behavior code of in situ and laterality is not stated as right: origin of primary; left: origin of primary; or only one side involved, right or left origin not specified', 'Not reviewed', 'A patient had laterality coded non-specifically and extension coded specifically']
Tested in Python 3.8.2.
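For experimenting without a file on disk, the same state machine could be wrapped in a function that takes any iterable of lines. This is only a sketch under the same assumptions (a blank line ends a block, and three or more whitespace characters separate a code from its description). The function name parse_codes and the in-memory sample are hypothetical.

```python
import io
import re

def parse_codes(lines):
    """Collect (codes, descriptions) from an iterable of text lines."""
    codes, descriptions = [], []
    in_block = False  # True between a header line and the next blank line
    for raw in lines:
        line = raw.strip()
        if not in_block:
            # Only a "Code   Description" header line opens a block.
            in_block = line.split() == ["Code", "Description"]
            continue
        if not line:
            # A blank line closes the block.
            in_block = False
            continue
        parts = re.split(r"\s{3,}", line, maxsplit=1)
        if len(parts) == 2:
            # "code   description" starts a new entry.
            codes.append(parts[0])
            descriptions.append(parts[1])
        elif descriptions:
            # Any other line inside a block continues the latest description.
            descriptions[-1] += " " + line
    return codes, descriptions

# Hypothetical sample standing in for the real data file.
sample = io.StringIO(
    "Intro text for the first field.\n"
    "\n"
    "Code   Description\n"
    "Blank   Not reviewed, or reviewed and corrected\n"
    "1   Reviewed and confirmed as reported: A patient had behavior \n"
    "code of in situ and laterality is not\n"
    "\n"
    "Intro text for the second field.\n"
    "\n"
    "Code   Description\n"
    "Blank1   Not reviewed\n"
    "11   A patient had laterality \n"
    "coded non-specifically and\n"
    "extension coded specifically\n"
)
codes, descriptions = parse_codes(sample)
print("codes:", codes)
print("descriptions:", descriptions)
```

Because an open file is itself an iterable of lines, a call such as parse_codes(open("datafile.txt")) should work on the real file as well.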
Edit: updated the code to reflect the entire data file provided in the comments.
column_separator = "   "  # codes and descriptions are separated by three or more spaces
index = -1                # position of the entry currently being built
record = False            # True while inside a Code/Description block
block_exit = False
break_on_newline = False  # True after one blank line has been seen inside a block
codes = []
descriptions = []
templine = ""             # text collected after a blank line, not yet attached to an entry

def add(line):
    """Start a new entry from a [code, description] pair."""
    global index
    index += 1
    codes.append(line[0])
    descriptions.append(line[1])

with open("test", "r", encoding="utf-8") as file:
    while True:
        line = file.readline()
        if not line:
            break
        if record:
            line = [portion.strip() for portion in line.split(column_separator) if portion != ""]
            if len(line) > 1:
                add(line)
            else:
                if block_exit:
                    record = False
                    block_exit = False
                else:
                    if line[0]:
                        # Continuation of the current description
                        descriptions[index] += " " + line[0]
                    else:
                        # Blank line inside a block: look ahead to see whether the block continues
                        while True:
                            line = [portion.strip() for portion in file.readline().split(column_separator) if portion != ""]
                            if not line:
                                break
                            if len(line) > 1:
                                if templine:
                                    descriptions[index] += templine
                                    templine = ""
                                add(line)
                                break
                            else:
                                print(line)  # debug output
                                if line[0] and "Instructions" not in line[0]:
                                    templine += " " + line[0]
                                else:
                                    if break_on_newline:
                                        # Second blank (or "Instructions") line: leave the block
                                        break_on_newline = False
                                        record = False
                                        templine = ""
                                        break
                                    else:
                                        templine += " " + line[0]
                                        break_on_newline = True
        else:
            # Match the header line regardless of how many spaces separate the two words
            if line.split() == ["Code", "Description"]:
                record = True

print("codes:", codes)
print("\n")
print("descriptions:", descriptions)
# for i in range(len(codes)):
#     print(codes[i]+"\t\t", descriptions[i])
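The commented-out loop at the end prints each code next to its description. A small sketch of the same pairing with zip, using the values from the result shown earlier, might look like this. The name pairs is made up for illustration.

```python
codes = ["Blank", "1", "Blank1", "11"]
descriptions = [
    "Not reviewed, or reviewed and corrected",
    "Reviewed and confirmed as reported: A patient had behavior code of in situ "
    "and laterality is not stated as right: origin of primary; left: origin of "
    "primary; or only one side involved, right or left origin not specified",
    "Not reviewed",
    "A patient had laterality coded non-specifically and extension coded specifically",
]

# zip pairs the two lists by position; the lists are built in lockstep, so they have equal length.
pairs = list(zip(codes, descriptions))

for code, description in pairs:
    print(f"{code}\t\t{description}")
```

A list of pairs is used here rather than a dict because codes such as Blank or 1 are likely to repeat across blocks in the full file, and a dict would keep only the last description for each.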