I'm trying to parse a newline-delimited text file into blocks of lines, which are appended to a .txt file. I want to be able to grab x number of lines after the end string, because the content of those lines varies, meaning that setting an 'end string' to try to match them would miss lines.
File example:
"Start"
"..."
"..."
"..."
"..."
"---" ##End here
"xxx" ##Unique data here
"xxx" ##And here
Here is the code:
first = "Start"
first_end = "---"
with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    copy = False
    for line in infile:
        if line.strip().startswith(first):
            copy = True
            outfile.write(line)
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            ##Want to also write next 2 lines here
        elif copy:
            outfile.write(line)
Is there a way to do this using for line in infile, or do I need to use a different kind of loop?
You can use next or readline (in Python 3 and later) to retrieve the next line of the file:
elif line.strip().startswith(first_end):
    copy = False
    outfile.write(line)
    outfile.write(next(infile))
    outfile.write(next(infile))
Or:
#note: not compatible with Python 2.7 and below
elif line.strip().startswith(first_end):
    copy = False
    outfile.write(line)
    outfile.write(infile.readline())
    outfile.write(infile.readline())
This also advances the file pointer by two lines, so subsequent iterations of for line in infile: will skip the two lines you read with readline.
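As a quick illustration of that side effect, here is a minimal sketch using an in-memory iterator in place of a file (the sample data is made up):

```python
lines = iter(["Start\n", "a\n", "---\n", "x1\n", "x2\n", "tail\n"])

seen = []
for line in lines:
    seen.append(line)
    if line.strip() == "---":
        # next() consumes items from the same iterator the for loop
        # is driving, so the loop never revisits these two lines.
        seen.append(next(lines))
        seen.append(next(lines))

# Each line appears exactly once: "x1" and "x2" were consumed by
# next(), not by the for loop.
```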
Bonus terminology nitpick: file objects are not lists, and a method for accessing the x+1th element of a list may not apply to accessing the next line of a file, and vice versa. If you do want to access the next item of a proper list object, you can use enumerate, which lets you do arithmetic on the list's indices. For example:
seq = ["foo", "bar", "baz", "qux", "troz", "zort"]
#find all instances of "baz" and also the first two elements after "baz"
for idx, item in enumerate(seq):
    if item == "baz":
        print(item)
        print(seq[idx+1])
        print(seq[idx+2])
Note that unlike readline, indexing does not advance the iteration, so for idx, item in enumerate(seq): will still iterate over "qux" and "troz".
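To make that concrete, a small sketch (the sample list is mine): indexing peeks ahead without consuming anything, so the loop still visits every element.

```python
seq = ["foo", "bar", "baz", "qux", "troz", "zort"]

visited = []
for idx, item in enumerate(seq):
    visited.append(item)
    if item == "baz":
        # Plain lookups with no side effect on the iteration
        peeked = (seq[idx + 1], seq[idx + 2])

# visited == seq: unlike next() on an iterator, seq[idx + 1]
# does not remove "qux" or "troz" from the iteration.
```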
An approach that works for any iterable is to use an additional variable to track state across iterations. The advantage is that you don't have to know how to manually advance the iteration; the disadvantage is that the logic inside the loop is harder to reason about, because it exposes extra side effects.
first = "Start"
first_end = "---"
with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    copy = False
    num_items_to_write = 0
    for line in infile:
        if num_items_to_write > 0:
            outfile.write(line)
            num_items_to_write -= 1
        elif line.strip().startswith(first):
            copy = True
            outfile.write(line)
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            num_items_to_write = 2
        elif copy:
            outfile.write(line)
In the specific case of extracting repeating groups of data from a delimited file, it may be appropriate to skip iteration entirely and use regular expressions. For data like yours, that might look like:
import re

with open("testlog.log") as file:
    data = file.read()

pattern = re.compile(r"""
    ^Start$         #"Start" by itself on a line
    (?:\n.*$)*?     #zero or more lines, matched non-greedily
                    #use (?:) for all groups so `findall` doesn't capture them later
    \n---$          #"---" by itself on a line
    (?:\n.*$){2}    #exactly two lines
    """, re.MULTILINE | re.VERBOSE)
#equivalent one-line regex:
#pattern = re.compile("^Start$(?:\n.*$)*?\n---$(?:\n.*$){2}", re.MULTILINE)

for group in pattern.findall(data):
    print("Found group:")
    print(group)
    print("End of group.\n\n")
Running it on a log that looks like:
Start
foo
bar
baz
qux
---
troz
zort
alice
bob
carol
dave
Start
Fred
Barney
---
Wilma
Betty
Pebbles
...produces this output:
Found group:
Start
foo
bar
baz
qux
---
troz
zort
End of group.
Found group:
Start
Fred
Barney
---
Wilma
Betty
End of group.
The simplest way is to make a generator function that parses infile:
def read_file(file_handle, start_line, end_line, extra_lines=2):
    start = False
    while True:
        try:
            line = next(file_handle)
        except StopIteration:
            return
        if not start and line.strip().startswith(start_line):
            start = True
            yield line
        elif not start:
            continue
        elif line.strip().startswith(end_line):
            yield line
            try:
                for _ in range(extra_lines):
                    yield next(file_handle)
            except StopIteration:
                return
            start = False  # block finished; wait for the next start marker
        else:
            yield line
If you know that each file is well formed, you don't need the try-except clauses.
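One caveat worth knowing before dropping them: in Python 3.7+ (PEP 479), a StopIteration that escapes a generator is turned into a RuntimeError, so instead of removing the outer try-except you can drive the generator with a for loop and keep the well-formedness assumption only for the inner next() calls. A sketch of that variant (read_blocks and the sample log are mine, not from the original answer):

```python
import io

def read_blocks(file_handle, start_line, end_line, extra_lines=2):
    # Same logic as read_file, driven by a for loop: the loop handles
    # normal EOF, and the inner next() calls assume the file is well
    # formed (extra_lines more lines always follow end_line).
    copying = False
    for line in file_handle:
        stripped = line.strip()
        if not copying and stripped.startswith(start_line):
            copying = True
            yield line
        elif not copying:
            continue
        elif stripped.startswith(end_line):
            yield line
            for _ in range(extra_lines):
                yield next(file_handle)
            copying = False  # block finished; wait for the next start marker
        else:
            yield line

log = "junk\nStart\nfoo\n---\nx1\nx2\ntail\n"
blocks = list(read_blocks(io.StringIO(log), "Start", "---"))
# blocks == ["Start\n", "foo\n", "---\n", "x1\n", "x2\n"]
```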
You can use this generator like so:
if __name__ == "__main__":
    first = "Start"
    first_end = "---"
    with open("testlog.log") as infile, open("parsed.txt", "a") as outfile:
        output = read_file(
            file_handle=infile,
            start_line=first,
            end_line=first_end,
            extra_lines=1,
        )
        outfile.writelines(output)
A variant of @Kevin's answer with a three-state variable and less code duplication.
first = "Start"
first_end = "---"
# Lines to read after end flag
extra_count = 2

with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    # Do not copy by default
    copy = 0
    for line in infile:
        # Strip once only
        clean_line = line.strip()
        # Enter "infinite copy" state
        if clean_line.startswith(first):
            copy = -1
        # Copy next line and extra amount
        elif clean_line.startswith(first_end):
            copy = extra_count + 1
        # If in a "must-copy" state
        if copy != 0:
            # One less line to copy if end flag passed
            if copy > 0:
                copy -= 1
            # Copy current line
            outfile.write(line)
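The three states can be exercised in isolation by swapping the files for io.StringIO buffers (the sample log here is made up):

```python
import io

first = "Start"
first_end = "---"
extra_count = 2

log = io.StringIO("Start\nfoo\n---\nx1\nx2\ntail\n")
out = io.StringIO()

copy = 0
for line in log:
    clean_line = line.strip()
    if clean_line.startswith(first):
        copy = -1               # infinite-copy state
    elif clean_line.startswith(first_end):
        copy = extra_count + 1  # the end line itself plus the extras
    if copy != 0:
        if copy > 0:
            copy -= 1           # count down only in the finite state
        out.write(line)

# out.getvalue() == "Start\nfoo\n---\nx1\nx2\n"; "tail" is not copied
```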