Accessing the x + 1 element when using 'for x in list' in Python

Question. Votes: 2. Answers: 3.

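I am trying to parse a newline-delimited text file into blocks of lines, which are appended to a .txt file. I would like to be able to grab x lines after the end string, because the content of those lines varies, so setting an 'end string' that tries to match them would miss lines.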

Example of the file:

"Start"
"..."
"..."
"..."
"..."
"---" ##End here
"xxx" ##Unique data here
"xxx" ##And here

Here is the code:

first = "Start"
first_end = "---"

with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    copy = False
    for line in infile:
        if line.strip().startswith(first):
            copy = True
            outfile.write(line)
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            ##Want to also write next 2 lines here
        elif copy:
            outfile.write(line)

Is there a way to do this using for line in infile, or do I need to use a different type of loop?

python
3 answers
5 votes

You can use next, or readline (in Python 3 and up), to retrieve the next line of the file:

    elif line.strip().startswith(first_end):
        copy = False
        outfile.write(line)
        outfile.write(next(infile))
        outfile.write(next(infile))

or

    #note: not compatible with Python 2.7 and below
    elif line.strip().startswith(first_end):
        copy = False
        outfile.write(line)
        outfile.write(infile.readline())
        outfile.write(infile.readline())

Either way, this also advances the file pointer by two extra lines, so the next iteration of for line in infile: skips the two lines you just read.
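To see the skipping behavior concretely, here is a minimal sketch that uses io.StringIO as a stand-in for the log file (the sample lines are made up for illustration). It also passes a default to next() so that a file ending right after the "---" line does not raise StopIteration; that default is an addition, not part of the snippets above.

```python
import io

# In-memory stand-in for testlog.log; the contents are illustrative only.
infile = io.StringIO("Start\na\n---\nx1\nx2\nignored\n")

seen_by_loop = []  # lines the for loop itself receives
written = []       # lines that would go to the output file
copy = False
for line in infile:
    seen_by_loop.append(line.strip())
    if line.strip().startswith("Start"):
        copy = True
        written.append(line)
    elif line.strip().startswith("---"):
        copy = False
        written.append(line)
        # next() advances the same iterator the for loop is using.
        # The "" default avoids StopIteration if the file ends early.
        written.append(next(infile, ""))
        written.append(next(infile, ""))
    elif copy:
        written.append(line)

print(seen_by_loop)
print("".join(written), end="")
```

The loop never receives "x1" or "x2" directly, because next() has already consumed them by the time the loop asks for its next line.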


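Bonus terminology nitpick: a file object is not a list, and techniques for accessing the x + 1'th element of a list probably won't work for accessing the next line of a file, and vice versa. If you do want to access the next item of a proper list object, you can use enumerate, which lets you do arithmetic on the list index. For example: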

seq = ["foo", "bar", "baz", "qux", "troz", "zort"]

#find all instances of "baz" and also the first two elements after "baz"
for idx, item in enumerate(seq):
    if item == "baz":
        print(item)
        print(seq[idx+1])
        print(seq[idx+2])

Note that, unlike readline, indexing does not advance the iterator, so for idx, item in enumerate(seq): will still iterate over "qux" and "troz".
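One caveat with index arithmetic: seq[idx+1] raises IndexError when the match falls within the last two items of the list. A small sketch, not from the original answer, that uses slicing instead, since a slice that runs past the end of a list is simply truncated (the sample list is extended here with a second "baz" near the end to show this):

```python
seq = ["foo", "bar", "baz", "qux", "troz", "baz", "zort"]

# Collect each "baz" together with up to two following items.
# Slicing never raises IndexError; it returns fewer items near the end.
groups = [seq[idx:idx + 3] for idx, item in enumerate(seq) if item == "baz"]

print(groups)
```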


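An approach that works for any iterable is to use an additional variable to track state across iterations. The advantage is that you don't need to know how to advance the iterable manually; the disadvantage is that the logic inside the loop is harder to reason about, because it carries extra mutable state from one iteration to the next.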

first = "Start"
first_end = "---"

with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    copy = False
    num_items_to_write = 0
    for line in infile:
        if num_items_to_write > 0:
            outfile.write(line)
            num_items_to_write -= 1
        elif line.strip().startswith(first):
            copy = True
            outfile.write(line)
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            num_items_to_write = 2
        elif copy:
            outfile.write(line)

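In the specific case of extracting repeated groups of data from a delimited file, it may be appropriate to skip iteration entirely and use a regular expression instead. For data like yours, that might look like: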

import re

with open("testlog.log") as file:
    data = file.read()

pattern = re.compile(r"""
^Start$                 #"Start" by itself on a line
(?:\n.*$)*?             #zero or more lines, matched non-greedily
                        #use (?:) for all groups so `findall` doesn't capture them later
\n---$                  #"---" by itself on a line
(?:\n.*$){2}            #exactly two lines
""", re.MULTILINE | re.VERBOSE)

#equivalent one-line regex:
#pattern = re.compile("^Start$(?:\n.*$)*?\n---$(?:\n.*$){2}", re.MULTILINE)

for group in pattern.findall(data):
    print("Found group:")
    print(group)
    print("End of group.\n\n")

When run on a log that looks like:

Start
foo
bar
baz
qux
---
troz
zort
alice
bob
carol
dave
Start
Fred
Barney
---
Wilma
Betty
Pebbles

...this produces the output:

Found group:
Start
foo
bar
baz
qux
---
troz
zort
End of group.


Found group:
Start
Fred
Barney
---
Wilma
Betty
End of group.

2 votes

The simplest way is to write a generator function that parses infile:

def read_file(file_handle, start_line, end_line, extra_lines=2):
    start = False
    while True:
        try:
            line = next(file_handle)
        except StopIteration:
            return

        if not start and line.strip().startswith(start_line):
            start = True
            yield line
        elif not start:
            continue
        elif line.strip().startswith(end_line):
            # Stop copying once the end marker and its extra lines are out
            start = False
            yield line
            try:
                for _ in range(extra_lines):
                    yield next(file_handle)
            except StopIteration:
                return
        else:
            yield line

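If you know every file is well-formed, you can drop the inner try-except around the extra lines. Keep the outer one: it is what ends the generator at end of file.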
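A caveat worth checking on Python 3.7 and later: since PEP 479, a StopIteration that escapes a generator body is converted to RuntimeError, so a try/except around a bare next() call inside a generator is doing real work when the input runs out. A minimal sketch with hypothetical helper names (unguarded and guarded are not part of the answer):

```python
def unguarded(iterator):
    # No try/except: next() raises StopIteration when the input runs out.
    while True:
        yield next(iterator)


def guarded(iterator):
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            return  # ends the generator cleanly
        yield item


guarded_result = list(guarded(iter(["a\n", "b\n"])))
print(guarded_result)

error_message = None
try:
    list(unguarded(iter(["a\n", "b\n"])))
except RuntimeError as exc:
    error_message = str(exc)
    print("RuntimeError:", error_message)
```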

You can use this generator like this:

if __name__ == "__main__":
    first = "Start"
    first_end = "---"

    with open("testlog.log") as infile, open("parsed.txt", "a") as outfile:
        output = read_file(
            file_handle=infile,
            start_line=first,
            end_line=first_end,
            extra_lines=2,  # the question asks for two lines after the end marker
        )
        outfile.writelines(output)
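
As an alternative sketch (not the answer's code), the same idea can be written with a for loop over the handle and itertools.islice to pull the extra lines, which avoids the explicit try/except. The name iter_blocks and the sample data are hypothetical, and this version stops copying after the extra lines, as the loop in the question does.

```python
import io
from itertools import islice


def iter_blocks(lines, start_line, end_line, extra_lines=2):
    """Yield lines from start_line through end_line plus extra_lines more."""
    lines = iter(lines)  # ensure islice and the for loop share one iterator
    copying = False
    for line in lines:
        stripped = line.strip()
        if not copying and stripped.startswith(start_line):
            copying = True
            yield line
        elif copying and stripped.startswith(end_line):
            copying = False
            yield line
            # islice stops quietly if fewer than extra_lines remain.
            yield from islice(lines, extra_lines)
        elif copying:
            yield line


# In-memory stand-in for the log file; the contents are illustrative only.
sample = io.StringIO(
    "skip\nStart\nfoo\n---\ntroz\nzort\nalice\nStart\nFred\n---\nWilma\n"
)
result = list(iter_blocks(sample, "Start", "---"))
print("".join(result), end="")
```

Because islice simply yields fewer items when the input ends early, the truncated second block ("Wilma" only) is handled without any exception handling.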

1 vote

A variation on @Kevin's answer, with a three-state variable and less code duplication.

first = "Start"
first_end = "---"
# Lines to read after end flag
extra_count = 2

with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    # Do not copy by default
    copy = 0

    for line in infile:
        # Strip once only
        clean_line = line.strip()

        # Enter "infinite copy" state
        if clean_line.startswith(first):
            copy = -1

        # Copy next line and extra amount
        elif clean_line.startswith(first_end):
            copy = extra_count + 1

        # If in a "must-copy" state
        if copy != 0:
            # One less line to copy if end flag passed
            if copy > 0:
                copy -= 1
            # Copy current line
            outfile.write(line)
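
To check the counter logic without touching files, here is a sketch that wraps the same three-state idea in a function taking any iterable of lines. The function name select_lines and the sample data are assumptions for illustration.

```python
def select_lines(lines, first="Start", first_end="---", extra_count=2):
    """Return the lines the loop above would write, using the same counter."""
    selected = []
    copy = 0  # 0: skip, -1: copy until the end flag, >0: lines left to copy
    for line in lines:
        clean_line = line.strip()
        if clean_line.startswith(first):
            copy = -1
        elif clean_line.startswith(first_end):
            copy = extra_count + 1
        if copy != 0:
            if copy > 0:
                copy -= 1
            selected.append(line)
    return selected


log = ["Start\n", "foo\n", "---\n", "troz\n", "zort\n", "alice\n",
       "Start\n", "Fred\n", "---\n", "Wilma\n", "Betty\n", "Pebbles\n"]
selected = select_lines(log)
print("".join(selected), end="")
```

One design consequence to be aware of: because the start and end checks run on every line, an extra line that itself begins with "Start" or "---" would reset the counter.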