Extracting event pairs from multi-line text


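I want to extract event pairs (start and end, marked by + and -). However, the pairs may be unmatched: a start event can occur twice before a single end event.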

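In the example below, the start of event B occurs twice, so I expect the unmatched start to be output as a pair with nil in place of the end event that was never found.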

import re
import pandas as pd

data = """
00:00:00 +running A
00:00:01 -running
00:00:02 +running B
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
00:00:06 -running
00:00:07 +running D
10:00:08 -running

"""
m = re.findall(r"(\d+:\d+:\d+) \+running (\w+).*?(\d+:\d+:\d+) \-running",data,re.DOTALL)
print(len(m))
df = pd.DataFrame(m,columns=['ts1','name','ts2'])
print(df)

Current output:

        ts1 name       ts2
0  00:00:00    A  00:00:01
1  00:00:02    B  00:00:04
2  00:00:05    C  00:00:06
3  00:00:07    D  10:00:08
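The missing row seems to come from `.*?` combined with `re.DOTALL`: the lazy wildcard may cross line breaks, so it skips over the second `+running B` line until it reaches the first `-running`. A minimal sketch that reproduces this on a shortened input (the `sample` string is an excerpt chosen for illustration):

```python
import re

# Only the three lines around event B, taken from the data above.
sample = "00:00:02 +running B\n00:00:03 +running B\n00:00:04 -running\n"
pattern = r"(\d+:\d+:\d+) \+running (\w+).*?(\d+:\d+:\d+) \-running"

match = re.search(pattern, sample, re.DOTALL)

# The first start is paired with the end that belongs to the second start.
print(match.groups())        # ('00:00:02', 'B', '00:00:04')

# The matched text swallows the second start line, so findall never sees it.
print("00:00:03" in match.group(0))  # True
```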

Expected:

        ts1 name       ts2
0  00:00:00    A  00:00:01
1  00:00:02    B  NA
2  00:00:03    B  00:00:04
3  00:00:05    C  00:00:06
4  00:00:07    D  10:00:08

What is the correct way to get this result in Python? I don't mind whether findall is used or not.

python pandas
1 Answer

Try:

import pandas as pd

data = """
00:00:00 +running A
00:00:01 -running
00:00:02 +running B
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
00:00:06 -running
00:00:07 +running D
10:00:08 -running

"""


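def get_columns(data):
    # Pending start events, most recent last.
    stack = []
    for time, val in data:
        if val.startswith("+"):
            stack.append((time, val))
        elif val.startswith("-") and stack:
            # An end event closes the most recent pending start.
            t, v = stack.pop()
            yield t, v, time

    # Starts still on the stack never saw an end event.
    for time, val in stack:
        yield time, val, None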


all_data = []
for line in map(str.strip, data.splitlines()):
    if line == "":
        continue
    all_data.append(line.split(maxsplit=1))

df = pd.DataFrame(get_columns(all_data), columns=["ts1", "name", "ts2"]).sort_values(
    "ts1"
)
print(df)

Prints:

        ts1        name       ts2
0  00:00:00  +running A  00:00:01
4  00:00:02  +running B      None
1  00:00:03  +running B  00:00:04
2  00:00:05  +running C  00:00:06
3  00:00:07  +running D  10:00:08
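The output above keeps the full "+running B" text in the name column and relies on sort_values to restore the order. Below is a minimal sketch of a variant that yields only the event name and emits rows in input order. It assumes events never nest, so a new start while another is pending means the pending one never ended. `pair_events` and `LINE_RE` are hypothetical names introduced here, not part of the answer above.

```python
import re

# Timestamp, sign, and (for start lines only) the event name.
LINE_RE = re.compile(r"(\d+:\d+:\d+) ([+-])running(?: (\w+))?")


def pair_events(text):
    """Pair each '+running NAME' line with the next '-running' line."""
    rows = []
    pending = None  # (timestamp, name) of the start still waiting for an end
    for ts, sign, name in LINE_RE.findall(text):
        if sign == "+":
            if pending is not None:
                # Assumption: no nesting, so the earlier start has no end.
                rows.append((*pending, None))
            pending = (ts, name)
        elif pending is not None:
            rows.append((*pending, ts))
            pending = None
        # An end event with no pending start is ignored.
    if pending is not None:
        rows.append((*pending, None))
    return rows


data = """
00:00:00 +running A
00:00:01 -running
00:00:02 +running B
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
00:00:06 -running
00:00:07 +running D
10:00:08 -running
"""

rows = pair_events(data)
for row in rows:
    print(row)
```

The rows can then be passed to pd.DataFrame(rows, columns=["ts1", "name", "ts2"]) without sorting, since unmatched starts are emitted at the point where they are detected.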