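I want to extract event pairs (a start and an end, marked by + and -). The pairs may be unmatched, though: a start event can occur twice before a single end event.
In the example below, event B starts twice, so when no end event is found for a start I want the output to include that unmatched start with nil in place of the end time.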
import re
import pandas as pd
data = """
00:00:00 +running A
00:00:01 -running
00:00:02 +running B
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
00:00:06 -running
00:00:07 +running D
10:00:08 -running
"""
m = re.findall(r"(\d+:\d+:\d+) \+running (\w+).*?(\d+:\d+:\d+) \-running",data,re.DOTALL)
print(len(m))
df = pd.DataFrame(m,columns=['ts1','name','ts2'])
print(df)
Current output:
ts1 name ts2
0 00:00:00 A 00:00:01
1 00:00:02 B 00:00:04
2 00:00:05 C 00:00:06
3 00:00:07 D 10:00:08
Expected:
ts1 name ts2
0 00:00:00 A 00:00:01
1 00:00:02 B NA
2 00:00:03 B 00:00:04
3 00:00:05 C 00:00:06
4 00:00:07 D 10:00:08
What is the right way to get this result in Python? I don't mind whether findall is used or not.
Try:
import pandas as pd
data = """
00:00:00 +running A
00:00:01 -running
00:00:02 +running B
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
00:00:06 -running
00:00:07 +running D
10:00:08 -running
"""
def get_columns(data):
    # Stack of start events that have not been closed yet.
    stack = []
    for time, val in data:
        if val.startswith("+"):
            stack.append((time, val))
        elif val.startswith("-") and stack:
            # An end event closes the most recent open start.
            t, v = stack.pop()
            yield t, v, time
    # Starts still on the stack never saw an end event.
    for time, val in stack:
        yield time, val, None


all_data = []
for line in map(str.strip, data.splitlines()):
    if line == "":
        continue
    # Split each line into [timestamp, rest of the line].
    all_data.append(line.split(maxsplit=1))

df = pd.DataFrame(get_columns(all_data), columns=["ts1", "name", "ts2"]).sort_values(
    "ts1"
)
print(df)
Prints:
ts1 name ts2
0 00:00:00 +running A 00:00:01
4 00:00:02 +running B None
1 00:00:03 +running B 00:00:04
2 00:00:05 +running C 00:00:06
3 00:00:07 +running D 10:00:08
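The name column above still carries the "+running " prefix, while the expected output in the question shows only the event name. A possible variant is sketched below. It is not the approach used above: instead of a stack it keeps a single pending start, on the assumption that events never nest, so a second + before a - means the earlier start was never closed. The names LINE_RE and pair_events are made up for this illustration.

```python
import re

data = """
00:00:00 +running A
00:00:01 -running
00:00:02 +running B
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
00:00:06 -running
00:00:07 +running D
10:00:08 -running
"""

# Hypothetical helper pattern: timestamp, sign, and an optional event name.
LINE_RE = re.compile(r"(\d+:\d+:\d+) ([+-])running(?: (\w+))?")


def pair_events(text):
    """Yield (start, name, end) tuples; end is None for an unmatched start.

    Assumption: events do not nest, so a new start closes out any
    start that is still pending.
    """
    pending = None  # the most recent start that has no end yet
    for ts, sign, name in LINE_RE.findall(text):
        if sign == "+":
            if pending is not None:
                # The previous start never got an end event.
                yield pending[0], pending[1], None
            pending = (ts, name)
        elif pending is not None:
            yield pending[0], pending[1], ts
            pending = None
    if pending is not None:
        # A trailing start with no end event.
        yield pending[0], pending[1], None


rows = list(pair_events(data))
for row in rows:
    print(row)
```

For the sample data this yields five rows in input order, with the first B start (00:00:02) paired with None, so no sorting is needed. The rows can be passed to pd.DataFrame(rows, columns=["ts1", "name", "ts2"]) in the same way as above.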