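I have the following table, which is the output of task-spooler.
It is easy for a human to parse, but I am having trouble reading it into a pandas DataFrame.
Any ideas?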
ID State Output E-Level Times(r/u/s) Command [run=1/2]
6 running /tmp/ts-out.FzVneG [l1]python infloop.py
0 finished /tmp/ts-out.ixWHm2 0 0.00/0.00/0.00 bash -c echo 1
1 finished /tmp/ts-out.ZzwS11 0 0.00/0.00/0.00 bash -c echo 1
2 finished /tmp/ts-out.GJlyge 2 0.00/0.00/0.00 bash -c
4 finished /tmp/ts-out.lIVMYH 2 0.00/0.00/0.00 bash -c -h
5 finished /tmp/ts-out.8EKHy1 -1 141.23/0.00/0.00 python infloop.py
3 finished /tmp/ts-out.lBr4Wy -1 2545.36/0.00/0.02 bash -c python infloop.py
7 finished /tmp/ts-out.kxCczi 2 0.01/0.00/0.00 bash -c
8 finished /tmp/ts-out.3VkfNh 0 0.00/0.00/0.00 echo
9 finished /tmp/ts-out.8ewxzl 0 0.01/0.00/0.00 echo
10 finished /tmp/ts-out.ahSLaY 0 0.00/0.00/0.00 bash -c echo $GPUID
11 finished /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0 0.00/0.00/0.00 bash -c ls
12 finished /tmp/ts-out.ADWkve 0 0.00/0.00/0.00 bash -c ls
13 finished /a/home/cc/cs/yuvval/tmp/ts-out.xm0jtn -1 130.67/0.00/0.02 bash -c python infloop.py
14 finished /tmp/ts-out.HxBqkm 0 0.00/0.00/0.00 bash -c echo 11
15 finished /tmp/ts-out.ERNuaE 0 0.00/0.00/0.00 bash -c echo
16 finished /tmp/ts-out.9j6hkS 0 0.00/0.00/0.00 bash -c echo $GPUID
17 finished /tmp/ts-out.Y5QDNa 0 0.00/0.00/0.00 bash -c echo $GPUID
18 finished /tmp/ts-out.EIHhoX -1 0.00/0.00/0.00 %s
19 finished /tmp/ts-out.LLw2Wl -1 0.00/0.00/0.00
20 finished /tmp/ts-out.deWAJR -1 0.01/0.00/0.00 echo $GPUID
21 finished /tmp/ts-out.AdZFIf -1 0.00/0.00/0.00 echo 12
22 finished /tmp/ts-out.NBOCVv 0 0.00/0.00/0.00 echo 12
23 finished /tmp/ts-out.5WpfPu 0 0.00/0.00/0.00 echo
24 finished /tmp/ts-out.1lw4bS -1 0.00/0.00/0.00 echo
25 finished /tmp/ts-out.7MNGLQ 0 0.00/0.00/0.00 bash -c echo $GPUID
26 finished /tmp/ts-out.8FZ3on 0 0.00/0.00/0.00 bash -c echo $GPUID
My best attempt was:
from io import StringIO  # Python 3; on Python 2: from StringIO import StringIO
import pandas as pd
std = ...  # the table text
pd.read_table(StringIO(std), sep=r'\s+', engine='python')
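The error:
ValueError: Expected 7 fields in line 2, saw 9
The source code that generates the table is available. Below is the code that produces each row. Could this help in reading the table into a DataFrame?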
if (p->label)
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s[%s]"
             "%s\n",
             p->jobid,
             jobstate,
             output_filename,
             p->result.errorlevel,
             p->result.real_ms,
             p->result.user_ms,
             p->result.system_ms,
             dependstr,
             p->label,
             p->command);
else
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s%s\n",
             p->jobid,
             jobstate,
             output_filename,
             p->result.errorlevel,
             p->result.real_ms,
             p->result.user_ms,
             p->result.system_ms,
             dependstr,
             p->command);
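This is a bit annoying: because the delimiters in the output log are inconsistent (sometimes several spaces, sometimes a tab, and usually just a single space before the last column), the file is hard to parse with pandas unless you apply some extra logic to it first. Personally, I don't like opening a file in Python just to fix it up and then loading it with pandas, so I simply add a short sed command to my pipeline before loading the file in Python (this assumes you are on Linux and the log text is loaded from a file).
You can add: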
cat logfile.log | sed -r 's/\s\s+/,/g' | sed -e 's/\([[:digit:]]\.[[:digit:]]\{2\}\) /\1,/' > logfile.csv
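This replaces every run of two or more whitespace characters with a comma, and then does the same for the one remaining problematic space: the single space between the Times field and the command. The text then goes from: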
ID State Output E-Level Times(r/u/s) Command [run=1/2]
6 running /tmp/ts-out.FzVneG [l1]python infloop.py
0 finished /tmp/ts-out.ixWHm2 0 0.00/0.00/0.00 bash -c echo 1
1 finished /tmp/ts-out.ZzwS11 0 0.00/0.00/0.00 bash -c echo 1
2 finished /tmp/ts-out.GJlyge 2 0.00/0.00/0.00 bash -c
4 finished /tmp/ts-out.lIVMYH 2 0.00/0.00/0.00 bash -c -h
5 finished /tmp/ts-out.8EKHy1 -1 141.23/0.00/0.00 python infloop.py
3 finished /tmp/ts-out.lBr4Wy -1 2545.36/0.00/0.02 bash -c python infloop.py
7 finished /tmp/ts-out.kxCczi 2 0.01/0.00/0.00 bash -c
8 finished /tmp/ts-out.3VkfNh 0 0.00/0.00/0.00 echo
to this:
ID,State,Output,E-Level,Times(r/u/s),Command [run=1/2]
6,running,/tmp/ts-out.FzVneG,[l1]python infloop.py
0,finished,/tmp/ts-out.ixWHm2,0,0.00/0.00/0.00,bash -c echo 1
1,finished,/tmp/ts-out.ZzwS11,0,0.00/0.00/0.00,bash -c echo 1
2,finished,/tmp/ts-out.GJlyge,2,0.00/0.00/0.00,bash -c
4,finished,/tmp/ts-out.lIVMYH,2,0.00/0.00/0.00,bash -c -h
5,finished,/tmp/ts-out.8EKHy1,-1,141.23/0.00/0.00,python infloop.py
3,finished,/tmp/ts-out.lBr4Wy,-1,2545.36/0.00/0.02,bash -c python infloop.py
7,finished,/tmp/ts-out.kxCczi,2,0.01/0.00/0.00,bash -c
8,finished,/tmp/ts-out.3VkfNh,0,0.00/0.00/0.00,echo
Then load it into pandas as a CSV:
import pandas as pd
my_df = pd.read_csv('logfile.csv')  # the file written by the sed pipeline
I apologize that this is not a fun pure-Python solution, but in my opinion the bash part makes the Python part much easier.
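For readers who would rather stay in Python, here is a rough sketch of an alternative that follows the snprintf format from the question instead of guessing at delimiters. It is not part of the sed approach above. The names ROW and parse_ts_output are made up for illustration, and the pattern assumes that rows for unfinished jobs simply omit the E-Level and Times fields, as the "running" row in the sample does. Any dependency prefix or [label] is left inside the Command text.

```python
import re

# One job row: ID, State, Output, then an optional "E-Level Times" pair,
# then the rest of the line as the command (which may be empty).
# Assumption: unfinished jobs omit E-Level and Times, as in the sample.
ROW = re.compile(
    r"^\s*(?P<ID>\d+)\s+"
    r"(?P<State>\S+)\s+"
    r"(?P<Output>\S+)\s*"
    r"(?:(?P<ELevel>-?\d+)\s+"
    r"(?P<real>\d+\.\d+)/(?P<user>\d+\.\d+)/(?P<system>\d+\.\d+)\s?)?"
    r"(?P<Command>.*)$"
)


def parse_ts_output(text):
    """Return one dict per job row; the header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            # Header line, blank line, or anything not starting with a job ID.
            continue
        row = match.groupdict()
        row["ID"] = int(row["ID"])
        # Missing numeric fields stay None, which pandas shows as NaN.
        for key in ("ELevel", "real", "user", "system"):
            row[key] = float(row[key]) if row[key] is not None else None
        rows.append(row)
    return rows
```

The resulting list can then be passed to pd.DataFrame(parse_ts_output(std)), which should give one row per job. Because the command is captured as "everything after the Times field", spaces inside the command no longer matter.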