Reading a complex table with pandas ('task-spooler')


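I have the following table, which is the output of task-spooler.

It is easy for a human to parse, but I am having trouble reading it into a pandas DataFrame.

Any ideas?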

ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
1    finished   /tmp/ts-out.ZzwS11   0        0.00/0.00/0.00 bash -c echo 1
2    finished   /tmp/ts-out.GJlyge   2        0.00/0.00/0.00 bash -c
4    finished   /tmp/ts-out.lIVMYH   2        0.00/0.00/0.00 bash -c -h
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
3    finished   /tmp/ts-out.lBr4Wy   -1       2545.36/0.00/0.02 bash -c python infloop.py
7    finished   /tmp/ts-out.kxCczi   2        0.01/0.00/0.00 bash -c
8    finished   /tmp/ts-out.3VkfNh   0        0.00/0.00/0.00 echo
9    finished   /tmp/ts-out.8ewxzl   0        0.01/0.00/0.00 echo
10   finished   /tmp/ts-out.ahSLaY   0        0.00/0.00/0.00 bash -c echo $GPUID
11   finished   /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0        0.00/0.00/0.00 bash -c ls
12   finished   /tmp/ts-out.ADWkve   0        0.00/0.00/0.00 bash -c ls
13   finished   /a/home/cc/cs/yuvval/tmp/ts-out.xm0jtn -1       130.67/0.00/0.02 bash -c python infloop.py
14   finished   /tmp/ts-out.HxBqkm   0        0.00/0.00/0.00 bash -c echo 11
15   finished   /tmp/ts-out.ERNuaE   0        0.00/0.00/0.00 bash -c echo 
16   finished   /tmp/ts-out.9j6hkS   0        0.00/0.00/0.00 bash -c echo $GPUID
17   finished   /tmp/ts-out.Y5QDNa   0        0.00/0.00/0.00 bash -c echo $GPUID
18   finished   /tmp/ts-out.EIHhoX   -1       0.00/0.00/0.00 %s
19   finished   /tmp/ts-out.LLw2Wl   -1       0.00/0.00/0.00 
20   finished   /tmp/ts-out.deWAJR   -1       0.01/0.00/0.00 echo $GPUID
21   finished   /tmp/ts-out.AdZFIf   -1       0.00/0.00/0.00 echo 12
22   finished   /tmp/ts-out.NBOCVv   0        0.00/0.00/0.00 echo 12
23   finished   /tmp/ts-out.5WpfPu   0        0.00/0.00/0.00 echo
24   finished   /tmp/ts-out.1lw4bS   -1       0.00/0.00/0.00 echo 
25   finished   /tmp/ts-out.7MNGLQ   0        0.00/0.00/0.00 bash -c echo $GPUID
26   finished   /tmp/ts-out.8FZ3on   0        0.00/0.00/0.00 bash -c echo $GPUID

My best attempt was:

import pandas as pd
from io import StringIO as sIO  # Python 3 location; the Python 2 `StringIO` module is gone
std = ...  # the table text
pd.read_table(sIO(std), sep=r'\s+', engine='python')

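Error:

ValueError: Expected 7 fields in line 2, saw 9

The source code that generates the table is available. Below is the code that produces each line. Could this help with reading the table into a DataFrame?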

if (p->label)
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s[%s]"
            "%s\n",
            p->jobid,
            jobstate,
            output_filename,
            p->result.errorlevel,
            p->result.real_ms,
            p->result.user_ms,
            p->result.system_ms,
            dependstr,
            p->label,
            p->command);
else
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s%s\n",
            p->jobid,
            jobstate,
            output_filename,
            p->result.errorlevel,
            p->result.real_ms,
            p->result.user_ms,
            p->result.system_ms,
            dependstr,
            p->command);
1 Answer

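This is a bit annoying, but because the delimiters in the output are inconsistent (sometimes several spaces, sometimes a tab, and usually just a single space before the last column), it is hard to parse with pandas without applying some extra logic to the file first. Personally, I don't like opening the file in Python to fix it and then loading it with pandas, so I just add a short sed command to my pipeline before loading the file in Python (assuming you are on Linux and the log text is loaded from a file). You can add: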

sed -r 's/\s\s+/,/g' logfile.log | sed -e 's/\([[:digit:]]\.[[:digit:]]\{2\}\) /\1,/' > logfile.csv

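This replaces every run of two or more whitespace characters with a comma, and then replaces the one remaining problematic single space (the one between the Times column and the command) with a comma as well. The text goes from: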

ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
1    finished   /tmp/ts-out.ZzwS11   0        0.00/0.00/0.00 bash -c echo 1
2    finished   /tmp/ts-out.GJlyge   2        0.00/0.00/0.00 bash -c
4    finished   /tmp/ts-out.lIVMYH   2        0.00/0.00/0.00 bash -c -h
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
3    finished   /tmp/ts-out.lBr4Wy   -1       2545.36/0.00/0.02 bash -c python infloop.py
7    finished   /tmp/ts-out.kxCczi   2        0.01/0.00/0.00 bash -c
8    finished   /tmp/ts-out.3VkfNh   0        0.00/0.00/0.00 echo

to this:

ID,State,Output,E-Level,Times(r/u/s),Command [run=1/2]
6,running,/tmp/ts-out.FzVneG,[l1]python infloop.py
0,finished,/tmp/ts-out.ixWHm2,0,0.00/0.00/0.00,bash -c echo 1
1,finished,/tmp/ts-out.ZzwS11,0,0.00/0.00/0.00,bash -c echo 1
2,finished,/tmp/ts-out.GJlyge,2,0.00/0.00/0.00,bash -c
4,finished,/tmp/ts-out.lIVMYH,2,0.00/0.00/0.00,bash -c -h
5,finished,/tmp/ts-out.8EKHy1,-1,141.23/0.00/0.00,python infloop.py
3,finished,/tmp/ts-out.lBr4Wy,-1,2545.36/0.00/0.02,bash -c python infloop.py
7,finished,/tmp/ts-out.kxCczi,2,0.01/0.00/0.00,bash -c
8,finished,/tmp/ts-out.3VkfNh,0,0.00/0.00/0.00,echo

Then load it into pandas as a CSV:

import pandas as pd
my_df = pd.read_csv("logfile.csv")  # the file written by the sed pipeline
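
If sed is not available, the same two substitutions can be approximated in Python with re.sub before handing the text to read_csv. This is only a sketch under a few assumptions: the sample text is an abbreviated copy of the table, to_csv_text is a hypothetical helper name, and commands are assumed to contain no commas or double spaces. The same limitations as the sed version apply: a row whose long output path is followed by only one space (IDs 11 and 13 in the question) is not split there, and the running job's row has fewer fields than the header, so such rows end up misaligned.

```python
import io
import re

import pandas as pd

# Abbreviated sample of the task-spooler table (assumption: real input
# would be read from a file or from the command's stdout).
raw = """\
ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
"""


def to_csv_text(text):
    """Hypothetical helper: apply the two sed-style substitutions line by line."""
    out = []
    for line in text.splitlines():
        # Runs of two or more whitespace characters become a comma.
        line = re.sub(r"\s\s+", ",", line)
        # The first "d.dd " (digit, dot, two digits, space) gets a comma
        # instead of the space; this is the end of the Times column.
        line = re.sub(r"(\d\.\d{2}) ", r"\1,", line, count=1)
        out.append(line)
    return "\n".join(out)


csv_text = to_csv_text(raw)
my_df = pd.read_csv(io.StringIO(csv_text))
print(my_df)
```

Working line by line mirrors sed's behaviour and keeps the whitespace pattern from swallowing newlines.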

I'm sorry this isn't a fun pure-Python solution, but in my opinion the bash part makes the Python part much easier.
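
For readers who do want a pure-Python route, the snprintf format string in the question suggests a regular expression with one group per field. The sketch below is one possible reading of that format, not the tool's own parser: the group names, LINE_RE, and parse_ts_output are hypothetical, and it assumes that rows for jobs that have not finished simply omit the E-Level and Times fields, as the row with ID 6 does. Because each field is matched explicitly, a long output path followed by a single space is still separated from the error level.

```python
import re

import pandas as pd

# One group per field of the snprintf format: id, state, output file,
# an optional "errorlevel real/user/system" part, then the free-form command.
LINE_RE = re.compile(
    r"^(?P<ID>\d+)\s+"
    r"(?P<State>\S+)\s+"
    r"(?P<Output>\S+)\s*"
    r"(?:(?P<ELevel>-?\d+)\s+"
    r"(?P<real>\d+\.\d+)/(?P<user>\d+\.\d+)/(?P<system>\d+\.\d+))?"
    r"\s*(?P<Command>.*)$"
)


def _num(value):
    """Convert a matched string to float; missing fields become NaN."""
    return float(value) if value is not None else float("nan")


def parse_ts_output(text):
    """Hypothetical helper: turn task-spooler list output into a DataFrame."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:  # header or blank line: does not start with a job id
            continue
        d = m.groupdict()
        rows.append({
            "ID": int(d["ID"]),
            "State": d["State"],
            "Output": d["Output"],
            "E-Level": _num(d["ELevel"]),
            "real": _num(d["real"]),
            "user": _num(d["user"]),
            "system": _num(d["system"]),
            "Command": d["Command"],
        })
    return pd.DataFrame(rows)


# Abbreviated sample taken from the table in the question.
sample = """\
ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
11   finished   /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0        0.00/0.00/0.00 bash -c ls
19   finished   /tmp/ts-out.LLw2Wl   -1       0.00/0.00/0.00
"""

df = parse_ts_output(sample)
print(df)
```

The label and dependency prefix (for example [l1]) are left inside the Command column here; splitting them out would need a further pattern and is not attempted in this sketch.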
