如何将日志文件的多行连接到1个数据帧行?
添加一行 - 示例4行日志文件:
[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE (( mr.cap_em >
0 AND mr.cap_em > 5
)) ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
version = 2.0
filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]
[INFO ][2019-03-15 12:34:55,886][DefaultListableBeanFactory] - [Overriding bean definition for bean 'cpnreq': replacing [Generic bean: class [com.ar.moves.domain.bom.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/404.jar!/com/ar/moves/moves-context.xml]] with [Generic bean: class [com.ar.bl.bom.domain.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/Tools/Tomcatv8.5-appGit-master/404.jar!/com/ar/bl/bom/bl-bom-context.xml]]]
(参见https://pastebin.com/bsmWWCgw的代表性8行提取物。)
结构很干净:
[PRIOR][datetime][ClassName] - [Msg]
但是消息通常是多行的,消息本身可能有多个括号(甚至尾随...),或者^ M换行符,但不一定......这使得解析起来很困难。不知道从哪里开始......
因此,为了处理这样的文件,并能够用以下内容读取它:
#!/usr/bin/env Rscript
df <- read.table('D:/logfile.log')
我们真的需要首先进行合并。那怎么可行?
目标是加载整个日志文件以进行图形,分析(挖出东西),并最终将其写回文件,因此 - 如果可能的话 - 应保留换行符以尊重原始格式。
预期的数据框架如下所示:
PRIOR Datetime ClassName Msg
----- ------------------- ------------------- ----------
WARN 2016-12-16 13:43:10 ConfigManagerLoader Low max...
DEBUG 2016-05-26 10:10:22 DataSourceImpl SELECT ...
而且,理想情况下,这应该在R中直接执行(?),以便我们可以“处理”实时日志文件(由服务器应用程序以写入模式打开),“àlatail -f
”。
这是一个非常邪恶的正则表达式炸弹。我建议使用stringr
包,但你可以使用grep
样式函数完成所有这些。
library(stringr)
str <- c(
'[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE (( mr.cap_em >
0 AND mr.cap_em > 5
)) ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
version = 2.0
filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]'
)
使用正则表达式,我们可以通过检查您提到的模式来拆分每一行。这个正则表达式正在检查[
,后跟任何非换行字符或换行符或回车符,然后是[
。但是使用*?
这是一种懒惰(非贪婪)的方式。重复3次,然后检查-
。最后,检查[
,然后是任何字符或包含方括号内信息的组,然后是]
。那是满口的。将其键入正则表达式计算器。只记得删除额外的反冲(在正则表达式计算器中使用\
但在使用R \\
时)。
# Split the text into each line without using \n or \r.
# pattern for each line is a lazy (non-greedy) [][][] - []
linesplit <- str %>%
# str_remove_all("\n") %>%
# str_extract_all('\\[(.|\\n|\\r)+\\]')
str_extract_all('\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\] - \\[(.|\\n|\\r|(\\[(.|\\n|\\r)*?\\]))*?\\]') %>%
unlist()
linesplit # Run this to view what happened
现在我们将每行分开,将它们分成列。但是我们不想保留[
或]
所以我们在正则表达式代码中使用正面的lookbehind和正向前瞻来检查是否存在而没有捕获它们。哦,当然,捕捉他们之间的一切。
# Split each line into columns
colsplit <- linesplit %>%
str_extract_all("(?<=\\[)(.|\\n|\\r)*?(?=\\])")
colsplit # Run this to view what happened
现在我们有一个列表,每行都有一个对象。在每个对象中,每列有4个项目。我们需要将这4个项目转换为数据框,然后将这些数据框架连接在一起。
# Convert each line to a dataframe, then join the dataframes together
df <- lapply(colsplit,
function(x){
data.frame(
PRIOR = x[1],
Datetime = x[2],
ClassName = x[3],
Msg = x[4],
stringsAsFactors = FALSE
)
}
) %>%
do.call(rbind,.)
df
# PRIOR Datetime ClassName Msg
# 1 WARN 2016-12-16 13:43:10,138 ConfigManagerLoader Low max memory=
# 2 DEBUG 2016-05-26 10:10:22,185 DataSourceImpl SELECT mr.lb_id
# 3 ERROR 2016-12-21 13:51:04,710 DWRWorkflowService Update Wizard -
# Note: there are extra spaces that probably should be trimmed,
# and the dates are slightly messed up. I'll leave those for the
# questioner to fix using a mutate and the string functions.
我将留给你修复额外的空间和日期字段。