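I have a file containing 50,000 records from a log collection. For each record I need to pull out the values after "State": and "Code":. I tried a regular expression but could not get it to work. Instead, I tried the command below to see whether I could extract just one of the values, but it simply times out.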
# this never completes
sub(".*?Code(.*?);.*", "\\1", logfile)
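I have no experience with this kind of work, so I appreciate any help! This is how the log file is formatted (the records are actually JSON). My goal is to return the following values (it is fine if the State and Code labels cannot be included):
(State: Red, Code: null) (State: Blue, Code: no receipt)
Below is the exact syntax of the logfile, with 2 records: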
"
2020-05-12 00:07:00.9681200, z123-asddfas,"
========== mode for SKU ==========
========== Records found ==========
No records found
========== DRecords found ==========
No drecords found
"
2020-05-12 00:08:46.5076411,qwer98-asdha,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Red",
"Code": null
}
========== DRecords found ==========
No drecords found
"
2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Blue",
"Code": "no receipt"
}
Read in your text:
logIn <- read_lines('"
2020-05-12 00:07:00.9681200, z123-asddfas,"
========== mode for SKU ==========
========== Records found ==========
No records found
========== DRecords found ==========
No drecords found
"
2020-05-12 00:08:46.5076411,qwer98-asdha,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Red",
"Code": null
}
========== DRecords found ==========
No drecords found
"
2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Blue",
"Code": "no receipt"
}')
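For the real 50,000-record file, the same character vector would come from disk rather than from a pasted string. A minimal sketch using base R's readLines(); the temporary file and the name log_path are stand-ins for illustration only, not the asker's actual file:

```r
# Hypothetical stand-in for the real log: write a few sample lines to a temp file
log_path <- tempfile(fileext = ".log")
writeLines(c(
  '2020-05-12 00:08:46.5076411,qwer98-asdha,"',
  '========== records found ==========',
  '{',
  '"State": "Red",',
  '"Code": null',
  '}'
), log_path)

# readLines() returns one character element per line of the file
log_lines <- readLines(log_path)
length(log_lines)
```

The resulting vector can be passed to tibble(lines = ...) in the same way as logIn.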
Put it into a form you can wrangle, then clean and filter:
library(tidyverse)
tibble(lines = logIn) %>%
  # Keep only the lines with 'state' or 'code'
  filter(str_detect(lines, "(?ix) ( state | code )")) %>%
  # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', '')) %>%
  # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":")
What do we get?
# A tibble: 4 x 2
  ATTR  VALUE
  <chr> <chr>
1 State Red
2 Code  null
3 State Blue
4 Code  noreceipt
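The cleanup step strips every whitespace character, which is why "no receipt" comes back as "noreceipt". A minimal base R sketch of an alternative, assuming each "State" or "Code" pair sits on its own line as in the sample. It keeps spaces inside the value and turns a bare JSON null into NA. The helper name extract_pairs is hypothetical, and unlike the filter above it matches the keys case-sensitively:

```r
# Hypothetical helper: pull "State"/"Code" key-value lines out of raw log lines
extract_pairs <- function(lines) {
  # Matches lines such as   "State": "Red",   or   "Code": null
  pattern <- '^\\s*"(State|Code)"\\s*:\\s*(.*?)\\s*,?\\s*$'
  kept <- lines[grepl(pattern, lines, perl = TRUE)]
  attr_name <- sub(pattern, "\\1", kept, perl = TRUE)
  value <- sub(pattern, "\\2", kept, perl = TRUE)
  # Treat a bare JSON null as missing, then drop the surrounding quotes
  value[value == "null"] <- NA_character_
  value <- gsub('^"|"$', "", value)
  data.frame(ATTR = attr_name, VALUE = value, stringsAsFactors = FALSE)
}

sample_lines <- c(
  '========== records found ==========',
  '{',
  '"State": "Blue",',
  '"Code": "no receipt"',
  '}',
  '"State": "Red",',
  '"Code": null'
)
pairs <- extract_pairs(sample_lines)
pairs
```

Because the pattern is anchored to whole lines, header lines and the "No records found" lines are skipped without a separate filter step.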
##################### As requested
tibble(lines = logIn) %>%
  # Keep only the lines with 'state' or 'code'
  filter(str_detect(lines, "(?ix) ( state | code )")) %>%
  # This ID will come in useful
  rowid_to_column("ID") %>%
  # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', ''),
         # Give each State and Code the same ID.
         ID = floor((ID + 1) / 2)) %>%
  # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":") %>%
  # spread takes it from long form to wide form
  spread(key = ATTR, value = VALUE) %>%
  select(ID, State, Code)
# A tibble: 2 x 3
     ID State Code
  <dbl> <chr> <chr>
1     1 Red   null
2     2 Blue  noreceipt
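The ID arithmetic floor((ID + 1) / 2) maps rows 1, 2, 3, 4 to IDs 1, 1, 2, 2, so it relies on every record contributing exactly one State line followed by one Code line. A base R sketch of a variant that starts a new record at each State row instead, so a record with a missing Code line would not shift the later pairs. The data frame long is hand-built here to mirror the long-form result and is an assumption for illustration:

```r
# Long-form input, one row per attribute (hand-built to mirror the earlier output)
long <- data.frame(
  ATTR  = c("State", "Code", "State", "Code"),
  VALUE = c("Red", NA, "Blue", "no receipt"),
  stringsAsFactors = FALSE
)

# A new record begins at every "State" row
long$ID <- cumsum(long$ATTR == "State")

states <- long[long$ATTR == "State", ]
codes  <- long[long$ATTR == "Code", ]

# Look up each record's Code by ID; a record without a Code row gets NA
wide <- data.frame(
  ID    = states$ID,
  State = states$VALUE,
  Code  = codes$VALUE[match(states$ID, codes$ID)],
  stringsAsFactors = FALSE
)
wide
```

This still assumes that State comes first within each record, as it does in the sample log.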