R code to extract field values from a JSON log file


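I have a file containing 50,000 records from a log collection. For each record I need to pull out the values that follow "State": and "Code":. I tried regular expressions but could not get them to work. I then tried the command below to see whether I could get at least one of the values, but it just times out.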

# this never completes
sub(".*?Code(.*?);.*", "\\1", logfile)

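I have no experience with this kind of work, so I appreciate the help! This is how the log file is formatted (it is actually JSON). My goal is to return the following values (it is fine if the State and Code labels cannot be included):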

(State: Red, Code: null)
(State: Blue, Code: no receipt)

Below is the exact syntax of the logfile, with 2 records in it:

 "
    2020-05-12 00:07:00.9681200, z123-asddfas,"
    ========== mode for SKU ==========
    ========== Records found ==========
    No records found
    ========== DRecords found ==========
    No drecords found
    "
    2020-05-12 00:08:46.5076411,qwer98-asdha,"
    ========== mode for SKU ==========
    ========== records found ==========
    {
        "State":  "Red",
        "Code":  null
    }
    ========== DRecords found ==========
    No drecords found
    "
    2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
    ========== mode for SKU ==========
    ========== records found ==========
    {
        "State":  "Blue",
        "Code":  "no receipt"
    }

1 Answer

Read in your text

# readr:: prefix so this line runs even before the tidyverse is attached
logIn <- readr::read_lines('"
    2020-05-12 00:07:00.9681200, z123-asddfas,"
========== mode for SKU ==========
  ========== Records found ==========
  No records found
========== DRecords found ==========
  No drecords found
"
    2020-05-12 00:08:46.5076411,qwer98-asdha,"
========== mode for SKU ==========
  ========== records found ==========
  {
    "State":  "Red",
    "Code":  null
  }
========== DRecords found ==========
  No drecords found
"
    2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
========== mode for SKU ==========
  ========== records found ==========
  {
    "State":  "Blue",
    "Code":  "no receipt"
  }')
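
The call above passes the log text inline. For the real 50,000-record file, the same step would presumably take a path instead. A minimal sketch using base R's `readLines()`: `log_path` and `log_lines` are hypothetical names, and a temp file stands in for the actual log so the snippet runs on its own.

```r
# Hypothetical path: a temp file stands in for the real log file
log_path <- tempfile(fileext = ".log")
writeLines(
  c('    {', '        "State":  "Red",', '        "Code":  null', '    }'),
  log_path
)

# One character element per line of the file
log_lines <- readLines(log_path)
print(length(log_lines))
```

The resulting character vector has the same shape as `logIn`, so it should be usable in the pipeline that follows.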

Get it into a form that can be wrangled, then clean and filter

library(tidyverse)
tibble(lines = logIn) %>% 
     # Keep only the lines with 'state' or 'code'
  filter(str_detect(lines, "(?ix) ( state | code )")) %>% 
     # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', '')) %>% 
     # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":")

What do we get?

# A tibble: 4 x 2
  ATTR  VALUE    
  <chr> <chr>    
1 State Red      
2 Code  null     
3 State Blue     
4 Code  noreceipt
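
Note that the output shows `noreceipt` where the question asked for `no receipt`: the `["\\s,]` cleanup removes every whitespace character, including the one inside the value. If the inner space matters, one option is to capture the value with a regex instead of deleting characters. This is a minimal sketch in base R, not the answer's method; `sample_lines`, `pattern`, `hits`, and `result` are illustrative names, and it assumes each key/value pair sits on its own line as in the log excerpt.

```r
# A few lines copied from the log excerpt in the question
sample_lines <- c(
  '    ========== records found ==========',
  '        "State":  "Red",',
  '        "Code":  null',
  '    No drecords found',
  '        "State":  "Blue",',
  '        "Code":  "no receipt"'
)

# Group 1 is the key, group 2 the raw value; a trailing comma stays outside
pattern <- '^\\s*"(State|Code)":\\s*(.*?),?\\s*$'
hits <- regmatches(sample_lines, regexec(pattern, sample_lines, perl = TRUE))
hits <- hits[lengths(hits) == 3]  # keep only the lines that matched

result <- data.frame(
  ATTR = vapply(hits, `[`, character(1), 2),
  # Strip the surrounding double quotes but keep spaces inside the value
  VALUE = gsub('^"|"$', "", vapply(hits, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)
print(result)
```

Because the pattern is anchored to the `"State":` and `"Code":` keys, the separator and header lines are dropped without a separate filter step.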
##################### As requested
tibble(lines = logIn) %>% 
  # Keep only the lines with 'state' or 'code'
  filter(str_detect(lines, "(?ix) ( state | code )")) %>% 
    # This ID will come in useful
  rowid_to_column("ID") %>% 
  # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', ''),
         # Give each State and Code the same ID.
         ID = floor((ID + 1) / 2)) %>% 
  # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":") %>% 
    # spread() takes it from long form to wide form
  spread(key = ATTR, value = VALUE) %>% 
  select(ID, State, Code)

# A tibble: 2 x 3
     ID State Code     
  <dbl> <chr> <chr>    
1     1 Red   null     
2     2 Blue  noreceipt
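
The `floor((ID + 1) / 2)` step pairs rows purely by position, so it relies on State and Code lines strictly alternating. Whether that holds across all 50,000 records is not something the excerpt can confirm. As a hedged alternative, the ID could start a new group at every State line. The sketch below compares the two on a hypothetical column (`attr_col`) in which one record lacks its Code line.

```r
# Hypothetical ATTR column: the second record has no Code line
attr_col <- c("State", "Code", "State", "State", "Code")

# Positional pairing, as in the pipeline above
id_by_pairs <- floor((seq_along(attr_col) + 1) / 2)

# New group whenever a State line appears
id_by_state <- cumsum(attr_col == "State")

print(rbind(id_by_pairs, id_by_state))
```

With positional pairing, the second and third State lines end up sharing an ID, while the `cumsum()` version keeps each State in its own group.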