Extracting repeated lines/patterns from a text file in R

Question (votes: 0, answers: 2)

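I have a long text file that I read into txt with readLines(). It has a repeating pattern, but I am only interested in certain lines. Here is a short version of my file: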

[1] "Set            1"                                              
[2] "DVRJ, DVRI, DVRP, DVRR !Parameters"           
[3] "DVRJ = 0.0012150"                                                   
[4] "DVRI = 0.0007576"                                                   
[5] "DVRP = 0.0006010"                                                   
[6] "DVRR = 0.0020851"                                                   
[7] "TSTR, TSPI, TSF,  TSM  !Temperature"           
[8] "        0.00,      659.22,     1241.55,     1721.16"                
[9] "TGDDTR,TGDDPI,TGDDF,TGDDM  !GDD above TBD"            
[10] "        0.00,      660.52,     1246.67,     1726.62"                
[11] "DASTR , DASPI , DASF  , DASM  !Duration"
[12] "        0.00,       35.00,       70.00,      100.00"                
[13] "Set            2"                                              
[14] "DVRJ, DVRI, DVRP, DVRR !Parameters"           
[15] "DVRJ = 0.0012713"                                                   
[16] "DVRI = 0.0007576"                                                   
[17] "DVRP = 0.0005982"                                                   
[18] "DVRR = 0.0021067"                                                   
[19] "TSTR, TSPI, TSF,  TSM  !Temperature"           
[20] "        0.00,      644.65,     1229.76,     1704.44"                
[21] "TGDDTR,TGDDPI,TGDDF,TGDDM  !GDD above TBD"            
[22] "        0.00,      645.42,     1234.33,     1711.56"                
[23] "DASTR , DASPI , DASF  , DASM  !Duration"
[24] "        0.00,       35.00,       70.00,      100.00"                
[25] "Set            3"                                              
[26] "DVRJ, DVRI, DVRP, DVRR !Parameters"           
[27] "DVRJ = 0.0012713"                                                   
[28] "DVRI = 0.0007576"                                                   
[29] "DVRP = 0.0005982"                                                   
[30] "DVRR = 0.0021067"                                                   
[31] "TSTR, TSPI, TSF,  TSM  !Temperature"           
[32] "        0.00,      644.65,     1229.76,     1704.44"                
[33] "TGDDTR,TGDDPI,TGDDF,TGDDM  !GDD above TBD"            
[34] "        0.00,      645.42,     1234.33,     1711.56"                
[35] "DASTR , DASPI , DASF  , DASM  !Duration"
[36] "        0.00,       35.00,       70.00,      100.00" 

I only want to get:

Set *value*                                                      
DVRJ = *value*                                                 
DVRI = *value*                                                  
DVRP = *value*                                                  
DVRR = *value*

After that, I want to convert the result into a data frame that looks like this:

  Set      DVRJ    DVRI     DVRP    DVRR
*value*  *value*  *value*  *value*  *value*
*value*  *value*  *value*  *value*  *value*
*value*  *value*  *value*  *value*  *value*

I first tried to remove the unwanted lines with strsplit():

strsplit(txt, split = c("DVRJ, DVRI, DVRP, DVRR !Parameters",
                        "TSTR, TSPI, TSF,  TSM  !Temperature",
                        "TGDDTR,TGDDPI,TGDDF,TGDDM  !GDD above TBD",
                        "DASTR , DASPI , DASF  , DASM  !Duration"
                        ))

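Not only does this not work, but even if it did, it would not remove the value lines that belong to each of those headers. Any help is appreciated. Thanks!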

r string parsing text text-files
2 Answers
2 votes

We can use:

library(dplyr)
library(tidyr)
#Select only specific lines which follows a pattern
data.frame(col = grep('(Set\\s+\\d+)|((DVRJ|DVRI|DVRP|DVRR)\\s+=)', 
                 lines, value = TRUE), stringsAsFactors = FALSE) %>%
   #Add same separator to "Set" as rest of data i.e "="
   mutate(col = ifelse(startsWith(col, 'Set'), gsub('\\s+', ' = ', col), col)) %>%
   #Divide data into different columns based on sep
   separate(col, c('col', 'value'), sep = " = ", convert = TRUE) %>%
   group_by(col) %>%
   #Create a unique index column
   mutate(Row = row_number()) %>%
   #Get data in wide format. 
   pivot_wider(names_from = col, values_from = value) %>%
   select(-Row)


# A tibble: 2 x 5
#    Set    DVRJ     DVRI     DVRP    DVRR
#  <dbl>   <dbl>    <dbl>    <dbl>   <dbl>
#1     1 0.00122 0.000758 0.000601 0.00209
#2     2 0.00127 0.000758 0.000598 0.00211

where lines is:

lines <- c("Set            1", "DVRJ, DVRI, DVRP, DVRR !Parameters", 
"DVRJ = 0.0012150", "DVRI = 0.0007576", "DVRP = 0.0006010", "DVRR = 0.0020851", 
"TSTR, TSPI, TSF,  TSM  !Temperature", "        0.00,      659.22,     1241.55,     1721.16", 
"TGDDTR,TGDDPI,TGDDF,TGDDM  !GDD above TBD", "        0.00,      660.52,     1246.67,     1726.62", 
"DASTR , DASPI , DASF  , DASM  !Duration", "        0.00,       35.00,       70.00,      100.00", 
"Set            2", "DVRJ, DVRI, DVRP, DVRR !Parameters", "DVRJ = 0.0012713", "DVRI = 0.0007576", 
"DVRP = 0.0005982", "DVRR = 0.0021067")
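
To see what the first step of the pipeline above keeps, the grep() filter can be run on its own. This is only a minimal sketch: demo_lines is a shortened, hypothetical copy of the input (not part of the original answer), defined here so the block runs by itself, and the pattern is the same one used in the pipeline.

```r
# Shortened, hypothetical copy of the input so this block is self-contained
demo_lines <- c("Set            1",
                "DVRJ, DVRI, DVRP, DVRR !Parameters",
                "DVRJ = 0.0012150",
                "DVRI = 0.0007576",
                "DVRP = 0.0006010",
                "DVRR = 0.0020851",
                "TSTR, TSPI, TSF,  TSM  !Temperature",
                "        0.00,      659.22,     1241.55,     1721.16")

# Keep "Set <n>" lines and "<name> = <value>" lines; header lines such as
# "DVRJ, DVRI, ..." do not match because no "=" follows the parameter name
pattern <- "(Set\\s+\\d+)|((DVRJ|DVRI|DVRP|DVRR)\\s+=)"
kept <- grep(pattern, demo_lines, value = TRUE)
print(kept)
```

Only the "Set" line and the four "name = value" lines should remain; the header and temperature lines are dropped because they match neither alternative.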

1 vote

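Here is a Base R solution that treats the input as fixed-width records. We use read.fwf() to read the multi-line records and parse the required data from lines 1, 3, 4, 5 and 6 of each record.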

First, we turn the input data from the OP into an R object so that the example is reproducible.

fixedText = "Set            1                                             
DVRJ, DVRI, DVRP, DVRR !Parameters                  
DVRJ = 0.0012150                                                  
DVRI = 0.0007576                                                  
DVRP = 0.0006010                                                  
DVRR = 0.0020851                                                  
TSTR, TSPI, TSF,  TSM  !Temperature                 
        0.00,      659.22,     1241.55,     1721.16               
TGDDTR,TGDDPI,TGDDF,TGDDM  !GDD above TBD           
        0.00,      660.52,     1246.67,     1726.62               
DASTR , DASPI , DASF  , DASM  !Duration             
        0.00,       35.00,       70.00,      100.00               
Set            2                                             
DVRJ, DVRI, DVRP, DVRR !Parameters                  
DVRJ = 0.0012713                                                  
DVRI = 0.0007576                                                  
DVRP = 0.0005982                                                  
DVRR = 0.0021067                                                  
TSTR, TSPI, TSF,  TSM  !Temperature                 
        0.00,      644.65,     1229.76,     1704.44               
TGDDTR,TGDDPI,TGDDF,TGDDM  !GDD above TBD           
        0.00,      645.42,     1234.33,     1711.56               
DASTR , DASPI , DASF  , DASM  !Duration             
        0.00,       35.00,       70.00,      100.00               
Set            3                                             
DVRJ, DVRI, DVRP, DVRR !Parameters                  
DVRJ = 0.0012713                                                  
DVRI = 0.0007576                                                  
DVRP = 0.0005982                                                  
DVRR = 0.0021067                                                  
TSTR, TSPI, TSF,  TSM  !Temperature                 
        0.00,      644.65,     1229.76,     1704.44               
TGDDTR,TGDDPI,TGDDF,TGDDM  !GDD above TBD           
        0.00,      645.42,     1234.33,     1711.56               
DASTR , DASPI , DASF  , DASM  !Duration             
        0.00,       35.00,       70.00,      100.00   
"

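Next, we set up the objects needed as arguments to read.fwf(), including a list of widths that describes how to read the 12 lines of each observation. Negative numbers in the list mark characters that are skipped and not saved to the output data frame.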

widthList <- list(c(-14,3,-45),
               c(-50),
               c(-7,9,-50),
               c(-7,9,-50),
               c(-7,9,-50),
               c(-7,9,-50),
               c(-50),
               c(-50),
               c(-50),
               c(-50),
               c(-50),
               c(-50))
theNames <- c("Set","DVRJ", "DVRI", "DVRP", "DVRR")
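
As a quick illustration of how a negative width behaves, here is a minimal sketch on a single made-up line. The names toy and res are hypothetical and used only for this example; the widths mirror the 7-character label and 9-character value of the parameter lines.

```r
# One made-up record line: 7 characters of label ("DVRJ = "), then a 9-character value
toy <- "DVRJ = 0.0012150"

# Skip the first 7 characters (negative width), keep the next 9 as one column
res <- read.fwf(textConnection(toy), widths = c(-7, 9), col.names = "DVRJ")
print(res$DVRJ)
```

The skipped characters never reach the data frame, so the result has a single numeric column.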

Finally, we run read.fwf() with these arguments.

options(digits = 10) # so we can see the 7th decimal place in data
data <- read.fwf(textConnection(fixedText), widths = widthList,
                 flush=TRUE,col.names = theNames)

...and the output:

> data
  Set      DVRJ      DVRI      DVRP      DVRR
1   1 0.0012150 0.0007576 0.0006010 0.0020851
2   2 0.0012713 0.0007576 0.0005982 0.0021067
3   3 0.0012713 0.0007576 0.0005982 0.0021067
> 