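I have a long text file that I read in with readLines() as txt. It has a repeating pattern, but I am only interested in certain lines. Here is a short version of my file: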
[1] "Set 1"
[2] "DVRJ, DVRI, DVRP, DVRR !Parameters"
[3] "DVRJ = 0.0012150"
[4] "DVRI = 0.0007576"
[5] "DVRP = 0.0006010"
[6] "DVRR = 0.0020851"
[7] "TSTR, TSPI, TSF, TSM !Temperature"
[8] " 0.00, 659.22, 1241.55, 1721.16"
[9] "TGDDTR,TGDDPI,TGDDF,TGDDM !GDD above TBD"
[10] " 0.00, 660.52, 1246.67, 1726.62"
[11] "DASTR , DASPI , DASF , DASM !Duration"
[12] " 0.00, 35.00, 70.00, 100.00"
[13] "Set 2"
[14] "DVRJ, DVRI, DVRP, DVRR !Parameters"
[15] "DVRJ = 0.0012713"
[16] "DVRI = 0.0007576"
[17] "DVRP = 0.0005982"
[18] "DVRR = 0.0021067"
[19] "TSTR, TSPI, TSF, TSM !Temperature"
[20] " 0.00, 644.65, 1229.76, 1704.44"
[21] "TGDDTR,TGDDPI,TGDDF,TGDDM !GDD above TBD"
[22] " 0.00, 645.42, 1234.33, 1711.56"
[23] "DASTR , DASPI , DASF , DASM !Duration"
[24] " 0.00, 35.00, 70.00, 100.00"
[25] "Set 3"
[26] "DVRJ, DVRI, DVRP, DVRR !Parameters"
[27] "DVRJ = 0.0012713"
[28] "DVRI = 0.0007576"
[29] "DVRP = 0.0005982"
[30] "DVRR = 0.0021067"
[31] "TSTR, TSPI, TSF, TSM !Temperature"
[32] " 0.00, 644.65, 1229.76, 1704.44"
[33] "TGDDTR,TGDDPI,TGDDF,TGDDM !GDD above TBD"
[34] " 0.00, 645.42, 1234.33, 1711.56"
[35] "DASTR , DASPI , DASF , DASM !Duration"
[36] " 0.00, 35.00, 70.00, 100.00"
I only want to get:
Set *value*
DVRJ = *value*
DVRI = *value*
DVRP = *value*
DVRR = *value*
After that, I want to convert the result into a data frame that looks like this:
Set DVRJ DVRI DVRP DVRR
*value* *value* *value* *value* *value*
*value* *value* *value* *value* *value*
*value* *value* *value* *value* *value*
I first tried using strsplit() to remove the unwanted lines:
strsplit(txt, split = c("DVRJ, DVRI, DVRP, DVRR !Parameters",
"TSTR, TSPI, TSF, TSM !Temperature",
"TGDDTR,TGDDPI,TGDDF,TGDDM !GDD above TBD",
"DASTR , DASPI , DASF , DASM !Duration"
))
Not only does this not work, it also would not remove the values that belong to those header lines. Thanks for your help!
We can use:
library(dplyr)
library(tidyr)
#Select only specific lines which follows a pattern
data.frame(col = grep('(Set\\s+\\d+)|((DVRJ|DVRI|DVRP|DVRR)\\s+=)',
lines, value = TRUE), stringsAsFactors = FALSE) %>%
#Add same separator to "Set" as rest of data i.e "="
mutate(col = ifelse(startsWith(col, 'Set'), gsub('\\s+', ' = ', col), col)) %>%
#Divide data into different columns based on sep
separate(col, c('col', 'value'), sep = " = ", convert = TRUE) %>%
group_by(col) %>%
#Create a unique index column
mutate(Row = row_number()) %>%
#Get data in wide format.
pivot_wider(names_from = col, values_from = value) %>%
select(-Row)
# A tibble: 2 x 5
# Set DVRJ DVRI DVRP DVRR
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.00122 0.000758 0.000601 0.00209
#2 2 0.00127 0.000758 0.000598 0.00211
where lines is:
lines <- c("Set 1", "DVRJ, DVRI, DVRP, DVRR !Parameters",
"DVRJ = 0.0012150", "DVRI = 0.0007576", "DVRP = 0.0006010", "DVRR = 0.0020851",
"TSTR, TSPI, TSF, TSM !Temperature", " 0.00, 659.22, 1241.55, 1721.16",
"TGDDTR,TGDDPI,TGDDF,TGDDM !GDD above TBD", " 0.00, 660.52, 1246.67, 1726.62",
"DASTR , DASPI , DASF , DASM !Duration", " 0.00, 35.00, 70.00, 100.00",
"Set 2", "DVRJ, DVRI, DVRP, DVRR !Parameters", "DVRJ = 0.0012713", "DVRI = 0.0007576",
"DVRP = 0.0005982", "DVRR = 0.0021067")
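The same selection can also be sketched without extra packages. This is only a minimal base R sketch, assuming every "Set" line is followed by exactly one line for each of DVRJ, DVRI, DVRP and DVRR; `sample_lines`, `extract_values` and `result` are hypothetical names introduced for illustration.

```r
# Shortened sample input in the same layout as the question (two sets).
sample_lines <- c(
  "Set 1", "DVRJ, DVRI, DVRP, DVRR !Parameters",
  "DVRJ = 0.0012150", "DVRI = 0.0007576", "DVRP = 0.0006010", "DVRR = 0.0020851",
  "TSTR, TSPI, TSF, TSM !Temperature", " 0.00, 659.22, 1241.55, 1721.16",
  "Set 2", "DVRJ, DVRI, DVRP, DVRR !Parameters",
  "DVRJ = 0.0012713", "DVRI = 0.0007576", "DVRP = 0.0005982", "DVRR = 0.0021067"
)

# Hypothetical helper: keep the lines matching `pattern` and return the
# number captured by its first parenthesised group.
extract_values <- function(x, pattern) {
  hits <- grep(pattern, x, value = TRUE)
  as.numeric(sub(pattern, "\\1", hits))
}

keys <- c("DVRJ", "DVRI", "DVRP", "DVRR")

# One numeric vector per parameter, in file order.
param_cols <- setNames(
  lapply(keys, function(k) {
    extract_values(sample_lines,
                   paste0("^\\s*", k, "\\s*=\\s*([0-9.eE+-]+)\\s*$"))
  }),
  keys
)

# Assumption: all vectors have the same length (one value per set).
result <- as.data.frame(
  c(list(Set = extract_values(sample_lines, "^\\s*Set\\s+(\\d+)\\s*$")),
    param_cols)
)
result
```

Because each column is extracted independently, a set with a missing parameter line would give columns of unequal length and make as.data.frame() fail, which at least surfaces the problem rather than silently misaligning rows.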
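Here is a Base R solution that treats the input as fixed-width records. We use read.fwf() to read the multi-line records, parsing the required data out of lines 1, 3, 4, 5 and 6 of each record.
First, we convert the input data from the OP into an R object to make the example reproducible.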
fixedText = "Set 1
DVRJ, DVRI, DVRP, DVRR !Parameters
DVRJ = 0.0012150
DVRI = 0.0007576
DVRP = 0.0006010
DVRR = 0.0020851
TSTR, TSPI, TSF, TSM !Temperature
 0.00, 659.22, 1241.55, 1721.16
TGDDTR,TGDDPI,TGDDF,TGDDM !GDD above TBD
 0.00, 660.52, 1246.67, 1726.62
DASTR , DASPI , DASF , DASM !Duration
 0.00, 35.00, 70.00, 100.00
Set 2
DVRJ, DVRI, DVRP, DVRR !Parameters
DVRJ = 0.0012713
DVRI = 0.0007576
DVRP = 0.0005982
DVRR = 0.0021067
TSTR, TSPI, TSF, TSM !Temperature
 0.00, 644.65, 1229.76, 1704.44
TGDDTR,TGDDPI,TGDDF,TGDDM !GDD above TBD
 0.00, 645.42, 1234.33, 1711.56
DASTR , DASPI , DASF , DASM !Duration
 0.00, 35.00, 70.00, 100.00
Set 3
DVRJ, DVRI, DVRP, DVRR !Parameters
DVRJ = 0.0012713
DVRI = 0.0007576
DVRP = 0.0005982
DVRR = 0.0021067
TSTR, TSPI, TSF, TSM !Temperature
 0.00, 644.65, 1229.76, 1704.44
TGDDTR,TGDDPI,TGDDF,TGDDM !GDD above TBD
 0.00, 645.42, 1234.33, 1711.56
DASTR , DASPI , DASF , DASM !Duration
 0.00, 35.00, 70.00, 100.00
"
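Next, we set up the objects needed as arguments to read.fwf(), including a list of widths used to read the 12 lines that make up each observation. Negative numbers in the list mark data that is not saved to the output data frame. Note that read.fwf() concatenates the lines of a multi-line record before applying the widths, so the offsets only line up when each input line is padded to its full fixed width.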
widthList <- list(c(-14,3,-45), c(-50), c(-7,9,-50), c(-7,9,-50),
                  c(-7,9,-50), c(-7,9,-50), c(-50), c(-50),
                  c(-50), c(-50), c(-50), c(-50))
theNames <- c("Set","DVRJ", "DVRI", "DVRP", "DVRR")
Finally, we run read.fwf() with these arguments.
options(scipen = 10) # so we can see the 7th decimal place in data
data <- read.fwf(textConnection(fixedText), widths = widthList,
                 flush = TRUE, col.names = theNames)
...and the output:
> data
Set DVRJ DVRI DVRP DVRR
1 1 0.0012150 0.0007576 0.0006010 0.0020851
2 2 0.0012713 0.0007576 0.0005982 0.0021067
3 3 0.0012713 0.0007576 0.0005982 0.0021067
>
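When the input lines do not already have a fixed width, they can be padded before being handed to read.fwf(). The following is a minimal sketch under stated assumptions, not part of the answer above: it uses a shortened, hypothetical three-line record and its own width list, and the names `rawLines`, `lineWidths`, `padded` and `parsed` are introduced only for illustration.

```r
# Hypothetical sample: two records of three lines each, not padded.
rawLines <- c("Set 1", "DVRJ = 0.0012150", "note A",
              "Set 2", "DVRJ = 0.0012713", "note B")

# Per-line field widths for this sample; negative widths are skipped columns.
widthList <- list(c(-4, 3), c(-7, 9), c(-10))

# Total width expected for each line, recycled across all records.
lineWidths <- rep(vapply(widthList, function(w) sum(abs(w)), numeric(1)),
                  length.out = length(rawLines))

# Right-pad each line with spaces (and truncate any overlong line) so that
# every line has exactly the width its entry in widthList describes.
padded <- substr(
  paste0(rawLines, strrep(" ", pmax(0, lineWidths - nchar(rawLines)))),
  1, lineWidths
)

parsed <- read.fwf(textConnection(padded), widths = widthList,
                   col.names = c("Set", "DVRJ"))
parsed
```

Padding first keeps the width list simple, since each entry then only has to describe its own line rather than compensate for lines of varying length.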