Reading irregularly formatted text files whose fields may or may not contain spaces and numbers


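Sample data is shown below (for reference only; I have hundreds of files like this). The tricky part is the "NO RECORD" entries in the file. I have spent hours trying to read it into R, without success.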

BEGIN DATA
RIM
DATE           AF         QD         QU 
09/30/1920      NO RECORD       370.00  NO RECORD   
10/01/1920      NO RECORD       391.00     391.00 
10/02/1920      NO RECORD       496.00    MISSING 
10/03/1920      NO RECORD       660.00    MISSING 
10/04/1920      NO RECORD       881.00    MISSING 
10/05/1920      NO RECORD       660.00    MISSING 
10/06/1920      NO RECORD       515.00    -9999 
10/07/1920      NO RECORD       443.00    NO RECORD 
10/08/1920      NO RECORD       443.00    MISSING 
10/09/1920      NO RECORD       443.00    443.00 
10/10/1920      NO RECORD       443.00    MISSING

Here is my latest R code:

library(zoo)

# function to read data
obsRead <- function(path2file, filename, number_line_skip, header_or_not) {
  tmpName <- paste(path2file, filename, sep="")
  tmpData <- read.zoo(tmpName,
                   tz='', stringsAsFactors = FALSE, strip.white = TRUE,
                   header=header_or_not, skip=number_line_skip, 
                   na.strings = c("NA", "N/A", "MISSING", "NO RECORD", "-9999")) # tell zoo what NA values look like
  qName <- c("AF", "QD", "QU")
  names(tmpData) <- qName
  index(tmpData) <- as.Date(index(tmpData)) # Convert index from POSIXct to Date
  str(tmpData)
  return(tmpData)  
}

dataDir = "path/to/file/"
dataFile <- "sampleData.txt"
nLineSkip <- 3
header_or_not <- FALSE

Q_obs <- obsRead(dataDir, dataFile, nLineSkip, header_or_not)

And this is the error I get from R:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 6 elements

Any suggestions would be greatly appreciated! Thanks!

Edit: @MichaelChirico posted another solution using data.table here:

 fread(text=x, sep = " ", header=TRUE, fill=length(unlist(strsplit(x[1], " "))), na.strings=strrep("*", 1:6))

r time-series zoo na read.table

3 Answers

5 votes

Try this:

library(zoo)
L <- readLines("path/to/file/sampleData.txt")
L <- gsub("NO RECORD", "NO_RECORD", L)
z <- read.zoo(text = L, header = TRUE, skip = 2, format = "%m/%d/%Y",
        na.strings = c("NA", "N/A", "MISSING", "NO_RECORD", "-9999"))
z

giving:

> z
           AF  QD  QU
1920-09-30 NA 370  NA
1920-10-01 NA 391 391
1920-10-02 NA 496  NA
1920-10-03 NA 660  NA
1920-10-04 NA 881  NA
1920-10-05 NA 660  NA
1920-10-06 NA 515  NA
1920-10-07 NA 443  NA
1920-10-08 NA 443  NA
1920-10-09 NA 443 443
1920-10-10 NA 443  NA
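
To see why the substitution matters, one can count the whitespace-separated fields on each line before and after it. The following is a minimal base R sketch on a few lines copied from the question; count.fields() is a base R function, while count_fields and x are illustrative names that do not appear in the original code.

```r
# A few lines copied from the sample in the question (header plus three rows)
x <- c(
  "DATE           AF         QD         QU ",
  "09/30/1920      NO RECORD       370.00  NO RECORD   ",
  "10/01/1920      NO RECORD       391.00     391.00 ",
  "10/02/1920      NO RECORD       496.00    MISSING "
)

# Illustrative helper: count whitespace-separated fields on each line
count_fields <- function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))
  count.fields(con)
}

# Before the substitution, "NO RECORD" is split into two fields
before <- count_fields(x)
print(before)  # 4 6 5 5

# After joining the token, every line has the same four fields
after <- count_fields(gsub("NO RECORD", "NO_RECORD", x))
print(after)   # 4 4 4 4
```

The first data row has 6 fields and the next has 5, which is consistent with the "line 2 did not have 6 elements" error reported in the question.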

5 votes

Expanding on @MrFlick's comment suggesting sed: fread accepts system commands directly:

library(data.table)
fread("sed 's/NO RECORD/NORECORD/g' < yourFile.txt")  # g: replace every occurrence on a line
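
The same sed preprocessing can also feed base R's read.table() through pipe(), which may help when data.table is not available. This is only a sketch under a few assumptions: sed is on the PATH (a Unix-like system), the file has the layout of the sample in the question (two lines before the header), and the temporary file and the names path, cmd and dat are made up for the demonstration.

```r
# Write a small sample file so the sketch is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c(
  "BEGIN DATA",
  "RIM",
  "DATE           AF         QD         QU ",
  "09/30/1920      NO RECORD       370.00  NO RECORD   ",
  "10/01/1920      NO RECORD       391.00     391.00 ",
  "10/06/1920      NO RECORD       515.00    -9999 "
), path)

# Let sed join the two-word token, then read sed's output through a pipe
cmd <- paste("sed 's/NO RECORD/NORECORD/g'", shQuote(path))
dat <- read.table(pipe(cmd), skip = 2, header = TRUE,
                  stringsAsFactors = FALSE,
                  na.strings = c("NA", "N/A", "MISSING", "NORECORD", "-9999"))

# Parse the date column using the month/day/year layout of the sample
dat$DATE <- as.Date(dat$DATE, format = "%m/%d/%Y")
str(dat)
```

Listing the joined token in na.strings turns it into NA at read time, so the numeric columns are not read as character.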

4 votes

As long as NO RECORD is consistent, this should give you a start:

tmp <- readLines("sample.dat")

# replace 'NO RECORD' with 'NORECORD' and use read.table()
# to process the collapsed vector

tmp.collapse <- paste(gsub("NO RECORD", "NORECORD", 
                      tmp[4:length(tmp)]), sep="", collapse="\n")

# get the column names from the third row and use them in the data table

read.table(textConnection(tmp.collapse), 
           header=FALSE, stringsAsFactors=FALSE, 
           col.names=unlist(strsplit(tmp[3], " +")))

##          DATE       AF  QD       QU
## 1  09/30/1920 NORECORD 370 NORECORD
## 2  10/01/1920 NORECORD 391   391.00
## 3  10/02/1920 NORECORD 496  MISSING
## 4  10/03/1920 NORECORD 660  MISSING
## 5  10/04/1920 NORECORD 881  MISSING
## 6  10/05/1920 NORECORD 660  MISSING
## 7  10/06/1920 NORECORD 515    -9999
## 8  10/07/1920 NORECORD 443 NORECORD
## 9  10/08/1920 NORECORD 443  MISSING
## 10 10/09/1920 NORECORD 443   443.00
## 11 10/10/1920 NORECORD 443  MISSING
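
Since the question mentions hundreds of files, the approach above could be wrapped in a function and applied to every file in a directory. The sketch below is one possible way to do that in base R, assuming all files share the layout of the sample (two lines before the header, dates as month/day/year). The helper read_obs_file, the temporary directory and the file names are hypothetical and exist only for the demonstration.

```r
# Hypothetical helper: read one file laid out like the sample in the question
read_obs_file <- function(path) {
  lines <- readLines(path)
  # Join the two-word token so each row has exactly four fields
  lines <- gsub("NO RECORD", "NORECORD", lines, fixed = TRUE)
  dat <- read.table(text = lines, skip = 2, header = TRUE,
                    stringsAsFactors = FALSE,
                    na.strings = c("NA", "N/A", "MISSING", "NORECORD", "-9999"))
  dat$DATE <- as.Date(dat$DATE, format = "%m/%d/%Y")
  dat
}

# Demo: a temporary directory holding two copies of a small sample
obs_dir <- tempfile("obs")
dir.create(obs_dir)
sample_lines <- c(
  "BEGIN DATA",
  "RIM",
  "DATE           AF         QD         QU ",
  "09/30/1920      NO RECORD       370.00  NO RECORD   ",
  "10/01/1920      NO RECORD       391.00     391.00 ",
  "10/02/1920      NO RECORD       496.00    MISSING "
)
for (f in c("a.txt", "b.txt")) writeLines(sample_lines, file.path(obs_dir, f))

# Read every .txt file into a named list of data frames
files <- list.files(obs_dir, pattern = "\\.txt$", full.names = TRUE)
all_obs <- lapply(files, read_obs_file)
names(all_obs) <- basename(files)
str(all_obs[["a.txt"]])
```

Unlike the output shown above, the placeholder strings become NA here, so QD and QU come back as numeric columns.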