Efficient way to remove certain lines from a text file and then convert it to a table with fread()


I have a data file in the format shown below. It has 5 columns and about 2,000,000 rows.

# some text
# some more text
#
#       Column names                                            Units                         
#       ------------------------------------------------------  ------------------------ 
#@   1  "aaaa"                                                  "s"                       
#@   2  "bbbbbbb "                                              "kg"                     
#@   3  "cccccccc"                                              "m"                     
#@   4  "dddddddd"                                              "lb"                     
#@   5  "eeeeeeee"                                              "m"                     

2 4 5 6 7 
7 8 9 3 2 
...
...
...

# row 145800
# row 145801
# row 145802 
# row 145803
# row 145804

3 4 6 7 9 

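The idea is to create a data frame with fread(). Before that, I need to skip the lines that contain the "#" character. One problem in this example is that "#" also appears partway through the text file, as in rows 145800 to 145804. So I split the data into two character vectors and then combine them, which drops the "#" lines at rows 145800 to 145804. The lines starting with "#@" are kept because they hold the column names; I will remove them later, once they have been mapped to the columns.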
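As a rough illustration of the "map the #@ lines to column names" step, here is a minimal sketch. It assumes every "#@" line carries the name as its first double-quoted field and the unit as its second, as in the sample above; `header_lines`, `col_names` and `col_units` are hypothetical names used only for this example.

```r
# Hypothetical header lines in the same layout as the sample file
header_lines <- c(
  "# some text",
  '#@   1  "aaaa"        "s"',
  '#@   2  "bbbbbbb "    "kg"'
)

# Keep only the metadata lines that start with "#@"
meta <- header_lines[startsWith(header_lines, "#@")]

# Extract every double-quoted field on each metadata line
quoted <- regmatches(meta, gregexpr('"[^"]*"', meta))

# Assumption: first quoted field = column name, second = unit
col_names <- trimws(gsub('"', "", vapply(quoted, `[`, character(1), 1)))
col_units <- trimws(gsub('"', "", vapply(quoted, `[`, character(1), 2)))
```

The `trimws()` call removes the trailing space that appears inside some of the quoted names (for example `"bbbbbbb "`).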

# path of the data file
path <-  "C:/data.txt"

# read original data file. 
# Does the same as readLines() - inspired by https://stackoverflow.com/questions/32920031/how-to-use-fread-as-readlines-without-auto-column-detection
lines_original <- fread(path, sep= "?", header = FALSE)[[1L]] 

# Read the first 100 lines of the file into a character vector
lines_subset<- fread(path, sep= "?", header = FALSE, nrows = 100)[[1L]]


# Identify the lines that contain the special character in the first 100 lines
special_lines_1 <- grep("\\#", lines_subset)

# Identify the lines that contain the special character in the entire file
special_lines_2 <- grep("\\#", lines_original)


# Subset of lines_subset containing "#" 
lines_1 <- lines_subset[special_lines_1] 


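# Subset of lines_original that does NOT contain "#"
# (setdiff() is used instead of a negative index so that an empty
#  special_lines_2 keeps every line rather than dropping them all)
lines_2 <- lines_original[setdiff(seq_along(lines_original), special_lines_2)]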

# merging lines_1 and lines_2 so that "#" is removed everywhere apart from first 100 lines 
lines_new <- c(lines_1, lines_2)


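# Position of the last "#" line in lines_new; lines_new is already a
# character vector, so no readLines(textConnection()) round trip is needed
skip <- tail(grep("#", lines_new, fixed = TRUE), 1)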
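One thing to watch in the code above is negative indexing with the positions returned by grep(): if no line matches, the index is empty and the subset comes back empty rather than unchanged. The short sketch below shows this on a hypothetical two-line vector `x` and contrasts it with a logical mask built by grepl().

```r
# Hypothetical input with no "#" lines at all
x <- c("1 2", "3 4")

# No line contains "#", so grep() returns integer(0)
hits <- grep("#", x, fixed = TRUE)

# Negative indexing with an empty index selects nothing
dropped <- x[-hits]

# A logical mask keeps every line that does not match
kept <- x[!grepl("#", x, fixed = TRUE)]
```

With real data that always has "#" lines the two forms give the same result; the mask is simply the safer default.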

I now want to convert lines_new to a data frame with the following code:

df <- fread(text = lines_new, skip = skip,  header = FALSE)

As you can see, I call fread() several times. Is there a way to avoid the final fread() call, given that the data has already been read into memory?
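For reference, one possible way to cut the number of passes is sketched below. The lines held in memory are still text, so some parsing step is needed to turn them into numeric columns; the sketch reads the file once with readLines() and calls fread(text = ...) a single time on the filtered lines. It assumes comment lines always begin with "#", and it uses a small temporary file as a stand-in for the real data; `lines_all`, `is_comment` and `is_blank` are hypothetical names.

```r
library(data.table)

# Stand-in for the real file: a few lines in the same layout
path <- tempfile(fileext = ".txt")
writeLines(c(
  "# some text",
  "#",
  '#@   1  "aaaa"      "s"',
  "",
  "2 4 5",
  "7 8 9",
  "",
  "# row 4",
  "",
  "3 4 6"
), path)

# Single read of the file into a character vector
lines_all <- readLines(path)

# Flag comment lines (assumed to start with "#") and blank lines
is_comment <- startsWith(lines_all, "#")
is_blank   <- !nzchar(trimws(lines_all))

# Single parse of the remaining data lines; no skip is needed
# because every "#" line has already been filtered out
df <- fread(text = lines_all[!is_comment & !is_blank], header = FALSE)
```

The "#@" lines are still available in `lines_all[is_comment]`, so column names can be extracted from them and applied afterwards, for example with setnames(). Whether this is faster than the original approach on a 2,000,000-row file would need to be measured.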

r utf-8 fread