I have a data file in the format shown below. It has 5 columns and roughly 2,000,000 rows:
# some text
# some more text
#
# Column names Units
# ------------------------------------------------------ ------------------------
#@ 1 "aaaa" "s"
#@ 2 "bbbbbbb " "kg"
#@ 3 "cccccccc" "m"
#@ 4 "dddddddd" "lb"
#@ 5 "eeeeeeee" "m"
2 4 5 6 7
7 8 9 3 2
...
...
...
# row 145800
# row 145801
# row 145802
# row 145803
# row 145804
3 4 6 7 9
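The idea is to create a data frame with fread(). Before that, I need to skip the lines that contain the "#" character. One complication in this example is that "#" also appears somewhere in the middle of the text file, as in lines 145800 to 145804. So I split the data into two character vectors and then merge them, which removes the "#" lines at 145800 to 145804. The lines starting with "#@" are kept because they hold the column names; once those have been mapped to the columns, I will delete them.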
library(data.table)
# Path of the data file
path <- "C:/data.txt"
# read original data file.
# Does the same as readLines() - inspired by https://stackoverflow.com/questions/32920031/how-to-use-fread-as-readlines-without-auto-column-detection
lines_original <- fread(path, sep = "?", header = FALSE)[[1L]]
# Read the first 100 lines of the file into a character vector
lines_subset <- fread(path, sep = "?", header = FALSE, nrows = 100)[[1L]]
# Identify the lines that contain the special character in the first 100 lines
special_lines_1 <- grep("\\#", lines_subset)
# Identify the lines that contain the special character in the entire file
special_lines_2 <- grep("\\#", lines_original)
# Subset of lines_subset containing "#"
lines_1 <- lines_subset[special_lines_1]
# Subset of lines_original NOT containing "#" (negative index drops the matches)
lines_2 <- lines_original[-special_lines_2]
# merging lines_1 and lines_2 so that "#" is removed everywhere apart from first 100 lines
lines_new <- c(lines_1, lines_2)
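# Number of leading lines to skip: index of the last line that still contains "#".
# lines_new is already a character vector, so no readLines(textConnection()) round trip is needed.
skip <- tail(grep("\\#", lines_new), 1)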
I now want to convert lines_new into a data frame with the following code:
df <- fread(text = lines_new, skip = skip, header = FALSE)
As you can see, I call fread() several times. Is there a way to avoid the final fread() call, given that the data has already been read into memory?
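For the column-name mapping mentioned earlier, this is roughly what I have in mind. It is only a minimal base-R sketch, and it assumes every "#@" line has the layout index, quoted name, quoted unit. The header_lines vector and the col_names / col_units variables are made up for illustration; in the real script the input would be lines_1.

```r
# Hypothetical header lines in the same '#@ <index> "<name>" "<unit>"' layout as the file
header_lines <- c(
  '# Column names Units',
  '#@ 1 "aaaa" "s"',
  '#@ 2 "bbbbbbb " "kg"',
  '#@ 3 "cccccccc" "m"'
)

# Keep only the lines that start with "#@"
name_lines <- grep("^#@", header_lines, value = TRUE)

# Extract every double-quoted field on each line (returns one character vector per line)
quoted <- regmatches(name_lines, gregexpr('"[^"]*"', name_lines))

# First quoted field is the column name, second is the unit; strip quotes and stray spaces
col_names <- vapply(quoted, function(x) trimws(gsub('"', "", x[1])), character(1))
col_units <- vapply(quoted, function(x) trimws(gsub('"', "", x[2])), character(1))
```

With the real data, col_names could then be applied with setnames(df, col_names), assuming df has exactly one column per "#@" line.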