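I am trying to read multiple .csv files from a folder and combine all of the data into a single data frame for analysis and plotting. Normally, I would use this approach to load and merge all of the files.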
file_list <- list.files(paste(WorkingDirectory, "/Transducer Data", sep = ""), pattern = "*.csv",
                        full.names = TRUE)
for (file in file_list) {
  all_transducer_file <- read.csv(file, header = F, as.is = T, sep = ",", skip = 15)
}
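However, I have run into two problems.
1. The generated .csv files have a varying number of rows before the data. The data header is always: "Date and Time", "Seconds", "Pressure (PSI)", and "Surface Water Level (ft)". The number of rows before it depends on how many errors the device has raised since the last data download.
2. The data sometimes loads as type "chr" and sometimes as type "factor". I don't really understand the difference between the two, or how it might affect my code.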
Is there a way to open a csv while skipping the first X rows, where X is based on where a specified header is found?
Thanks! Mel
Since you know that Date and Time appears in the header, try this:
library(data.table)
fread(filename, skip = "Date and Time")
See ?fread for other arguments that you may or may not need.
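To cover the whole folder, the same call can be applied to every file and the results stacked. This is only a minimal sketch, assuming each file contains a header line with the text Date and Time. The two temporary files, their contents, and the names tmp_dir and all_data are made up here for illustration and are not part of the original data.

```r
library(data.table)

# Hypothetical sample files: same header, different amounts of text before it
tmp_dir <- file.path(tempdir(), "transducer_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("Device log", "Error: 12",
             "Date and Time,Seconds,Pressure (PSI)",
             "2020-01-01 00:00:00,0,14.7",
             "2020-01-01 00:00:01,1,14.8"),
           file.path(tmp_dir, "a.csv"))
writeLines(c("Device log",
             "Date and Time,Seconds,Pressure (PSI)",
             "2020-01-02 00:00:00,0,15.1"),
           file.path(tmp_dir, "b.csv"))

file_list <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# skip = "Date and Time" tells fread to start reading at the first line
# that contains that text, so the number of preceding lines does not matter
all_data <- rbindlist(lapply(file_list, fread, skip = "Date and Time"))
```

If the files might not all have identical columns, rbindlist also accepts fill = TRUE, which pads missing columns with NA.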
So here is a solution to the problem at hand;
Problem and solution:
# Set the path of the folder that contains the csv data
file_list <-
  list.files(paste(WorkingDirectory, "/Transducer Data", sep = ""), pattern = "*.csv",
             full.names = TRUE)
# Find the line at which the table we want starts in each file
# sapply is used to loop over every file in file_list
# grep("Date and Time", readr::read_lines(x))[1] reads the lines of the file and returns the first line number that contains "Date and Time"
# Subtract one from that line number to get the number of lines to skip
skip_lines <-
  sapply(file_list, function(x) {grep("Date and Time", readr::read_lines(x))[1] - 1},
         USE.NAMES = FALSE)
# purrr is used here to loop over the files, but a normal loop or the apply
# family would also work. The benefit of map_df (a function in purrr) is that
# it returns the combined data as one data frame without a separate bind step
library(purrr)
# Method one using read.csv
seq_along(file_list) %>% # loop over the file indices
  map_df(function(x) {
    # Read each file, skipping the number of lines stored for it in skip_lines
    # stringsAsFactors = FALSE keeps text columns as character instead of converting them to factor
    read.csv(file_list[x], skip = skip_lines[x], stringsAsFactors = FALSE)
  })
# Method two using read_csv
seq_along(file_list) %>%
  map_df(function(x) {
    # cols() is called as readr::cols() because readr is never attached with library()
    readr::read_csv(file_list[x], skip = skip_lines[x], col_types = readr::cols())
  })
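On the chr versus factor point raised in the question: a character column stores the strings themselves, while a factor stores integer codes that point into a set of levels. Before R 4.0.0, read.csv converted strings to factors by default, which is why stringsAsFactors = FALSE is passed above. The short sketch below uses made-up values (x_chr and x_fct are illustrative names) to show where the difference tends to cause trouble.

```r
# Made-up readings stored as text, as they might arrive from a messy csv
x_chr <- c("10.5", "9.8", "10.5")   # character vector: plain strings
x_fct <- factor(x_chr)              # factor: integer codes plus a set of levels

num_from_chr   <- as.numeric(x_chr)                 # converts the strings themselves
codes_from_fct <- as.numeric(x_fct)                 # returns the internal level codes, not the readings
num_from_fct   <- as.numeric(as.character(x_fct))   # go through character to recover the readings

str(x_chr)
str(x_fct)
```

Converting a factor straight to numeric silently returns the codes, so keeping columns as character until they are converted deliberately is usually the safer choice.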