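I have multiple .txt files stored in different directories (one directory per city). The data is really large, so I don't want to read all files at once; instead I want to loop over the city directories and process each city's files separately.
I am sharing the code I use to first ungz the files to get the txt files and then fix other column-related issues (i.e., the number of columns and the column names). The code currently ends by building "merged", which reads all files into one df, but for the further processing I want it to run iteratively over the city directories to make the processing more efficient.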
setwd("C:/Users/Alexia/Desktop/Data/Test_Gz")
#1. UNGZ the files
decompress <- function(file, dest = sub("\\.gz$", "", file)) {
  # Set up source and destination connections
  src <- gzfile(file, "rb")
  on.exit(close(src), add = TRUE)
  dst <- file(dest, "wb")
  on.exit(close(dst), add = TRUE)
  # Copy decompressed contents from source to destination in 10 MiB batches
  BATCH_SIZE <- 10 * 1024^2
  repeat {
    bytes <- readBin(src, raw(), BATCH_SIZE)
    if (length(bytes) != 0) {
      writeBin(bytes, dst)
    } else {
      break
    }
  }
  invisible(dest)
}
# pattern is a regular expression, not a glob: match names ending in ".gz"
files <- list.files(pattern = "\\.gz$", full.names = TRUE, recursive = TRUE)
for (file in files) {
  decompress(file)
}
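If the decompression should also run one city at a time, a minimal sketch could look like the following. The helper name decompress_city is hypothetical, and it assumes that the function passed as decompress_fun behaves like decompress() above, i.e. it takes the path of a .gz file and returns the path of the decompressed file.

```r
# Hypothetical helper: decompress the .gz files of a single city directory.
# `decompress_fun` is assumed to take a .gz path and return the output path,
# like decompress() defined above.
decompress_city <- function(citydir, decompress_fun) {
  # Only the files directly inside this city directory
  gz_files <- list.files(citydir, pattern = "\\.gz$", full.names = TRUE)
  # Returns a character vector with the paths of the decompressed files
  vapply(gz_files, function(f) decompress_fun(f), character(1), USE.NAMES = FALSE)
}

# Possible usage (the directory name is only an example):
# txt_files <- decompress_city("DET", decompress)
```

Calling it inside a loop over the city directories means only one city's files are decompressed before that city is processed.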
#2. FIXING COLUMN ISSUES
library(data.table)
#List of files
filelist <- list.files("C:/Users/Alexia/Desktop/Data/Test_Gz/", full.names = TRUE,
                       recursive = TRUE, pattern = "\\.txt$")
#Read the files
dt <- lapply(filelist, function(file) {
  lines <- readLines(file)
  # Line number of the "*/" that closes the header comment; fread skips up to it
  comment_end <- match("*/", lines)
  fread(file, skip = comment_end)
})
#Adjust Column names
dt.tidied <- lapply(dt, FUN = function(x) {
  #adjust ? to degree
  setnames(x, old = "T2 [?C]", new = "T2 [°C]", skip_absent = TRUE)
  colnames(x) <- gsub("\\[", "(", colnames(x))
  colnames(x) <- gsub("\\]", ")", colnames(x))
  #return
  return(x)
})
#bind, filling missing columns to NA
merged <- rbindlist(dt.tidied, fill = TRUE, use.names = TRUE)
For example, my next steps are shown below. I would like the following code to run iteratively over the city directories.
library(dplyr)
library(tidyr)  # separate() comes from tidyr, not dplyr
library(lubridate)
mn <- merged %>% separate(`Date/Time`, into = c("Date", "Time"), sep = "T")
mnf <- mn %>%
  as_tibble() %>%
  group_by(group = as.integer(gl(n(), 15, n()))) %>%
  summarise(across(everything(), ~ if (mean(is.na(.x)) > 0.5) NA else mean(.x, na.rm = TRUE)))
write.csv(mnf, 'C:/Users/Alexia/Desktop/Data/Test_Gz/Mean_15.csv')
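To make the averaging step concrete: gl(n(), 15, n()) puts consecutive rows into blocks of 15, so with per-minute data each output row is a 15-minute mean, and a block becomes NA when more than half of its values are missing. Here is a small base-R sketch of the same rule on a toy vector. The helper name block_mean is hypothetical and is not part of the code above.

```r
# Hypothetical helper: average a numeric vector in consecutive blocks of `size`,
# returning NA for a block when more than half of its values are missing.
block_mean <- function(x, size = 15) {
  # Block index 1,1,...,2,2,... (same grouping idea as gl(n, size, n))
  block <- (seq_along(x) - 1) %/% size + 1
  unname(vapply(split(x, block), function(v) {
    if (mean(is.na(v)) > 0.5) NA_real_ else mean(v, na.rm = TRUE)
  }, numeric(1)))
}

# Toy data with blocks of 4 instead of 15:
# block 1 has 1 of 4 values missing, block 2 has 3 of 4 missing
x <- c(1, 2, 3, NA, NA, NA, NA, 8)
block_mean(x, size = 4)
# first block: mean of 1, 2, 3 = 2; second block: NA
```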
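Can anyone help me modify the code so that it runs iteratively over the city directories? P.S. If anyone thinks that iterating steps #1 and #2 would also help, please feel free to modify the #1 and #2 code to run iteratively as well.
For more information, the files are organized as follows: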
WorkingDirectory
  DET (City A)
    DET_2022_02_01.txt.gz  #Monthly file containing per-minute data
    DET_2021_12_01.txt.gz  #Some months are missing
    DET_2021_11_01.txt.gz  #The start and end dates differ between cities
    ..
    ..
  MUN (City B)
    MUN_2020_12_01.txt.gz
    MUN_2020_11_01.txt.gz
    MUN_2020_08_01.txt.gz
    ..
    ..
I hope I have made my question clear.
I would add a for loop. Something like this:
#2. FIXING COLUMN ISSUES
library(data.table)
#List of city directories: immediate sub-directories only, excluding the root itself
citylist <- list.dirs("C:/Users/Alexia/Desktop/Data/Test_Gz/", recursive = FALSE)
for (citydir in citylist) {
  #List of files
  filelist <- list.files(citydir, full.names = TRUE, recursive = TRUE,
                         pattern = "\\.txt$")
  #Read the files
  dt <- lapply(filelist, function(file) {
    lines <- readLines(file)
    comment_end <- match("*/", lines)
    fread(file, skip = comment_end)
  })
  #Adjust Column names
  dt.tidied <- lapply(dt, FUN = function(x) {
    #adjust ? to degree
    setnames(x, old = "T2 [?C]", new = "T2 [°C]", skip_absent = TRUE)
    colnames(x) <- gsub("\\[", "(", colnames(x))
    colnames(x) <- gsub("\\]", ")", colnames(x))
    #return
    return(x)
  })
  #bind, filling missing columns to NA
  merged <- rbindlist(dt.tidied, fill = TRUE, use.names = TRUE)
  #merged is overwritten on every iteration, so the per-city processing
  #(e.g. the 15-minute averaging and write.csv) has to happen here, inside the loop
}
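Two things to keep in mind with a loop like this: each city's result has to be processed and saved before the next iteration starts, and every city needs its own output file, otherwise one fixed file name gets overwritten. A minimal base-R sketch of that structure follows. The names run_per_city, process_city and out_name are hypothetical, and process_city stands for whatever reads the city's .txt files and returns one data frame (for example the read, tidy, rbindlist and averaging steps from the question).

```r
# Hypothetical driver: run `process_city` once per city directory and
# write one CSV per city. `process_city(files)` is an assumed user-supplied
# function that takes a vector of .txt paths and returns a data frame.
run_per_city <- function(root, process_city, out_name = "Mean_15.csv") {
  # recursive = FALSE: only the immediate sub-directories, not `root` itself
  citydirs <- list.dirs(root, full.names = TRUE, recursive = FALSE)
  outputs <- character(0)
  for (citydir in citydirs) {
    files <- list.files(citydir, pattern = "\\.txt$", full.names = TRUE)
    if (length(files) == 0) next  # skip directories without .txt files
    result <- process_city(files)
    # One output file per city, e.g. <root>/DET/Mean_15.csv
    out <- file.path(citydir, out_name)
    write.csv(result, out, row.names = FALSE)
    outputs <- c(outputs, out)
    rm(result)  # drop this city's data before the next iteration
  }
  invisible(outputs)
}

# Possible usage, assuming process_city is defined as described above:
# run_per_city("C:/Users/Alexia/Desktop/Data/Test_Gz", process_city)
```

Because only one city's data is held in memory at a time, the peak memory use is bounded by the largest city rather than by the whole data set.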