How can I iterate over different directories to read and process files?


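I have multiple .txt files stored in different directories (one directory per city). The data is very large, so I don't want to read all the files at once. Instead, I want to loop over the city directories and process each city's files separately.

I am sharing the code I use to first un-gzip the files to get the .txt files and then fix some column-related issues (the number of columns and the column names). At the end of the code, merged holds all files read into a single df. For further processing, however, I would like it to run iteratively over the city directories to make the processing more efficient.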

setwd("C:/Users/Alexia/Desktop/Data/Test_Gz")
#1. UNGZ the files
decompress <- function(file, dest = sub("\\.gz$", "", file)) {
  # Set up source and destination connections
  src <- gzfile(file, "rb")
  on.exit(close(src), add = TRUE)
  dst <- file(dest, "wb")
  on.exit(close(dst), add = TRUE)

  # Copy decompressed contents from source to destination
  BATCH_SIZE <- 10 * 1024^2
  repeat {
    bytes <- readBin(src, raw(), BATCH_SIZE)
    if (length(bytes) != 0) {
      writeBin(bytes, dst)
    } else {
      break
    }
  }

  invisible(dest)
}

# pattern is a regular expression, not a glob, so match a literal ".gz" suffix
files <- list.files(pattern = "\\.gz$", full.names = TRUE, recursive = TRUE)
for (file in files) {
  decompress(file)
}

#2. FIXING COLUMN ISSUES
library(data.table)

#List of files
filelist <- list.files("C:/Users/Alexia/Desktop/Data/Test_Gz/", full.names = TRUE,
                       recursive = TRUE, pattern = "\\.txt$")

#Read the files
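dt <- lapply(filelist, function(file) {
  lines <- readLines(file)
  # Line number of the "*/" that closes the header comment
  comment_end <- match("*/", lines)
  if (is.na(comment_end)) comment_end <- 0  # no header comment found
  fread(file, skip = comment_end)
})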

#Adjust Column names
dt.tidied <- lapply(dt, FUN = function(x){
  #adjust ? to degree
  setnames(x, old = "T2 [?C]", new = "T2 [°C]", skip_absent = TRUE)

  colnames(x) <- gsub("\\[", "(", colnames(x))
  colnames(x) <- gsub("\\]", ")", colnames(x))

  #return
  return(x)
})

#bind, filling missing columns to NA
merged <- rbindlist(dt.tidied, fill = TRUE, use.names = TRUE)
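
The read step above loads every line of each file with readLines only to find the "*/" that closes the header comment, so each large file is effectively read twice. A minimal sketch of a lighter alternative follows, assuming the header always ends within the first 1000 lines. The helper name header_end, the max_header parameter, and the demo file contents are made up for illustration, and base R's read.delim stands in for fread so the sketch runs without extra packages.

```r
# Hypothetical helper: find the line that closes the header comment ("*/")
# by scanning only the first `max_header` lines instead of the whole file.
header_end <- function(path, max_header = 1000L) {
  head_lines <- readLines(path, n = max_header, warn = FALSE)
  idx <- match("*/", trimws(head_lines))
  if (is.na(idx)) 0L else idx  # 0 means: no header comment, skip nothing
}

# Demo on a temporary file with an invented layout
tmp <- tempfile(fileext = ".txt")
writeLines(c("/* DATA DESCRIPTION:",
             "Station: demo",
             "*/",
             "Date/Time\tT2 [?C]",
             "2022-02-01T00:00\t1.5",
             "2022-02-01T00:01\t1.7"), tmp)

skip_n <- header_end(tmp)

# check.names = FALSE keeps the original column names untouched
demo <- read.delim(tmp, skip = skip_n, check.names = FALSE)
```

The value returned by header_end could equally be passed to fread as its skip argument.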

For example, my next steps are shown below. I would now like the following code to run iteratively over the city directories.

library(dplyr)
library(tidyr)      # separate() comes from tidyr
library(lubridate)
mn <- merged %>% separate(`Date/Time`, into = c("Date", "Time"), sep = "T")
mnf <- mn %>% 
  as_tibble() %>%
  group_by(group = as.integer(gl(n(), 15, n()))) %>%
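  # Average numeric columns only; mean() on the character Date/Time columns would only yield NA with warnings
  summarise(across(where(is.numeric), ~ if(mean(is.na(.x)) > 0.5) NA else mean(.x, na.rm = TRUE)))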
write.csv(mnf, 'C:/Users/Alexia/Desktop/Data/Test_Gz/Mean_15.csv')
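
The averaging step above groups rows into blocks of 15 and returns NA for a block when more than half of its values are missing. A base R sketch of the same rule is given here so that it can be checked on a small example. The function name block_mean, its parameters, and the demo column T2 are assumptions for illustration, and only numeric columns are averaged.

```r
# Hypothetical helper: mean over consecutive blocks of `block` rows;
# a block becomes NA when more than `max_na` of its values are missing.
block_mean <- function(df, block = 15L, max_na = 0.5) {
  df <- as.data.frame(df)
  # Block index 1, 1, ..., 2, 2, ... in row order
  grp <- (seq_len(nrow(df)) - 1L) %/% block + 1L
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]

  agg_one <- function(x) {
    if (mean(is.na(x)) > max_na) NA_real_ else mean(x, na.rm = TRUE)
  }

  out <- lapply(df[num_cols], function(col) as.numeric(tapply(col, grp, agg_one)))
  data.frame(group = sort(unique(grp)), out, check.names = FALSE)
}

# Demo: 30 one-minute values; 9 of the 15 values in the second block are missing
demo <- data.frame(T2 = as.numeric(1:30))
demo$T2[16:24] <- NA
res <- block_mean(demo)
```

In this demo the first block averages the values 1 to 15, and the second block is set to NA because 9 of its 15 values (60%) are missing.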

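Can anyone help me modify the code so that it runs iteratively over the city directories? P.S. If anyone thinks iterating would also help for #1 and #2, please feel free to modify the #1 and #2 code to run iteratively as well.

For more information, the files are organized as follows: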

WorkingDirectory
 DET (City A)
  DET_2022_02_01.txt.gz                  #It is monthly file containing per minute data
  DET_2021_12_01.txt.gz                  #There are missing files
  DET_2021_11_01.txt.gz                  #The start and end date of every city differs
  ..
  ..
 MUN (City B)
  MUN_2020_12_01.txt.gz                  
  MUN_2020_11_01.txt.gz
  MUN_2020_08_01.txt.gz
  ..
  ..
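
With one subdirectory per city, as in the layout above, the files can be grouped by city before anything is read. The sketch below builds a throwaway copy of that layout in a temporary folder (the root name Test_Gz_demo and the empty placeholder files are assumptions for illustration) and lists the .gz files of each city separately.

```r
# Build a small stand-in for the directory layout (empty placeholder files)
root <- tempfile("Test_Gz_demo")
for (city in c("DET", "MUN")) {
  dir.create(file.path(root, city), recursive = TRUE)
}
file.create(file.path(root, "DET", c("DET_2022_02_01.txt.gz", "DET_2021_12_01.txt.gz")))
file.create(file.path(root, "MUN", "MUN_2020_12_01.txt.gz"))

# recursive = FALSE returns only the immediate subdirectories (the cities),
# not the root directory itself
city_dirs <- list.dirs(root, full.names = TRUE, recursive = FALSE)

# One character vector of file paths per city, named by the city code
files_by_city <- lapply(city_dirs, list.files, pattern = "\\.gz$", full.names = TRUE)
names(files_by_city) <- basename(city_dirs)
```

Each element of files_by_city can then be handled on its own, so only one city's data needs to be in memory at a time.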

I hope I have made my question clear.

r dplyr iteration tidyr lubridate
1 Answer

I would add a for loop. Something like this:

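#2. FIXING COLUMN ISSUES
library(data.table)

#List of city directories (immediate subdirectories only; with the default
#recursive = TRUE, list.dirs() would also return the root directory itself)
citylist <- list.dirs("C:/Users/Alexia/Desktop/Data/Test_Gz/", recursive = FALSE)

for (citydir in citylist) {

  #List of files for this city only
  filelist <- list.files(citydir, full.names = TRUE, recursive = TRUE,
                         pattern = "\\.txt$")
  if (length(filelist) == 0) next

  #Read the files
  dt <- lapply(filelist, function(file) {
    lines <- readLines(file)
    comment_end <- match("*/", lines)
    if (is.na(comment_end)) comment_end <- 0  #no header comment found
    fread(file, skip = comment_end)
  })

  #Adjust Column names
  dt.tidied <- lapply(dt, FUN = function(x){
    #adjust ? to degree
    setnames(x, old = "T2 [?C]", new = "T2 [°C]", skip_absent = TRUE)

    colnames(x) <- gsub("\\[", "(", colnames(x))
    colnames(x) <- gsub("\\]", ")", colnames(x))

    #return
    return(x)
  })

  #bind, filling missing columns to NA
  merged <- rbindlist(dt.tidied, fill = TRUE, use.names = TRUE)

  #merged now holds one city only and is overwritten on the next pass,
  #so run the per-city processing and write its output here
}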
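
Because merged is replaced on every pass of the loop, each city's result has to be saved before the next city is read. One way to arrange that is to wrap the per-city work in a function that writes its own output file. The sketch below is only an outline under stated assumptions: process_city, the _summary.csv file name, the demo column T2, and the placeholder summary are all made up for illustration, the demo files have no header comment, and base R's read.delim and rbind stand in for fread and rbindlist so that it runs without extra packages.

```r
# Hypothetical per-city driver: read one city's files, summarise, write a CSV
process_city <- function(citydir) {
  files <- list.files(citydir, pattern = "\\.txt$", full.names = TRUE)
  if (length(files) == 0L) return(invisible(NULL))

  # Assumes every file of a city has the same columns; with data.table,
  # rbindlist(fill = TRUE) would cover files whose columns differ
  tables <- lapply(files, read.delim, check.names = FALSE)
  merged <- do.call(rbind, tables)

  # Placeholder for the real per-city processing
  result <- data.frame(city    = basename(citydir),
                       n_rows  = nrow(merged),
                       mean_T2 = mean(merged$T2, na.rm = TRUE))

  out <- file.path(citydir, paste0(basename(citydir), "_summary.csv"))
  write.csv(result, out, row.names = FALSE)
  invisible(out)  # merged goes out of scope here, releasing the memory
}

# Demo: two cities with two small tab-separated files each
root <- tempfile("Test_Gz_demo")
for (city in c("DET", "MUN")) {
  dir.create(file.path(root, city), recursive = TRUE)
  for (m in 1:2) {
    write.table(data.frame(T2 = c(1, 2, 3) * m),
                file.path(root, city, sprintf("%s_2022_%02d_01.txt", city, m)),
                sep = "\t", row.names = FALSE, quote = FALSE)
  }
}

outputs <- unlist(lapply(list.dirs(root, recursive = FALSE), process_city))
```

Keeping the work inside a function means only one city's table exists at a time, which is the point of iterating over the directories in the first place.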
