R code: importing fixed-width files that need to be row-bound but have different widths

Question (0 votes, 1 answer)

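I've been working with some messy data. I have 72 files: 9 tables (CCLF1-CCLF9), each split into 8 parts. Every file is a fixed-width file, and I have a dictionary of the correct widths and column names, which I pulled from an Excel spreadsheet.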

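The problem is that, after loading them, I realized that some of the files have an extra 11 characters, which causes them to bind incorrectly.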

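What I want to do is check whether each line of a file has the expected number of characters and, if it does, add 11 spaces at the correct position. I would add a column named "random_11_spaces", then bind the files together and drop the "random_11_spaces" column.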

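For example, the CCLF1 files should have a width of 177, but 5 of the sub-files have a width of 188. I want to check whether the width equals 177 and, if so, add the 11 characters; otherwise, load the file as is.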

I just don't know how to do that. Here is what I have so far to load the data:

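library(readxl)      # excel_sheets(), read_excel()
library(readr)       # read_fwf(), fwf_widths()
library(data.table)  # rbindlist()

# Pull in the CCLF Details xlsx and put them into a list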
details_path <- paste0(mappingPath,"CCLF Dictionary.xlsx")
sheetnames <- excel_sheets(details_path)
CCLF_details <- lapply(sheetnames,read_excel, path = details_path)
names(CCLF_details) <- sheetnames

# Extract the column width and column labels vectors from the xlsx
widths <- unname(sapply(CCLF_details,'[[',"COLUMN_WIDTH", drop = FALSE))
correct_widths <- lapply(widths,sum)
col_labels <- unname(sapply(CCLF_details,'[[',"CLAIM_FIELD_LABEL", drop = FALSE))

# Set up group names for the CCLF Files (CCLF1-CCLF9)
CCLF_files <- paste0("CCLF", 1:9)


proc_files <- function(f, w, y) {

  # Get files with Current CCLF# in name
  files <- list.files(pattern = f)

  # Build a list of data tables from all CCLF# files and pull in the proper widths and column names
  df_list <- lapply(files, function(x) read_fwf(x, fwf_widths(widths = w, col_names = y ), na = c("","NA","~","1000-01-01","9999-12-31")))

  # Bind all of the CCLF# files into one main file and return it
  rbindlist(df_list, fill = TRUE)
}

# Create a list of all the CCLF Files
df_list <- Map(proc_files,CCLF_files,as.vector(widths),as.vector(col_labels))
r dataframe fixed-width
1 answer
0 votes

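Vectorized operations are great when every operation is the same, but when there are exceptions I would suggest a for loop, something like the following. The file naming scheme confused me, so fill in the '...' parts yourself.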

filenames <- ...                                # vector of file names
df_list <- vector("list", length(filenames))    # one slot per file
for (i in seq_along(filenames)) {
  data <- ...                                   # read in filenames[i]
  num_characters <- ...                         # line width of this file
  if (num_characters == 177) {
    ...                                         # fix the width
  }
  df_list[[i]] <- data
}
df <- rbindlist(df_list, fill = TRUE)
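
One possible way to fill in the "fix the width" step is sketched below, using base R only and toy data instead of the real CCLF files. The helper `pad_lines`, its `insert_at` argument, and the 6- and 8-character toy layouts are all hypothetical; in the real data the position of the extra 11 characters is an assumption that has to come from the layout of the 188-character files.

```r
# Hypothetical helper: pad lines of the short layout so they match the wide one.
# `insert_at` is the number of characters that come before the extra field
# (an assumption; take it from the record layout of the wider files).
pad_lines <- function(lines, short_width, insert_at, n_pad = 11) {
  is_short <- nchar(lines) == short_width
  filler <- strrep(" ", n_pad)
  lines[is_short] <- paste0(
    substr(lines[is_short], 1, insert_at),
    filler,
    substr(lines[is_short], insert_at + 1, short_width)
  )
  lines
}

# Toy data: a 6-character layout (two 3-character fields) and an
# 8-character variant with an extra 2-character field in the middle.
short_file <- c("abc123", "def456")
long_file  <- c("abcXX123", "defYY456")

# Bring the short lines up to the wide layout.
padded <- pad_lines(short_file, short_width = 6, insert_at = 3, n_pad = 2)

# Read both variants with the same (wide) set of widths and column names.
read_toy <- function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))
  read.fwf(con, widths = c(3, 2, 3),
           col.names = c("id", "random_spaces", "value"),
           colClasses = "character")
}

combined <- rbind(read_toy(padded), read_toy(long_file))

# Drop the filler column once everything is bound together.
combined$random_spaces <- NULL
```

For the real files, the same idea would presumably be `readLines()` on each file, `pad_lines(lines, short_width = 177, insert_at = ...)`, `writeLines()` to a temporary file, and then the existing `read_fwf()` call with a widths vector that includes the 11-character "random_11_spaces" column.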