How can I filter a very large CSV before opening it in R?


I am currently trying to open a 48 GB CSV on my computer. Needless to say, my RAM cannot handle a file that large, so I am trying to filter it before opening it. From what I have researched, the most appropriate way to do this in R is with the sqldf library, more specifically the read.csv.sql function:

df <- read.csv.sql('CIF_FOB_ITIC-en.csv', sql = "SELECT * FROM file WHERE 'Year' IN (2014, 2015, 2016, 2017, 2018)")

However, I get the following message:

Error: duplicated column names: Measure

Since SQL is not case-sensitive, having two variables, one named Measure and the other MEASURE, amounts to duplicated column names. To get around this, I tried passing the header = FALSE argument and replacing 'Year' with V9, which produced the following error instead:

Error in connection_import_file(conn@ptr, name, value, sep, eol, skip): RS_sqlite_import: CIF_FOB_ITIC-en.csv line 2 expected 19 columns of data but found 24

How should I proceed in this case?

Thanks in advance!

sql r csv read.csv
1 Answer

Here is a Tidyverse solution that reads the CSV in chunks, filters each chunk, and stacks the resulting rows. It also does this in parallel, so the whole file still gets scanned, but much faster (depending on how many cores you have) than processing one chunk at a time with, e.g., apply (or purrr::map).

Explanations are in the inline comments.

library(tidyverse)
library(furrr)

# Make a CSV file out of the NASA stock dataset for demo purposes
raw_data_path <- tempfile(fileext = ".csv")
nasa %>% as_tibble() %>% write_csv(raw_data_path)

# Get the row count of the raw data, incl. header row, without loading the
# actual data
raw_data_nrow <- length(count.fields(raw_data_path))

# Hard-code the largest batch size you can, given your RAM in relation to the
# data size per row
batch_size    <- 1e3 

# Set up parallel processing of multiple chunks at a time, leaving one virtual
# core, as usual
plan(multiprocess, workers = availableCores() - 1)
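# Note: in current versions of the future package, `multiprocess` is
# deprecated; plan(multisession, workers = availableCores() - 1) is the
# drop-in replacement.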

filtered_data <- 
  # Define the sequence of start-point row numbers for each chunk (each number
  # is actually the start point minus 1 since we're using the seq. no. as the
  # no. of rows to skip)
  seq(from = 0, 
      # Add the batch size to ensure that the last chunk is large enough to grab
      # all the remainder rows
      to = raw_data_nrow + batch_size, 
      by = batch_size) %>% 
  future_map_dfr(
    ~ read_csv(
      raw_data_path,
      skip      = .x,
      n_max     = batch_size, 
      # Can't read in col. names in each chunk since they're only present in the
      # 1st chunk
      col_names = FALSE,
      # This reads in each column as character, which is safest but slowest and
      # most memory-intensive. If you're sure that each batch will contain
      # enough values in each column so that the type detection in each batch
      # will come to the same conclusions, then comment this out and leave just
      # the guess_max
      col_types = cols(.default = "c"),
      guess_max = batch_size
    ) %>% 
      # This is where you'd insert your filter condition(s)
      filter(TRUE),
    # Progress bar! So you know how many chunks you have left to go
    .progress = TRUE
  ) %>% 
  # The first row will be the header values, so set the column names to equal
  # that first row, and then drop it
  set_names(slice(., 1)) %>% 
  slice(-1)
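
To apply the year filter from the question, the filter(TRUE) placeholder could be swapped for something along the lines of the sketch below (untested against the actual file). It assumes the year sits in the 9th column, as the V9 reference in the question suggests; with col_names = FALSE, read_csv names that column X9, and its values come in as character strings because of col_types = cols(.default = "c"). The header row is kept explicitly so that the set_names()/slice() step at the end can still find the column names in row 1:

# Drop-in replacement for the filter(TRUE) line above: keep the header row
# plus the rows for the requested years
filter(X9 == "Year" | X9 %in% as.character(2014:2018)),

Once filtered_data has been assembled and renamed, the all-character columns can be converted to more sensible types with readr::type_convert(), which is loaded as part of the tidyverse:

filtered_data <- type_convert(filtered_data)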