How can I filter a very large CSV before opening it in R?


I am currently trying to open a 48 GB CSV on my computer. Needless to say, my RAM cannot handle a file that large, so I am trying to filter it before opening it. From what I have researched, the most appropriate way to do this in R is with the sqldf library, more specifically the read.csv.sql function:

df <- read.csv.sql('CIF_FOB_ITIC-en.csv', sql = "SELECT * FROM file WHERE 'Year' IN (2014, 2015, 2016, 2017, 2018)")

However, I get the following message:

Error: duplicated column names: Measure

Since SQL is not case-sensitive, having two variables, one named Measure and the other MEASURE, amounts to duplicated column names. To get around this, I tried passing the header = FALSE argument and replacing 'Year' with V9, which produced the following error instead:

Error in connection_import_file(conn@ptr, name, value, sep, eol, skip): RS_sqlite_import: CIF_FOB_ITIC-en.csv line 2 expected 19 columns of data but found 24

How should I proceed in this case?

Thanks in advance!

sql r csv read.csv
1 Answer

Here is a Tidyverse solution that reads the CSV in chunks, filters each chunk, and stacks the resulting rows. It also does this in parallel, so the whole file still gets scanned, but much faster (depending on how many cores you have) than processing one chunk at a time with, e.g., apply (or purrr::map).

Explanations are in the inline comments.

library(tidyverse)
library(furrr)

# Make a CSV file out of the NASA stock dataset for demo purposes
raw_data_path <- tempfile(fileext = ".csv")
nasa %>% as_tibble() %>% write_csv(raw_data_path)

# Get the row count of the raw data, incl. header row, without loading the
# actual data
raw_data_nrow <- length(count.fields(raw_data_path))

# Hard-code the largest batch size you can, given your RAM in relation to the
# data size per row
batch_size    <- 1e3 

# Set up parallel processing of multiple chunks at a time, leaving one virtual
# core, as usual
plan(multiprocess, workers = availableCores() - 1)
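# Note: in current versions of the future package, `multiprocess` is
# deprecated; plan(multisession, workers = availableCores() - 1) is the
# drop-in replacement.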

filtered_data <- 
  # Define the sequence of start-point row numbers for each chunk (each number
  # is actually the start point minus 1 since we're using the seq. no. as the
  # no. of rows to skip)
  seq(from = 0, 
      # Add the batch size to ensure that the last chunk is large enough to grab
      # all the remainder rows
      to = raw_data_nrow + batch_size, 
      by = batch_size) %>% 
  future_map_dfr(
    ~ read_csv(
      raw_data_path,
      skip      = .x,
      n_max     = batch_size, 
      # Can't read in col. names in each chunk since they're only present in the
      # 1st chunk
      col_names = FALSE,
      # This reads in each column as character, which is safest but slowest and
      # most memory-intensive. If you're sure that each batch will contain
      # enough values in each column so that the type detection in each batch
      # will come to the same conclusions, then comment this out and leave just
      # the guess_max
      col_types = cols(.default = "c"),
      guess_max = batch_size
    ) %>% 
      # This is where you'd insert your filter condition(s)
      filter(TRUE),
    # Progress bar! So you know how many chunks you have left to go
    .progress = TRUE
  ) %>% 
  # The first row will be the header values, so set the column names to equal
  # that first row, and then drop it
  set_names(slice(., 1)) %>% 
  slice(-1)
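
To apply the year filter from the question, the filter(TRUE) placeholder could be swapped for something along the lines of the sketch below (untested against the actual file). It assumes the year sits in the 9th column, as the V9 reference in the question suggests; with col_names = FALSE, read_csv names that column X9, and its values come in as character strings because of col_types = cols(.default = "c"). The header row is kept explicitly so that the set_names()/slice() step at the end can still find the column names in row 1:

# Drop-in replacement for the filter(TRUE) line above: keep the header row
# plus the rows for the requested years
filter(X9 == "Year" | X9 %in% as.character(2014:2018)),

Once filtered_data has been assembled and renamed, the all-character columns can be converted to more sensible types with readr::type_convert(), which is loaded as part of the tidyverse:

filtered_data <- type_convert(filtered_data)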