I am currently trying to open a 48 GB CSV on my computer. Needless to say, my RAM cannot handle a file that large, so I am trying to filter it before opening it. From my research, the most suitable approach in R seems to be the sqldf library, more specifically its read.csv.sql function:
df <- read.csv.sql('CIF_FOB_ITIC-en.csv', sql = "SELECT * FROM file WHERE 'Year' IN (2014, 2015, 2016, 2017, 2018)")
However, I get the following message:

Error: duplicate column name: Measure

Since SQL is case-insensitive, having two variables, one named Measure and the other named MEASURE, implies duplicate column names. To work around this, I tried passing the header = FALSE argument and replacing 'Year' with V9.
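In other words, the attempted call was along these lines:

# Same call as before, but with header = FALSE and V9 in place of 'Year'
df <- read.csv.sql('CIF_FOB_ITIC-en.csv', header = FALSE,
                   sql = "SELECT * FROM file WHERE V9 IN (2014, 2015, 2016, 2017, 2018)")

This produced the following error: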
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip): RS_sqlite_import: CIF_FOB_ITIC-en.csv line 2 expected 19 columns of data but found 24
How should I proceed in this case?

Thanks in advance!
Here is a Tidyverse solution that reads the CSV in chunks, filters each chunk, and stacks up the resulting rows. The code also does this in parallel, so the whole file still gets scanned, but much faster (depending on your number of cores) than if the chunks were processed one at a time, as they would be with apply (or purrr::map).

Comments are inline.
library(tidyverse)
library(furrr)
# Make a CSV file out of dplyr's built-in nasa dataset for demo purposes
raw_data_path <- tempfile(fileext = ".csv")
nasa %>% as_tibble() %>% write_csv(raw_data_path)
# Get the row count of the raw data, incl. header row, without loading the
# actual data
raw_data_nrow <- length(count.fields(raw_data_path))
# Hard-code the largest batch size you can, given your RAM in relation to the
# data size per row
batch_size <- 1e3
# Set up parallel processing of multiple chunks at a time, leaving one virtual
# core free, as usual (multisession is the current replacement for the
# deprecated multiprocess plan)
plan(multisession, workers = availableCores() - 1)
filtered_data <-
  # Define the sequence of start-point row numbers for each chunk (each number
  # is actually the start point minus 1 since we're using the seq. no. as the
  # no. of rows to skip)
  seq(from = 0,
      # Add the batch size to ensure that the last chunk is large enough to grab
      # all the remainder rows
      to = raw_data_nrow + batch_size,
      by = batch_size) %>%
  future_map_dfr(
    ~ read_csv(
      raw_data_path,
      skip      = .x,
      n_max     = batch_size,
      # Can't read in col. names in each chunk since they're only present in the
      # 1st chunk
      col_names = FALSE,
      # This reads in each column as character, which is safest but slowest and
      # most memory-intensive. If you're sure that each batch will contain
      # enough values in each column so that the type detection in each batch
      # will come to the same conclusions, then comment this out and leave just
      # the guess_max
      col_types = cols(.default = "c"),
      guess_max = batch_size
    ) %>%
      # This is where you'd insert your filter condition(s)
      filter(TRUE),
    # Progress bar! So you know how many chunks you have left to go
    .progress = TRUE
  ) %>%
  # The first row will be the header values, so set the column names to equal
  # that first row, and then drop it
  set_names(slice(., 1)) %>%
  slice(-1)
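For the question's use case, the filter(TRUE) placeholder would become something along these lines (assuming Year is the 9th column, as the question's V9 suggests; the values are compared as strings because every column is read in as character):

# Hypothetical filter for the question's data: readr names the columns
# X1, X2, ... when col_names = FALSE, so X9 stands in for Year here
filter(X9 %in% c("2014", "2015", "2016", "2017", "2018"))

One caveat: a real filter like this also drops the header row (the string "Year" is not among the kept values), so the set_names(slice(., 1)) trick at the end would then promote a data row to column names. In that case, read the names separately, e.g. names(read_csv(raw_data_path, n_max = 0)), and pass those to set_names. And since every column comes back as character, you can finish with readr::type_convert() to re-establish the column types.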