我没有找到问题的确切解决方案,必须从不同来源收集信息才能找到可行的解决方案。我在这里记录(通过回答)以供参考。
问题陈述
我有两张桌子要加入 R:
以下代码花费了超过 1 分钟,并且经常会抛出“内存不足”错误:
library(arrow)
library(here)
library(DBI)
library(dplyr)
library(duckdb)
PATH_TABLE_1 <- here("/home/data/tbl_1")
PATH_CRM <- here("/home/data/crm")
ds_tbl_1 <-
open_dataset(PATH_TABLE_1, format = "parquet", partitioning="hive", hive_style = FALSE) |>
to_duckdb(con)
ds_crm <-
open_dataset(PATH_CRM, format = "parquet", partitioning="hive", hive_style = FALSE) |>
to_duckdb(con)
ds_tbl_1 |>
head(10) |>
left_join(
ds_crm,
join_by(phone_number == phone_number)
)
DBI::dbDisconnect(con)
注意,我仅使用 tbl_1(head)中的 10 条记录进行测试。 CRM 连接是 1-1,因此结果应该只有 10 条记录。
实际错误:
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_execute: Failed to run query
Error: Out of Memory Error: could not allocate block of size 262KB (215.8GB/215.8GB used)
Database is launched in in-memory mode and no temporary directory is specified.
Unused blocks cannot be offloaded to disk.
Launch the database with a persistent storage back-end
Or set PRAGMA temp_directory='/path/to/tmp.tmp'
Run `rlang::last_trace()` to see where the error occurred.
我没有尝试设置temp_directory,因为加入10条记录所需的时间太长。另外,我不知道如何通过流式传输将结果优雅地写入镶木地板。
解决方案
使用持久性 DuckDB 文件。
library(arrow)
library(here)
library(DBI)
library(dplyr)
library(duckdb)
library(glue)
PATH_TABLE_1 <- here("/home/data/tbl_1")
PATH_CRM <- here("/home/data/crm")
PATH_DUCKDB <- here("/home/data")
con <- DBI::dbConnect(duckdb(), dbdir = here(PATH_DUCKDB, "temp_db.duckdb")) #, read_only = TRUE)
DBI::dbSendQuery(
con,
glue::glue(
"DROP TABLE IF EXISTS tbl_1;
CREATE TABLE tbl_1 AS
SELECT * FROM read_parquet('{here(PATH_TABLE_1, '*/*.parquet')}')
"
)
)
DBI::dbSendQuery(
con,
glue::glue(
"DROP TABLE IF EXISTS crm;
CREATE TABLE crm AS
SELECT * FROM read_parquet('{here(PATH_CRM, '*/*/*.parquet')}')"
)
)
ds_tbl_1 <- tbl(con, "tbl_1")
ds_crm <- tbl(con, "crm")
df_joined <-
ds_tbl_1 |>
left_join(
ds_crm,
join_by(calling_pty == msisdn_hashed)
) |>
compute(name = "tbl_joined", temporary = FALSE)
tbl(con, "tbl_joined") |>
head()
我的个人执行时间是:
我能够对 tbl_joined 执行所需的分析。