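I would like to use dbplyr syntax to do some JOIN / FILTER operations on some tables and store the results back to the database without collecting them first.
From what I have read, compute(..., temporary = FALSE, ...) should do that, but I struggle to provide the fully qualified name (that is, database.schema.table_name) of the table I want to store.
I know about DBI::Id and dbplyr::in_schema, but I do not know how to use them properly. An attempt with sql() at least did what I wanted (it created the table) but resulted in a (spurious?) error.
What do I need to do?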
library(DBI)
library(dplyr)   # provides tbl(), inner_join(), compute() and %>%
library(dbplyr)

con <- dbConnect(odbc::odbc(), "myserver")

## do __not__ collect the data
my_frame <- con %>%
  tbl(Id(catalog = "mydb", schema = "dbo", table = "mytable")) %>%
  inner_join(con %>% tbl(Id(catalog = "mydb", schema = "dbo",
                            table = "yetanothertable")),
             "id")

compute(my_frame,
        # Id(catalog = "mydb", schema = "dbo", table = "mynewtable"), # (1)
        # in_schema("dbo", "mynewtable"),                             # (2)
        sql("mydb.dbo.mynewtable"),                                   # (3)
        FALSE)
Depending on the variant I use, I get a different error:
# (1)
## Error in h(simpleError(msg, call)) :
## error in evaluating the argument 'conn' in selecting a method for function
## 'dbQuoteIdentifier': argument "con" is missing, with no default
# (2)
## Error in escape(x$schema, con = con) :
## argument "con" is missing, with no default
# (3)
## Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]Incorrect syntax near ')'.
## [Microsoft][SQL Server Native Client 11.0][SQL Server]Statement(s) could not be prepared.
## <SQL> 'SELECT *
## FROM (my.fully_qualified.name) "q02"
## WHERE (0 = 1)'
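P.S.: I really want to be able to save the table with a fully qualified name, that is, including the database name (even though it is the same in this simplified example). So dbConnect(..., database = <somedb>) would not solve my problem in the long run.
I am looking for a compute() solution. I know that I can construct the SQL myself, but I am really interested to see whether I can use the dbplyr abstraction layer for that.

Since you rule out writing the SQL yourself in your question, I tested and found a way to do this without writing SQL. We are going to use db_compute() instead of compute().
The sticking point is getting compute() to write a permanent table. db_compute() sits next to db_copy_to(), whose purpose is similar to what we are looking for, so it was worth trying (and it works).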
library(DBI)
library(dplyr)
library(dbplyr)
# connect to database
connection_string = "..."
db_connection = dbConnect(odbc::odbc(), .connection_string = connection_string)
# remote table
remote_table = tbl(db_connection, from = in_schema("schema","table"))
top_rows = remote_table %>%
head()
top_rows %>% show_query()
# <SQL>
# SELECT TOP (6) *
# FROM [database_name].[schema_name].[table_name]
top_rows = top_rows %>%
compute()
# Created a temporary table named: #dbplyr_002
top_rows %>% show_query()
# <SQL>
# SELECT *
# FROM #dbplyr_002
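So we can see that compute() wrote a temporary table. If we were doing some complex processing (instead of just fetching the top few rows), compute() would therefore be an effective way to store the processed table, so that the complex processing is not repeated on every query. But because the table is temporary, it should disappear when we disconnect from the database with DBI::dbDisconnect(db_connection).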
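The difference between a temporary and a permanent table can be reproduced without a SQL Server instance. The following is a minimal sketch, assuming the RSQLite package is installed; the throwaway SQLite file, the toy data and the table names source_table, tmp_rows and perm_rows are made up for illustration and are not part of the setup above.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: a throwaway file-based SQLite database stands in for the real server
db_file <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_file)
dbWriteTable(con, "source_table", data.frame(id = 1:20, value = (1:20) * 2))

# Lazy query: nothing is pulled into R at this point
top_rows <- tbl(con, "source_table") %>%
  filter(value > 10) %>%
  head()

# Store the same query twice: once as a temporary and once as a permanent table
tmp_tbl  <- compute(top_rows, name = "tmp_rows",  temporary = TRUE)
perm_tbl <- compute(top_rows, name = "perm_rows", temporary = FALSE)

# Disconnect and reconnect to the same database file
dbDisconnect(con)
con <- dbConnect(RSQLite::SQLite(), db_file)

# Check which of the two tables is still there after reconnecting
temp_exists <- dbExistsTable(con, "tmp_rows")
perm_exists <- dbExistsTable(con, "perm_rows")
n_perm      <- nrow(dbReadTable(con, "perm_rows"))

dbDisconnect(con)
```

Inspecting temp_exists and perm_exists after the reconnect shows which table outlived the first connection. SQLite has no database.schema.table naming, so this sketch only illustrates the temporary flag, not the fully qualified name.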
Testing db_compute()
out = db_compute(
  con = db_connection,
  table = in_schema("schema", "new_table"),
  sql = sql_render(top_rows),
  temporary = FALSE
)
out
# <IDENT> database_name.schema_name.new_table
# reconnect
new_remote_table = tbl(db_connection, from = in_schema("schema","new_table"))
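So we can now access the new (permanent) table from within R. I also checked with a SQL query and confirmed that the table exists in the database.
Note: because db_compute() is sparsely documented, it is unclear whether it is intended to be used this way. I have tested the code above and it works, but in the absence of further documentation, use it at your own risk.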
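
With compute(), everything works like a charm. You just need to provide the database name as part of the schema and wrap it in sql(), so that it is passed through as literal SQL instead of being quoted as a single identifier: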
compute(my_frame,
        # in_schema("mydb.dbo", "mynewtable") would __not__ work
        in_schema(sql("mydb.dbo"), "mynewtable"),
        FALSE)
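
It turns out that compute() fails on a connection created with pool::dbPool(), whereas the same code works on a connection created with DBI::dbConnect().
Example code showing what works and what fails with pool versus DBI: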
# libraries -----
library(dbplyr)
library(DBI)
library(pool)
library(dplyr)
library(odbc)
# connect using pool and con ----
pool <- pool::dbPool(drv = odbc::odbc(),
                     driver = "{ODBC17}",
                     uid = Sys.getenv("AZURE_UID"),
                     pwd = Sys.getenv("AZURE_PWD"),
                     server = Sys.getenv("AZURE_SERVER"),
                     database = Sys.getenv("AZURE_DATABASE"),
                     Authentication = "ActiveDirectoryPassword",
                     port = 1433,
                     minSize = 0, ## default is 1 which often leaves connections hanging (not cool)
                     idleTimeout = 600, ## 10 minutes until idle connection is closed, takes ~1 second to reopen
                     encoding = "UTF-8",
                     encrypt = "yes")
con <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "{ODBC Driver 17 for SQL Server}",
  Server = Sys.getenv("AZURE_SERVER"),
  Database = Sys.getenv("AZURE_DATABASE"),
  UID = Sys.getenv('AZURE_UID'),
  PWD = Sys.getenv('AZURE_PWD'),
  Port = 1433,
  Authentication = "ActiveDirectoryPassword",
  Encrypt = "yes"
)
# upload iris to synapse ----
#DBI::dbWriteTable(con, DBI::Id(schema = "my_schema", table = "iris"), value = iris, overwrite = TRUE) # slow but works
# POOL -- this doesn't work -----
pool_iris <- tbl(pool, dbplyr::in_schema(sql("my_schema"), "iris"))
pool_query <- pool_iris %>%
  head(10)
compute(pool_query, dbplyr::in_schema(sql("my_schema"), "iris_computed_from_pool"), temporary = FALSE)
# Error in UseMethod("sql_escape_ident") :
# no applicable method for 'sql_escape_ident' applied to an object of class "c('Pool', 'R6')"
# DBCONNECT -- this works (same code) -----
con_iris <- tbl(con, dbplyr::in_schema(sql("my_schema"), "iris"))
con_query <- con_iris %>%
  head(10)
compute(con_query, dbplyr::in_schema(sql("my_schema"), "iris_computed_from_con"), temporary = FALSE)
# Source: table<my_schema."iris_computed_from_con"> [10 x 5]
# # Database: Microsoft SQL Server 12.00.2531[[email protected]@sqlserversqldatabaseXXXX/sqldatabaseXXXX]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <chr>
# 1 4.6 3.1 1.5 0.2 setosa
# 2 5.5 3.5 1.3 0.2 setosa
# 3 4.9 3.6 1.4 0.1 setosa
# 4 6.4 3.1 5.5 1.8 virginica
# 5 4.4 3.2 1.3 0.2 setosa
sessioninfo::session_info()
# ─ Session info ───────────────────────────────────────────────────────────────
# setting value
# version R version 4.0.2 (2020-06-22)
# os OpenShift Enterprise
# system x86_64, linux-gnu
# ui RStudio
# language (EN)
# collate en_CA.UTF-8
# ctype en_CA.UTF-8
# tz America/Toronto
# date 2024-02-06
# rstudio 2023.06.0+421.pro1 Mountain Hydrangea (server)
# pandoc 3.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
#
# ─ Packages ───────────────────────────────────────────────────────────────────
# ! package * version date (UTC) lib source
# P bit 4.0.5 2022-11-15 [?] RSPM (R 4.0.5)
# P bit64 4.0.5 2020-08-30 [?] RSPM (R 4.0.5)
# P blob 1.2.4 2023-03-17 [?] RSPM (R 4.0.5)
# P cli 3.6.0 2023-01-09 [?] RSPM (R 4.0.5)
# P DBI * 1.1.3 2022-06-18 [?] CRAN (R 4.0.5)
# P dbplyr * 2.3.2 2023-03-21 [?] RSPM (R 4.0.5)
# P digest 0.6.31 2022-12-11 [?] RSPM (R 4.0.5)
# P dplyr * 1.1.0 2023-01-29 [?] RSPM (R 4.0.5)
# P evaluate 0.20 2023-01-17 [?] RSPM (R 4.0.5)
# P fansi 1.0.4 2023-01-22 [?] CRAN (R 4.0.2)
# P fastmap 1.1.1 2023-02-24 [?] RSPM (R 4.0.5)
# P generics 0.1.3 2022-07-05 [?] RSPM (R 4.0.5)
# P glue 1.6.2 2022-02-24 [?] RSPM (R 4.0.5)
# P hms 1.1.3 2023-03-21 [?] RSPM (R 4.0.5)
# P htmltools 0.5.7 2023-11-03 [?] RSPM (R 4.0.5)
# P knitr 1.42 2023-01-25 [?] RSPM (R 4.0.5)
# P later 1.3.0 2021-08-18 [?] RSPM (R 4.0.5)
# P lifecycle 1.0.3 2022-10-07 [?] RSPM (R 4.0.5)
# magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.0.5)
# P odbc * 1.4.2 2024-01-22 [?] RSPM (R 4.0.5)
# P pillar 1.8.1 2022-08-19 [?] RSPM (R 4.0.5)
# P pkgconfig 2.0.3 2019-09-22 [?] RSPM (R 4.0.3)
# P pool * 1.0.1 2023-02-21 [?] RSPM (R 4.0.5)
# P purrr 1.0.1 2023-01-10 [?] RSPM (R 4.0.5)
# R6 2.5.1 2021-08-19 [1] RSPM (R 4.0.5)
# P Rcpp 1.0.10 2023-01-22 [?] CRAN (R 4.0.2)
# renv 1.0.3 2023-09-19 [1] RSPM (R 4.0.2)
# P rlang 1.1.0 2023-03-14 [?] RSPM (R 4.0.5)
# P rmarkdown 2.20 2023-01-19 [?] RSPM (R 4.0.5)
# P rstudioapi 0.15.0 2023-07-07 [?] RSPM (R 4.0.5)
# P sessioninfo 1.2.2 2021-12-06 [?] RSPM (R 4.0.5)
# P tibble 3.2.1 2023-03-20 [?] RSPM (R 4.0.5)
# P tidyselect 1.2.0 2022-10-10 [?] RSPM (R 4.0.5)
# P utf8 1.2.3 2023-01-31 [?] RSPM (R 4.0.5)
# P vctrs 0.6.0 2023-03-16 [?] RSPM (R 4.0.5)
# P withr 2.5.0 2022-03-03 [?] RSPM (R 4.0.5)
# P xfun 0.37 2023-01-31 [?] RSPM (R 4.0.5)
# P yaml 2.3.7 2023-01-23 [?] RSPM (R 4.0.5)
#
# [1] /XXXX/renv/library/R-4.0/x86_64-pc-linux-gnu
# [2] /opt/R/4.0.2/lib/R/library
#
# P ── Loaded and on-disk path mismatch.
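
One possible workaround for the pool failure, offered as a sketch rather than a verified fix: borrow a regular DBI connection from the pool with pool::poolCheckout(), run compute() on a tbl built from that connection, and hand it back with pool::poolReturn(). To keep the example self-contained it uses a temporary SQLite file via RSQLite instead of the Azure setup above, which is an assumption; whether it behaves the same with the odbc driver and the package versions listed has not been checked here.

```r
library(DBI)
library(dplyr)
library(dbplyr)
library(pool)

# Assumption: a temporary SQLite file replaces the Azure SQL database
db_file <- tempfile(fileext = ".sqlite")
pool <- dbPool(RSQLite::SQLite(), dbname = db_file)
dbWriteTable(pool, "iris", iris)

# Borrow a plain DBI connection from the pool
con <- poolCheckout(pool)

# Build the lazy query on the checked-out connection, not on the pool object
con_query <- tbl(con, "iris") %>%
  head(10)

# Write the result as a permanent table without collecting it
computed <- compute(con_query, name = "iris_computed_from_pool", temporary = FALSE)

# Give the connection back to the pool
poolReturn(con)

# Verify through the pool that the table was written
table_exists <- dbExistsTable(pool, "iris_computed_from_pool")
n_rows <- dbGetQuery(pool, "SELECT COUNT(*) AS n FROM iris_computed_from_pool")$n

poolClose(pool)
```

On SQL Server the table name would presumably be given as in_schema(sql("my_schema"), "iris_computed_from_pool"), as in the example above; the plain name is used here only because SQLite has no comparable schema.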