This is my original R/sqldf code:
First I create the data:
table_a <- data.frame(name = c('john', 'john', 'john', 'alex', 'alex', 'tim', 'tim', 'joe', 'joe', 'jessica', 'jessica'),
year = c(2010, 2011, 2012, 2020, 2021, 2015, 2016, 2010, 2011, 2000, 2001),
var = c('a', 'a', 'c', 'b', 'c', NA, NA, NA, NA, NA, NA))
table_b <- data.frame(name = c('sara', 'sara', 'tim', 'tim', 'tim', 'jessica'),
year = c(2001, 2002, 2005, 2006, 2021, 2020),
var = c('a', 'b', 'c', 'd', 'f', 'z'))
Next, I run the code:
library(sqldf)
sqldf("WITH min_year AS (
SELECT name
, MIN(year) AS min_year
FROM table_a
GROUP BY name
)
, b_filtered AS (
SELECT b.name
, MAX(b.year) AS max_year
, b.var
FROM table_b AS b
INNER JOIN min_year AS m
ON b.name = m.name
AND b.year < m.min_year
GROUP BY b.name
)
SELECT a.name
, a.year
, CASE WHEN a.var IS NULL AND b.name IS NOT NULL THEN b.var
ELSE a.var
END AS var_mod
FROM table_a AS a
LEFT JOIN b_filtered b
ON a.name = b.name")
Is it possible to combine the data-creation step and the SQL into a single piece of code? For example:
sqldf("WITH
table_a (name, year, var) AS
(
VALUES
('john', 2010, 'a' )
, ('john', 2011, 'a' )
, ('john', 2012, 'c' )
, ('alex', 2020, 'b' )
, ('alex', 2021, 'c' )
, ('tim', 2015, NULL)
, ('tim', 2016, NULL)
, ('joe', 2010, NULL)
, ('joe', 2011, NULL)
, ('jessica', 2000, NULL)
, ('jessica', 2001, NULL)
)
, table_b (name, year, var) AS
(
VALUES
('sara', 2001, 'a')
, ('sara', 2002, 'b')
, ('tim', 2005, 'c')
, ('tim', 2006, 'd')
, ('tim', 2021, 'f')
, ('jessica', 2020, 'z')
)
WITH min_year AS (
SELECT name
, MIN(year) AS min_year
FROM table_a
GROUP BY name
)
, b_filtered AS (
SELECT b.name
, MAX(b.year) AS max_year
, b.var
FROM table_b AS b
INNER JOIN min_year AS m
ON b.name = m.name
AND b.year < m.min_year
GROUP BY b.name
)
SELECT a.name
, a.year
, CASE WHEN a.var IS NULL AND b.name IS NOT NULL THEN b.var
ELSE a.var
END AS var_mod
FROM table_a AS a
LEFT JOIN b_filtered b
ON a.name = b.name")
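Creating the data outside the sqldf statement and then running the sqldf code works fine, but I am wondering whether the two can be combined into a single piece of code.
That would make testing and debugging much easier.
Is this possible?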
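Yes. The problem is that the last sqldf statement in the question has a syntax error: a SQL statement takes only one WITH clause, and every further common table expression is separated from the previous one by a comma. Replace the second WITH in that statement with a comma.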
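For reference, here is a sketch of the question's combined statement with that one change applied. It assumes sqldf is using its default SQLite backend, which accepts a column list plus VALUES in a common table expression. The `result` variable name is only illustrative, and the SQL has been condensed onto fewer lines.

```r
library(sqldf)

# One WITH clause introduces all four CTEs; each later CTE starts with a comma.
result <- sqldf("
WITH table_a (name, year, var) AS (
  VALUES ('john', 2010, 'a'), ('john', 2011, 'a'), ('john', 2012, 'c')
       , ('alex', 2020, 'b'), ('alex', 2021, 'c')
       , ('tim', 2015, NULL), ('tim', 2016, NULL)
       , ('joe', 2010, NULL), ('joe', 2011, NULL)
       , ('jessica', 2000, NULL), ('jessica', 2001, NULL)
)
, table_b (name, year, var) AS (
  VALUES ('sara', 2001, 'a'), ('sara', 2002, 'b')
       , ('tim', 2005, 'c'), ('tim', 2006, 'd'), ('tim', 2021, 'f')
       , ('jessica', 2020, 'z')
)
, min_year AS (          -- was 'WITH min_year AS (' in the question
  SELECT name
       , MIN(year) AS min_year
  FROM table_a
  GROUP BY name
)
, b_filtered AS (
  SELECT b.name
       , MAX(b.year) AS max_year
       , b.var
  FROM table_b AS b
  INNER JOIN min_year AS m
    ON b.name = m.name
   AND b.year < m.min_year
  GROUP BY b.name
)
SELECT a.name
     , a.year
     , CASE WHEN a.var IS NULL AND b.name IS NOT NULL THEN b.var
            ELSE a.var
       END AS var_mod
FROM table_a AS a
LEFT JOIN b_filtered b
  ON a.name = b.name")

print(result)
```

Because the data now lives inside the SQL string, the statement no longer depends on any data frame in the R session, which is what makes it convenient for testing and debugging.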