Adding missing rows in SQL using JOINs


I have this dataset in R:

name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"

my_data = data.frame(name, year, age, gender, country_of_birth, source)

As we can see, some people in this dataset are missing rows for certain years (for example, john goes from 2011 straight to 2014):

   name year age gender country_of_birth   source
1  john 2010  21   male        australia ORIGINAL
2  john 2011  22   male        australia ORIGINAL
3  john 2014  25   male        australia ORIGINAL
4 sarah 2010  55 female               uk ORIGINAL
5 sarah 2015  60 female               uk ORIGINAL
6 peter 2011  61   male           mexico ORIGINAL
7 peter 2012  62   male           mexico ORIGINAL
8 peter 2013  63   male           mexico ORIGINAL
9 peter 2015  65   male           mexico ORIGINAL

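I have this code that adds the missing rows by "interpolating" logical values for them (for example, age increases by 1 each year, country_of_birth stays the same, and so on), and records whether each row is original or was added later: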

library(tidyverse)
library(dplyr)

# R Code to Convert into SQL
final = my_data %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))

# A tibble: 16 x 6
# Groups:   name [3]
   name   year   age gender country_of_birth source      
   <chr> <dbl> <dbl> <chr>  <chr>            <chr>       
 1 john   2010    21 male   australia        ORIGINAL    
 2 john   2011    22 male   australia        ORIGINAL    
 3 john   2012    23 male   australia        NOT ORIGINAL

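My question: I am trying to learn how to convert the code above into (Netezza) SQL.

To get an idea of where to start, I thought I could use the "dbplyr" library in R to translate my "dplyr" code into SQL: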

library(dbplyr)

# attempt 1
remote_df = tbl_lazy(my_data, con = simulate_mysql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 2
remote_df = tbl_lazy(my_data, con = simulate_mssql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 3

 con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

 remote_df <- copy_to(con, my_data)

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))



# attempt 4

memdb_frame(my_data) %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

But all of these attempts give me the same error:

Error in `fill()`:
x `.data` does not have explicit order.
i Please use `arrange()` or `window_order()` to make determinstic.
Run `rlang::last_error()` to see where the error occurred.

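Can someone tell me what I am doing wrong, and what I can do to convert this R code into SQL? I was thinking that maybe I could work out which rows are missing for which person, create those rows, and then somehow use JOINs to bring them back into the original dataset in SQL.

Thanks!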

sql r join data-manipulation
1 Answer

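In SQL, you create a "Cartesian product" of rows with a cross join. I believe the equivalent in R is merge(). For the missing years, SQL needs a table or result set to join against, but in R you should be able to use a sequence.

Combining a sequence of years with the merge() function: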

library(tidyverse)
library(dplyr)

# Create a data frame with the sequence of years
years_df <- data.frame(year = seq(2010, 2023))

# Perform a cross join with the original data
final <- merge(my_data, years_df, all = TRUE) %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))

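The above is untested! One caveat: because both data frames contain a year column, merge() with all = TRUE joins on year (a full outer join) rather than producing a Cartesian product. merge() returns a Cartesian product only when by is empty, for example by = NULL.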

SQL code:

CREATE TABLE mytable (
    name VARCHAR(255),
    year INTEGER,
    age INTEGER,
    gender VARCHAR(255),
    country_of_birth VARCHAR(255),
    source VARCHAR(255)
);

INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2010, 21, 'male', 'australia', 'ORIGINAL');
INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2011, 22, 'male', 'australia', 'ORIGINAL');
-- etc.

Example cross join query:

WITH RECURSIVE years (year) AS (
    SELECT 2010
    UNION ALL
    SELECT year + 1
    FROM years
    WHERE year < 2023
)
SELECT 
    t.name, years.year, t.age, t.gender, t.country_of_birth, t.source
FROM mytable AS t
CROSS JOIN years;
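
The query above returns every row paired with every year, so it still needs to be narrowed to each person's own year range and joined back to the original rows. Below is a minimal sketch of one way that could look. It is not tested on Netezza: the SQL is run through Python's sqlite3 module only so that the example is self-contained. The people CTE and its column names (first_year, last_year, first_age) are hypothetical helpers, and the sketch assumes that gender and country_of_birth are constant per person and that a person's lowest age falls in their first year.

```python
import sqlite3

# In-memory SQLite database standing in for the real (Netezza) server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (
    name VARCHAR(255),
    year INTEGER,
    age INTEGER,
    gender VARCHAR(255),
    country_of_birth VARCHAR(255),
    source VARCHAR(255)
);
INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES
    ('john',  2010, 21, 'male',   'australia', 'ORIGINAL'),
    ('john',  2011, 22, 'male',   'australia', 'ORIGINAL'),
    ('john',  2014, 25, 'male',   'australia', 'ORIGINAL'),
    ('sarah', 2010, 55, 'female', 'uk',        'ORIGINAL'),
    ('sarah', 2015, 60, 'female', 'uk',        'ORIGINAL'),
    ('peter', 2011, 61, 'male',   'mexico',    'ORIGINAL'),
    ('peter', 2012, 62, 'male',   'mexico',    'ORIGINAL'),
    ('peter', 2013, 63, 'male',   'mexico',    'ORIGINAL'),
    ('peter', 2015, 65, 'male',   'mexico',    'ORIGINAL');
""")

GAP_FILL_SQL = """
WITH RECURSIVE years (year) AS (
    -- Same calendar of years as in the cross join example.
    SELECT 2010
    UNION ALL
    SELECT year + 1 FROM years WHERE year < 2023
),
people AS (
    -- One row per person: their year range plus the values to carry forward.
    -- Assumption: gender and country_of_birth never change for a person,
    -- and the lowest age belongs to the first year.
    SELECT name,
           MIN(year)             AS first_year,
           MAX(year)             AS last_year,
           MIN(age)              AS first_age,
           MIN(gender)           AS gender,
           MIN(country_of_birth) AS country_of_birth
    FROM mytable
    GROUP BY name
)
SELECT p.name,
       y.year,
       -- Keep the recorded age; otherwise add one year of age per calendar year.
       COALESCE(t.age, p.first_age + (y.year - p.first_year)) AS age,
       p.gender,
       p.country_of_birth,
       -- Rows with no match in mytable are the ones being added.
       COALESCE(t.source, 'NOT ORIGINAL') AS source
FROM people AS p
CROSS JOIN years AS y
LEFT JOIN mytable AS t
       ON t.name = p.name AND t.year = y.year
WHERE y.year BETWEEN p.first_year AND p.last_year
ORDER BY p.name, y.year;
"""

rows = conn.execute(GAP_FILL_SQL).fetchall()
for row in rows:
    print(row)
```

On the sample data this yields 16 rows, the same count as the tibble produced by the R code in the question. The CROSS JOIN generates every (person, year) pair, the WHERE clause trims it to each person's first-to-last year, and the LEFT JOIN reveals which pairs were absent from the original table. If the target database does not support WITH RECURSIVE, a small permanent table of years could stand in for the years CTE.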