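As input for a tree model, I created an analysis table in SQL. Now I want to move it to R, because the model that takes this table as input also runs in R. There is one SQL step that I cannot translate into R.
The analysis table has the following form: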
df <- data.frame(
pseudonym = c("a", "a", "a", "b", "c", "c"),
var1 = c(1,1,0,1,1,0),
var2 = c(1,0,0,0,0,1),
var3 = c(0,0,0,0,0,1))
> df
pseudonym var1 var2 var3
1 a 1 1 0
2 a 1 0 0
3 a 0 0 0
4 b 1 0 0
5 c 1 0 0
6 c 0 1 1
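In the next step, I need one distinct row per pseudonym while keeping the information (the 1s) from the other columns var1, var2, var3. (In SQL, this is done with max(case when ... then 1 else 0 end) as var1.)
So the result df2, created from df, should be: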
df2 <- data.frame(
pseudonym = c("a", "b", "c"),
var1 = c(1,1,1),
var2 = c(1,0,1),
var3 = c(0,0,1))
> df2
pseudonym var1 var2 var3
1 a 1 1 0
2 b 1 0 0
3 c 1 1 1
If anyone has an idea, that would be very helpful.
Here is one way:
library(dplyr)
library(tidyr)
df <- data.frame(
pseudonym = c("a", "a", "a", "b", "c", "c"),
var1 = c(1,1,0,1,1,0),
var2 = c(1,0,0,0,0,1),
var3 = c(0,0,0,0,0,1))
df %>%
pivot_longer(cols = var1:var3) %>%
group_by(pseudonym, name) %>%
filter(max(value) == value) %>%
ungroup() %>%
distinct() %>%
pivot_wider(names_from = name, values_from = value)
#># A tibble: 3 x 4
#> pseudonym var1 var2 var3
#> <fct> <dbl> <dbl> <dbl>
#>1 a 1 1 0
#>2 b 1 0 0
#>3 c 1 1 1
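The SQL pattern in the question (a max per group) can also be mirrored more directly, without reshaping. The following is only a minimal sketch, assuming dplyr 1.0.0 or later is installed, since `across()` and the `.groups` argument are not available in older versions:

```r
library(dplyr)

df <- data.frame(
  pseudonym = c("a", "a", "a", "b", "c", "c"),
  var1 = c(1, 1, 0, 1, 1, 0),
  var2 = c(1, 0, 0, 0, 0, 1),
  var3 = c(0, 0, 0, 0, 0, 1))

# One row per pseudonym; each flag column keeps its group maximum,
# so a 1 anywhere within the group survives in the result.
df2 <- df %>%
  group_by(pseudonym) %>%
  summarise(across(var1:var3, max), .groups = "drop")

df2
```

Because the columns only contain 0 and 1, taking the maximum per group is equivalent to asking whether any row in the group has a 1.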
Another dplyr approach, perhaps not very sophisticated, but it works:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(
pseudonym = c("a", "a", "a", "b", "c", "c"),
var1 = c(1,1,0,1,1,0),
var2 = c(1,0,0,0,0,1),
var3 = c(0,0,0,0,0,1)); df
#> pseudonym var1 var2 var3
#> 1 a 1 1 0
#> 2 a 1 0 0
#> 3 a 0 0 0
#> 4 b 1 0 0
#> 5 c 1 0 0
#> 6 c 0 1 1
df2 <- df %>% group_by(pseudonym) %>% mutate(var1 = case_when(1 %in% var1 ~ 1),
var2 = case_when(1 %in% var2 ~ 1),
var3 = case_when(1 %in% var3 ~ 1)) %>%
unique() %>%
ungroup() #creates "NA" in place of "0"
df2[is.na(df2)] <- 0; df2 #"NA"s are taken care of
#> # A tibble: 3 x 4
#> pseudonym var1 var2 var3
#> <fct> <dbl> <dbl> <dbl>
#> 1 a 1 1 0
#> 2 b 1 0 0
#> 3 c 1 1 1
Created on 2020-04-21 by the reprex package (v0.3.0)
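If an extra package is not wanted, the same per-group maximum can probably be obtained in base R as well. This is a sketch using `aggregate()` from the stats package with the formula interface; the `. ~ pseudonym` formula assumes that every column other than `pseudonym` is a 0/1 flag that should be aggregated:

```r
df <- data.frame(
  pseudonym = c("a", "a", "a", "b", "c", "c"),
  var1 = c(1, 1, 0, 1, 1, 0),
  var2 = c(1, 0, 0, 0, 0, 1),
  var3 = c(0, 0, 0, 0, 0, 1))

# Group by pseudonym and take the maximum of every remaining column.
# The "." on the left-hand side stands for all columns not named on the right.
df2 <- aggregate(. ~ pseudonym, data = df, FUN = max)

df2
```

Note that the formula interface of `aggregate()` drops rows containing NA by default, so this sketch assumes the flag columns are complete, as they are in the example data.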