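I need to combine some columns across rows that share an ID, and for the remaining columns I can simply use the values from the first row listed for that ID. For example, here I want to sum the "spending" column and collapse the heartattack column into a flag saying whether the patient ever had a heart attack. Then I want to drop the duplicate ID#s and keep the values from the first row for the other columns: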
df <- read.table(text =
"ID Age Gender heartattack spending
1 24 f 0 140
2 24 m na 123
2 24 m 1 58
2 24 m 0 na
3 85 f 1 170
4 45 m na 204", header=TRUE)
What I need:
df2 <- read.table(text =
"ID Age Gender ever_heartattack all_spending
1 24 f 0 140
2 24 m 1 181
3 85 f 1 170
4 45 m na 204", header=TRUE)
I tried group_by with transmute() and sum(), as follows:
df$heartattack = as.numeric(as.character(df$heartattack))
df$spending = as.numeric(as.character(df$spending))
library(dplyr)
df = df %>% group_by(ID) %>% transmute(ever_heartattack = sum(heartattack, na.rm = T), all_spending = sum(spending, na.rm=T))
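But this drops all the other columns! It also turns NA values into zeros... For example, I still want "NA" to be the value for patient ID#4; I don't want to change the data to say they never had a heart attack!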
> print(dfa) #This doesn't at all match df2 :(
ID ever_heartattack all_spending
1 1 0 140
2 2 1 181
3 2 1 181
4 2 1 181
5 3 1 170
6 4 0 204
Could you do something like this?
aggregate(
spending ~ ID + Age + Gender,
data = transform(df, spending = as.numeric(as.character(spending))),
FUN = sum)
# ID Age Gender spending
#1 1 24 f 140
#2 3 85 f 170
#3 2 24 m 181
#4 4 45 m 204
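One thing to keep in mind with the formula interface of aggregate(): its default na.action = na.omit drops rows with a missing value before grouping. That is harmless for the sum above, but an ID whose spending values are all missing would disappear from the result. A minimal sketch, using a made-up toy data frame (toy, g and x are illustrative names, not part of the question's data):

```r
# Hypothetical toy data: group "b" has only missing values
toy <- data.frame(g = c("a", "a", "b"), x = c(1, NA, NA))

# Default na.action = na.omit removes the NA rows before grouping,
# so group "b" does not appear in the result at all
res_default <- aggregate(x ~ g, data = toy, FUN = sum)

# na.action = na.pass keeps the rows; na.rm = TRUE is passed on to sum(),
# so group "b" is kept and its sum over no values is 0
res_pass <- aggregate(x ~ g, data = toy, FUN = sum,
                      na.rm = TRUE, na.action = na.pass)
```

Which behaviour is preferable depends on whether an all-missing group should be dropped or reported.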
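Some comments:
1. It is not clear how you want to summarise heartattack. For example, for ID = 2, why do you keep heartattack = 1 rather than heartattack = na or heartattack = 0?
2. Your "na"s are not real NAs. As a result, spending is read in as a factor column (a character column in R 4.0 and later, where stringsAsFactors defaults to FALSE) rather than a numeric column.
To exactly reproduce your expected output, we can do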
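library(dplyr)

df %>%
    # as.numeric() turns the "na" strings into real NAs (with a coercion warning)
    mutate(
        heartattack = as.numeric(as.character(heartattack)),
        spending = as.numeric(as.character(spending))) %>%
    group_by(ID, Age, Gender) %>%
    summarise(
        # NA if the group has no 0/1 value at all, otherwise the maximum
        heartattack = ifelse(
            any(heartattack %in% c(0, 1)),
            max(heartattack, na.rm = TRUE),
            NA),
        spending = sum(spending, na.rm = TRUE))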
## A tibble: 4 x 5
## Groups: ID, Age [?]
# ID Age Gender heartattack spending
# <int> <int> <fct> <dbl> <dbl>
#1 1 24 f 0 140
#2 2 24 m 1 181
#3 3 85 f 1 170
#4 4 45 m NA 204
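This feels a bit "hacky" because the rules for which heartattack value to keep are not clear. In this case we
1. keep the maximum value of heartattack if heartattack contains 0 or 1;
2. return NA if heartattack contains neither 0 nor 1.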
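The rule described above can also be sketched with a small helper, which may read a little less "hacky". This is only one possible variant, under a few assumptions: the data is re-read with na.strings = "na" so that "na" becomes a real NA on import, max_or_na is a hypothetical helper name, the output columns are named as in the desired df2, and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Declare "na" as the missing-value marker so that heartattack and
# spending are read as numeric columns straight away
df <- read.table(text =
"ID Age Gender heartattack spending
1 24 f 0 140
2 24 m na 123
2 24 m 1 58
2 24 m 0 na
3 85 f 1 170
4 45 m na 204", header = TRUE, na.strings = "na")

# Hypothetical helper: NA if every value is missing, otherwise the maximum
max_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else as.numeric(max(x, na.rm = TRUE))
}

res <- df %>%
  group_by(ID, Age, Gender) %>%
  summarise(
    ever_heartattack = max_or_na(heartattack),
    all_spending = sum(spending, na.rm = TRUE),
    .groups = "drop")
```

With this version the as.numeric(as.character(...)) conversions are no longer needed, and patient ID#4 keeps NA for ever_heartattack instead of being turned into 0.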