I have data like this (obviously simplified):
Var1 Var2 Var3
20 0.4 a
50 0.5 a
80 0.6 b
150 0.3 a
250 0.4 b
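I would like to group the rows by Var1 whenever they fall within an interval of 50 of each other, then take the mean of Var1 and Var2 for each group. Var3 should be kept as is if it is homogeneous within the group, or renamed if the group has mixed labels. In this case I would get: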
Var1 Var2 Var3
50 0.5 mixed
150 0.3 a
250 0.4 b
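I guess I should use the group_by function from the dplyr package, but I don't know how to go about it. Thanks for your help!
Here is the data frame as dput output: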
d <- structure(list(Var1 = c(20L, 50L, 80L, 150L, 250L), Var2 = c(0.4,
0.5, 0.6, 0.3, 0.4), Var3 = structure(c(1L, 1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
In the tidyverse, I think this could look like:
library(dplyr)

d %>%
  # make sure the rows are sorted by Var1
  arrange(Var1) %>%
  # add 50 to Var1 and compare the result against the next row:
  # if the next value exceeds the current one by more than 50, it starts a new group
  mutate(nextint = Var1 + 50,
         newgroup = Var1 > lag(nextint, default = -Inf),
         grp = cumsum(newgroup)) %>%
  # for each group, get the mean and a comma-separated list of distinct Var3 values
  group_by(grp) %>%
  summarise(
    grplbl = floor(max(Var1) / 50) * 50,
    mu = mean(Var2),
    mix = paste(unique(Var3), collapse = ",")) %>%
  # if mix (distinct Var3) has a comma in it, change from e.g. 'a,b' to 'mixed'
  mutate(mix = ifelse(grepl(',', mix), 'mixed', mix))
# A tibble: 3 x 4
grp grplbl mu mix
<int> <dbl> <dbl> <chr>
1 1 50 0.5 mixed
2 2 150 0.3 a
3 3 250 0.4 b
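To make the grouping step easier to follow, here is a minimal base R sketch of just the group id construction, using the Var1 values from the question. The names var1, gap, new_group and grp are only illustrative, and the sketch assumes the values are already sorted.

```r
# Var1 values from the question, already sorted in increasing order
var1 <- c(20, 50, 80, 150, 250)

# Gap between each value and the previous one; the first value has no
# predecessor, so it gets an infinite gap and always starts a group
gap <- c(Inf, diff(var1))

# A gap larger than 50 marks the start of a new group
new_group <- gap > 50

# The running count of group starts gives one group id per row
grp <- cumsum(new_group)

print(data.frame(var1, gap, new_group, grp))
```

Note that this chains neighbours: 20, 50 and 80 end up in one group because each is within 50 of the previous value, even though 20 and 80 are 60 apart.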
Another dplyr possibility could be:
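d %>%
  # start a new group whenever Var1 jumps by more than 50 (assumes d is sorted by Var1)
  group_by(grp = cumsum(Var1 - lag(Var1, default = first(Var1)) > 50)) %>%
  summarise(Var1 = mean(Var1),
            Var2 = mean(Var2),
            # as.character() keeps the label text when Var3 is a factor, as in the dput data
            Var3 = if (n_distinct(Var3) > 1) "mixed" else as.character(first(Var3))) %>%
  ungroup() %>%
  select(-grp)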
Var1 Var2 Var3
<dbl> <dbl> <chr>
1 50 0.5 mixed
2 150 0.3 a
3 250 0.4 b
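For comparison, the same idea can be sketched without dplyr. This is only one possible base R version, under the same assumption that "within an interval of 50" means a gap of at most 50 to the previous row. The data frame is rebuilt here with Var3 as a character column so the block runs on its own, and res and labels are illustrative names.

```r
# Rebuild the example data (Var3 as character rather than factor)
d <- data.frame(
  Var1 = c(20, 50, 80, 150, 250),
  Var2 = c(0.4, 0.5, 0.6, 0.3, 0.4),
  Var3 = c("a", "a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Sort, then start a new group whenever the gap to the previous row exceeds 50
d <- d[order(d$Var1), ]
grp <- cumsum(c(TRUE, diff(d$Var1) > 50))

# Summarise each group: means for the numeric columns,
# the single label or "mixed" for Var3
res <- do.call(rbind, lapply(split(d, grp), function(g) {
  labels <- unique(as.character(g$Var3))
  data.frame(
    Var1 = mean(g$Var1),
    Var2 = mean(g$Var2),
    Var3 = if (length(labels) > 1) "mixed" else labels,
    stringsAsFactors = FALSE
  )
}))
rownames(res) <- NULL
print(res)
```

The as.character() call means the summary step also works if Var3 is a factor, as it is in the dput output above.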