How can I group data into undetermined intervals using R and dplyr?


I have data like this (obviously simplified):

Var1 Var2 Var3
20   0.4  a
50   0.5  a
80   0.6  b
150  0.3  a
250  0.4  b

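I want to group rows by Var1 when they fall within an interval of 50 of each other, then take the mean of Var1 and Var2 for each group. Var3 should be kept as is when it is homogeneous within the group, or renamed to "mixed" when the group has mixed labels. For the data above, I would get: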

Var1 Var2 Var3
50   0.5  mixed
150  0.3  a
250  0.4  b

I guess I should use the group_by function from the dplyr package, but I don't know how to do it. Thanks for your help!

r dplyr tidy
2 Answers
0
votes

Here is the data frame, as produced by dput:

d <- structure(list(Var1 = c(20L, 50L, 80L, 150L, 250L), Var2 = c(0.4, 
0.5, 0.6, 0.3, 0.4), Var3 = structure(c(1L, 1L, 2L, 1L, 2L), .Label = c("a", 
"b"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

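I would

  1. Create some temporary columns to determine when a new group begins
  2. Group and compute the means, also keeping track of the distinct values of Var3
  3. Change Var3 to "mixed" if there are multiple Var3 values in the group

In the tidyverse this might look like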

library(dplyr)

d %>% 
 # make sure we sort Var1
 arrange(Var1) %>% 
 # increment var1 by 50 and test that against the next row
 # if the next value exceeds current by 50, we mark it as a new group
 mutate(nextint=Var1+50, 
       newgroup=Var1>lag(nextint,default=-Inf), 
       grp=cumsum(newgroup)) %>%
 # for each group, get the mean and a comma separated list of distinct Var3 values
 group_by(grp) %>% 
 summarise(
           grplbl=floor(max(Var1)/50)*50,
           mu=mean(Var2), 
           mix=paste(collapse=",",unique(Var3))) %>%
 # if mix (distinct Var3) has a comma in it, change from e.g. 'a,b' to 'mix'
 mutate(mix=ifelse(grepl(',', mix), 'mixed', mix))
# A tibble: 3 x 4
    grp grplbl    mu mix  
  <int>  <dbl> <dbl> <chr>
1     1     50   0.5 mixed
2     2    150   0.3 a    
3     3    250   0.4 b  
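
To see why the rows split into three groups, it can help to look at the temporary columns before summarising. This is only a minimal sketch: it rebuilds the same data with data.frame() so that it runs on its own, and the name `steps` is made up for illustration.

```r
library(dplyr)

# Same values as the dput above, rebuilt so this block is self-contained
d <- data.frame(Var1 = c(20L, 50L, 80L, 150L, 250L),
                Var2 = c(0.4, 0.5, 0.6, 0.3, 0.4),
                Var3 = factor(c("a", "a", "b", "a", "b")))

# `steps` is a hypothetical name; it keeps the helper columns visible
steps <- d %>%
  arrange(Var1) %>%
  mutate(nextint  = Var1 + 50,                            # largest Var1 still reachable from this row
         newgroup = Var1 > lag(nextint, default = -Inf),  # TRUE when the gap to the previous row is more than 50
         grp      = cumsum(newgroup))                     # running count of group starts

steps$newgroup
# TRUE FALSE FALSE TRUE TRUE
steps$grp
# 1 1 1 2 3
```

Note that rows are chained: 20, 50 and 80 land in the same group because each consecutive step is at most 50, even though 20 and 80 are 60 apart.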

0
votes

Another dplyr possibility could be:

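library(dplyr)

# df is the data frame from the question, with Var1 already sorted
df %>%
 # start a new group whenever the gap to the previous Var1 exceeds 50
 group_by(grp = cumsum(Var1 - lag(Var1, default = first(Var1)) > 50)) %>%
 summarise(Var1 = mean(Var1),
           Var2 = mean(Var2),
           # as.character() keeps this working when Var3 is a factor
           Var3 = if (n_distinct(Var3) > 1) "mixed" else as.character(first(Var3))) %>%
 ungroup() %>%
 select(-grp)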

   Var1  Var2 Var3 
  <dbl> <dbl> <chr>
1    50   0.5 mixed
2   150   0.3 a    
3   250   0.4 b  
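
If the interval width needs to vary, the gap-based grouping could be wrapped in a small function. The sketch below is only an illustration: `summarise_by_gap` and its `gap` argument are hypothetical names, not part of dplyr, and the data is rebuilt from the question so the block runs on its own. It sorts by Var1 first and uses `if`/`else` with `as.character()` so that Var3 may be either a factor or a character column.

```r
library(dplyr)

# Data from the question, rebuilt so this block is self-contained
d <- data.frame(Var1 = c(20L, 50L, 80L, 150L, 250L),
                Var2 = c(0.4, 0.5, 0.6, 0.3, 0.4),
                Var3 = factor(c("a", "a", "b", "a", "b")))

# Hypothetical helper: starts a new group whenever the distance to the
# previous Var1 value is greater than `gap`
summarise_by_gap <- function(data, gap = 50) {
  data %>%
    arrange(Var1) %>%
    group_by(grp = cumsum(Var1 - lag(Var1, default = first(Var1)) > gap)) %>%
    summarise(Var1 = mean(Var1),
              Var2 = mean(Var2),
              Var3 = if (n_distinct(Var3) > 1) "mixed" else as.character(first(Var3))) %>%
    select(-grp)
}

res50  <- summarise_by_gap(d, gap = 50)   # three groups, as in the question
res100 <- summarise_by_gap(d, gap = 100)  # no gap exceeds 100, so a single group
```

With `gap = 100` every row falls into one group, so the result is a single "mixed" row whose Var1 is the overall mean, 110.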