Identify the first instance of a repeated value within a group

Question  Votes: 2  Answers: 1

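I have the following data frame with three columns: FIRM, YEAR, and DUMMY (0 or 1). For each FIRM, I want to scan all years and identify the first case where the DUMMY value 1 is repeated, that is, where it appears in two or more consecutive rows. I then want to create a new column that contains 0 for all years in that run of 1s, counts -1, -2, -3 for the years before it, and counts 1, 2, 3 for the years after it. The NEW_COL column below shows the output I expect.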

------------------------------
| FIRM | YEAR | DUMMY | NEW_COL |
------------------------------
|  A   | 2006 |   0  |   0   |
------------------------------
|  A   | 2007 |   1  |   0   |
------------------------------
|  A   | 2008 |   0  |   0   |
------------------------------
|  B   | 2006 |   0  |   0   |
------------------------------
|  B   | 2007 |   0  |  -1   |
------------------------------
|  B   | 2008 |   1  |   0   |
------------------------------
|  B   | 2009 |   1  |   0   |
------------------------------
|  B   | 2010 |   0  |   1   |
------------------------------
|  B   | 2011 |   0  |   2   |
------------------------------
|  B   | 2012 |   1  |   3   |
------------------------------
|  B   | 2013 |   1  |   4   |
------------------------------
r dataframe sequence repeat
1 Answer
1 vote

A data.table solution.

Based on your description, I think the value for firm B in 2006 should be -2 rather than 0.
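
The arithmetic behind that correction can be checked directly. This is a minimal base R sketch; `years` and `run_start` are names chosen here for illustration, with values read off the example table (firm B's first run of consecutive 1s starts in 2008).

```r
# YEAR values for firm B and the first year of its first run of consecutive 1s
# (both read off the example table; the variable names are illustrative)
years <- 2006:2013
run_start <- 2008

# Offsets for the years before the run: each year minus the run's first year
offsets_before <- years[years < run_start] - run_start
print(offsets_before)  # -2 -1
```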

library(data.table)

dt <- fread(' FIRM  YEAR  DUMMY NEW_COL
 A  2006  0  0 
 A  2007  1  0 
 A  2008  0  0 
 B  2006  0  0 
 B  2007  0  -1 
 B  2008  1  0 
 B  2009  1  0 
 B  2010  0  1 
 B  2011  0  2 
 B  2012  1  3 
 B  2013  1  4 ')


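# Group by firm and by run of identical DUMMY values; flag rows that belong to
# a run of more than one consecutive 1, and store the group id of that run
dt[,c("flag","grp"):=.((.N>1) & (DUMMY==1),
                       .GRP),by=.(FIRM,rleid(DUMMY))]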
dt
#>     FIRM YEAR DUMMY NEW_COL  flag grp
#>  1:    A 2006     0       0 FALSE   1
#>  2:    A 2007     1       0 FALSE   2
#>  3:    A 2008     0       0 FALSE   3
#>  4:    B 2006     0       0 FALSE   4
#>  5:    B 2007     0      -1 FALSE   4
#>  6:    B 2008     1       0  TRUE   5
#>  7:    B 2009     1       0  TRUE   5
#>  8:    B 2010     0       1 FALSE   6
#>  9:    B 2011     0       2 FALSE   6
#> 10:    B 2012     1       3  TRUE   7
#> 11:    B 2013     1       4  TRUE   7
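
To see what the run grouping in the step above does, here is a small base R sketch that uses `rle()` in place of data.table's `rleid()`. It is an illustration on firm B's DUMMY values only, not part of the solution; `run_id` and `run_len` are names introduced here and play roughly the role of `rleid(DUMMY)` and `.N`.

```r
# DUMMY values for firm B in year order (copied from the example data)
dummy <- c(0, 0, 1, 1, 0, 0, 1, 1)

runs <- rle(dummy)

# One id per run of identical values, repeated for every row of that run
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Length of the run that each row belongs to
run_len <- rep(runs$lengths, runs$lengths)

# A row is flagged when it sits in a run of more than one consecutive 1
flag <- run_len > 1 & dummy == 1

print(run_id)  # 1 1 2 2 3 3 4 4
print(flag)    # FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
```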

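# Within each firm, mark the first flagged run with 0 and any later flagged runs with the placeholder 99
dt[flag==TRUE,result:=fifelse(grp==min(grp),0,99),by=.(FIRM)]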
dt
#>     FIRM YEAR DUMMY NEW_COL  flag grp result
#>  1:    A 2006     0       0 FALSE   1     NA
#>  2:    A 2007     1       0 FALSE   2     NA
#>  3:    A 2008     0       0 FALSE   3     NA
#>  4:    B 2006     0       0 FALSE   4     NA
#>  5:    B 2007     0      -1 FALSE   4     NA
#>  6:    B 2008     1       0  TRUE   5      0
#>  7:    B 2009     1       0  TRUE   5      0
#>  8:    B 2010     0       1 FALSE   6     NA
#>  9:    B 2011     0       2 FALSE   6     NA
#> 10:    B 2012     1       3  TRUE   7     99
#> 11:    B 2013     1       4  TRUE   7     99



# For each firm, count years relative to the first flagged run: negative before
# its first row, positive after its last row; firms without such a run get 0
dt[,result:=lapply(.SD,function(x){
  if (any(x==0,na.rm=TRUE)){
    position_0_head <- head(which(x==0),1)
    position_0_tail <- tail(which(x==0),1)
    x[1:position_0_head] <- 0 - (YEAR[position_0_head]-YEAR[1:position_0_head])
    x[position_0_tail:length(x)] <- 0 + (YEAR[position_0_tail:length(x)]-YEAR[position_0_tail])
  } else{
    x <- 0
  }
  x
}),.SDcols="result",by=.(FIRM)]

dt[,.SD,.SDcols = !c("flag","grp")]
#>     FIRM YEAR DUMMY NEW_COL result
#>  1:    A 2006     0       0      0
#>  2:    A 2007     1       0      0
#>  3:    A 2008     0       0      0
#>  4:    B 2006     0       0     -2
#>  5:    B 2007     0      -1     -1
#>  6:    B 2008     1       0      0
#>  7:    B 2009     1       0      0
#>  8:    B 2010     0       1      1
#>  9:    B 2011     0       2      2
#> 10:    B 2012     1       3      3
#> 11:    B 2013     1       4      4
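
For comparison, the same result can probably be reached without data.table. The following is a rough base R sketch, not the method used above: `event_time` is a hypothetical helper name, and it assumes each firm's rows are already sorted by YEAR and that DUMMY contains no missing values.

```r
# Hypothetical helper: years relative to the first run of consecutive 1s.
# Assumes `year` and `dummy` belong to one firm and are sorted by year.
event_time <- function(year, dummy) {
  runs <- rle(dummy)
  hit <- which(runs$values == 1 & runs$lengths > 1)
  if (length(hit) == 0) {
    return(rep(0, length(dummy)))       # no repeated 1s: all zeros
  }
  ends <- cumsum(runs$lengths)
  last <- ends[hit[1]]                  # last row of the first qualifying run
  first <- last - runs$lengths[hit[1]] + 1
  out <- numeric(length(dummy))         # rows inside the run stay 0
  before <- seq_len(first - 1)
  after <- seq_len(length(dummy) - last) + last
  out[before] <- year[before] - year[first]
  out[after] <- year[after] - year[last]
  out
}

# Example data copied from the question (without NEW_COL)
df <- data.frame(
  FIRM = c(rep("A", 3), rep("B", 8)),
  YEAR = c(2006:2008, 2006:2013),
  DUMMY = c(0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1)
)

# Apply the helper firm by firm
df$result <- NA_real_
for (f in unique(df$FIRM)) {
  idx <- df$FIRM == f
  df$result[idx] <- event_time(df$YEAR[idx], df$DUMMY[idx])
}
print(df)
```

On the example data this gives 0, 0, 0 for firm A and -2, -1, 0, 0, 1, 2, 3, 4 for firm B, matching the `result` column above.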

Created on 2020-04-25 by the reprex package (v0.3.0)
