For each individual id, filter rows with consecutive values in the year column

Question

I want to create a balanced panel dataset from the data frame below:

id  program_year  value
1     2007         1
1     2008         1
1     2009         1
1     2010         1
1     2011         1
1     2012         1
1     2013         0
2     2007         0
2     2008         1
2     2009         1
2     2010         1
2     2011         1  
2     2012         1
2     2013         1
3     2007         1
3     2008         0
3     2009         1
3     2010         1
3     2011         1
3     2012         1
3     2013         1

For each id, I want to select the rows that have consecutive program_year values where value == 1.

The expected output should look like this:

id  program_year  value
1     2007         1
1     2008         1
1     2009         1
1     2010         1
1     2011         1
2     2008         1
2     2009         1
2     2010         1
2     2011         1  
2     2012         1
3     2009         1
3     2010         1
3     2011         1
3     2012         1
3     2013         1

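I experimented with lead() and lag(), but without success. Once I have the desired output, the next step is to index the years so that the data frame becomes a balanced panel.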

Answer 1

Not sure if this is what you need:

Data <- "id  program_year  value
1     2007         1
1     2008         1
1     2009         1
1     2010         1
1     2011         1
1     2012         1
1     2013         0
2     2007         0
2     2008         1
2     2009         1
2     2010         1
2     2011         1  
2     2012         1
2     2013         1
3     2007         1
3     2008         0
3     2009         1
3     2010         1
3     2011         1
3     2012         1
3     2013         1"

DF <- read.table(text = Data, header = TRUE)


library(dplyr)

DF %>%
  arrange(id, program_year) %>%
  group_by(id) %>%
  filter((program_year - lag(program_year)) >= 1) %>% 
  mutate(consecutive = program_year - row_number()) %>%
  group_by(id, consecutive) %>%
  filter(n() >= 5) %>%
  slice_head(n = 5) %>%
  ungroup() %>%
  filter(value == 1) %>%
  select(id, program_year, value)

This returns the following:

# A tibble: 14 × 3
      id program_year value
   <int>        <int> <int>
 1     1         2008     1
 2     1         2009     1
 3     1         2010     1
 4     1         2011     1
 5     1         2012     1
 6     2         2008     1
 7     2         2009     1
 8     2         2010     1
 9     2         2011     1
10     2         2012     1
11     3         2009     1
12     3         2010     1
13     3         2011     1
14     3         2012     1

Edited the answer to satisfy your condition where value == 1.

Answer 2

Here is a solution using dplyr::consecutive_id() and per-operation grouping (the .by argument). Make sure you are using a recent version of dplyr.

library(dplyr) # >= v1.1.0

dat %>%
  mutate(c_id = consecutive_id(value), .by = id) %>%
  filter(value == 1, n() >= 5, .by = c(id, c_id)) %>%
  filter(row_number() <= 5, .by = id) %>%
  select(!c_id)

   id program_year value
1   1         2007     1
2   1         2008     1
3   1         2009     1
4   1         2010     1
5   1         2011     1
6   2         2008     1
7   2         2009     1
8   2         2010     1
9   2         2011     1
10  2         2012     1
11  3         2009     1
12  3         2010     1
13  3         2011     1
14  3         2012     1
15  3         2013     1
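
To make the c_id column in this answer less opaque, here is a minimal sketch on a toy vector. The vector v is made up for illustration and is not part of the question's data. consecutive_id() starts a new integer whenever the value changes from one element to the next, so it only detects changes in value. It does not look at program_year, which means the pipeline above implicitly assumes the rows are sorted by year within each id and that no years are missing.

```r
library(dplyr) # consecutive_id() requires dplyr >= 1.1.0

# Hypothetical toy vector, standing in for the value column of one id
v <- c(1, 1, 0, 1, 1, 1)

# Each run of identical adjacent values gets its own integer id
run_id <- consecutive_id(v)
run_id
#> [1] 1 1 2 3 3 3
```

Grouping by id and this run id is what lets n() >= 5 count the length of each uninterrupted run of value == 1.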

Answer 3

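The trick here is to create two helper columns: first a row-number column, then a dummy grouping column derived from it. At the end, you deselect the helper columns from the result.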

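df %>%
  filter(value != 0) %>%
  group_by(id) %>%
  arrange(program_year, .by_group = TRUE) %>%
  # year minus row number is constant within a run of consecutive years
  mutate(row_num = row_number(), dummygroup = program_year - row_num) %>%
  group_by(id, dummygroup) %>%
  # keep runs of at least 5 years, and only their first 5 rows
  filter(n() >= 5, row_number() <= 5) %>%
  ungroup() %>%
  select(-row_num, -dummygroup)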
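
Why the dummy group identifies consecutive years may not be obvious, so here is a small base R sketch. The years vector is a hypothetical example of what one id could look like after the value == 0 rows are dropped; the names run_key and run_len are illustrative only. Within a run of consecutive years, the year and the row position both increase by one, so their difference stays constant; a gap in the years shifts the difference to a new value.

```r
# Hypothetical years for one id after dropping the value == 0 rows
years <- c(2007L, 2009L, 2010L, 2011L, 2012L, 2013L)

# Year minus row position: constant within each run of consecutive years
run_key <- years - seq_along(years)
run_key
#> [1] 2006 2007 2007 2007 2007 2007

# Number of rows in each run, keyed by that constant
run_len <- table(run_key)
```

Here the run keyed 2007 has five rows, so it would pass an n() >= 5 filter, while the single 2007 observation (key 2006) would not.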

Answer 4

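Using by, for each ID we can first subset to the non-zero values, create a group for each set of consecutive years, and then subset to the group that which.max finds to have the most observations in the table. Next we take the head of each element of the list-like "by" object, limited to the min number of observations across IDs, and finally rbind the pieces.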

by(dat, dat$id, \(x) {
  x <- subset(x, x$value == 1)
  u <- cumsum(c(1, diff(x$program_year)) != 1) + 1
  tbl <- table(u)
  subset(x, u == which.max(tbl))
}) |> {\(.) lapply(., \(x) {
  m <- min(sapply(., nrow))
  transform(head(x, m), period=seq_len(m))
})}() |>  ## or `tail` instead of `head`
  do.call(what='rbind')
#      id program_year value period
# 1.1   1         2007     1      1
# 1.2   1         2008     1      2
# 1.3   1         2009     1      3
# 1.4   1         2010     1      4
# 1.5   1         2011     1      5
# 2.9   2         2008     1      1
# 2.10  2         2009     1      2
# 2.11  2         2010     1      3
# 2.12  2         2011     1      4
# 2.13  2         2012     1      5
# 3.17  3         2009     1      1
# 3.18  3         2010     1      2
# 3.19  3         2011     1      3
# 3.20  3         2012     1      4
# 3.21  3         2013     1      5

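This gives the largest possible subset of consecutive observations across the given IDs. As in the OP's expected output, the years do not match across IDs, so a new period variable is added to index them.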
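
The grouping expression for u in the code above can be checked on a small vector. This is only a sketch; the years vector is hypothetical. diff() gives the step between neighbouring years, a leading 1 is prepended so the first row never opens a new group, and cumsum() increments the group number every time a step differs from 1.

```r
# Hypothetical years for one id, with a gap after 2007
years <- c(2007L, 2009L, 2010L, 2011L)

# A new group starts wherever the step from the previous year is not 1
u <- cumsum(c(1, diff(years)) != 1) + 1
u
#> [1] 1 2 2 2

# Index of the group with the most rows
largest <- which.max(table(u))
```

Because the group labels are 1, 2, ..., the position returned by which.max() equals the label itself, which is why u == which.max(tbl) selects the longest run.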

Data:

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), program_year = c(2007L, 
2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2007L, 2008L, 2009L, 
2010L, 2011L, 2012L, 2013L, 2007L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L), value = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-21L))
