Two questions about creating lags within groups and over time

Question

I have a data frame x like this:

ID year    month     vol   sum_vol        
1   2000     1        1       6                 
1   2000     2        2       6                  
1   2000     3        3       6                  
1   2001     3        4       25                  
1   2001     4        5       25                  
1   2001     5        16      25                  
2   2000     1        7       24                
2   2000     2        8       24                 
2   2000     3        9       24                
2   2001     3        12      35                 
2   2001     4        11      35                 
2   2001     5        12      35                 
3   2000     1        13      42                 
3   2000     2        14      42                 
3   2000     3        15      42                 
3   2001     6        16      44          
3   2001     7        10      44
3   2001     8        18      44

and the desired output:

ID year    month     vol   sum_vol        lag_year_sum_vol      lag_2_month_vol
1   2000     1        1       6                  NA                    NA
1   2000     2        2       6                  NA                    NA
1   2000     3        3       6                  NA                    1
1   2001     3        4       25                  6                    NA
1   2001     4        5       25                  6                    NA
1   2001     5        16      25                  6                    4
2   2000     1        7       24                 NA                    NA
2   2000     2        8       24                 NA                    NA
2   2000     3        9       24                 NA                    7
2   2001     3        12      35                 24                    NA
2   2001     4        11      35                 24                    NA
2   2001     5        12      35                 24                    12
3   2000     1        13      42                 NA                    NA
3   2000     2        14      42                 NA                    NA
3   2000     3        15      42                 NA                    13
3   2001     6        16      44                 42                    NA
3   2001     7        10      44                 42                    NA
3   2001     8        18      44                 42                    16

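I have searched a lot but could not find a solution.

So, as you can see, my questions are:

1) How do I create, for each ID, a variable lag_year_sum_vol whose value is the previous year's sum_vol?

2) How do I create, for each customer and each year, a new variable lag_2_month_vol that lags vol by 2 months?

Note that in the real data, ID, year, and month may not be sorted in this order, and year, month, and vol can take arbitrary values: there is no pattern in the data.

I would prefer an approach in dplyr or data.table (data.table seems more concise).

Thanks in advance!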

Answer 1

Here is one approach using dplyr:

library(dplyr)

df %>%
  #arrange data by ID, year and month
  arrange(ID, year, month) %>%
  #group by ID
  group_by(ID) %>%
  #Get previous value of sum_vol
  mutate(lag_year_sum_vol = lag(sum_vol)) %>%
  #group by ID and year
  group_by(year, .add = TRUE) %>%
  #For older dplyr use
  #group_by(year, add = TRUE) %>%
  #get previous 2 months vol
  mutate(lag_2_month_vol = lag(vol, 2), 
  #Except 1st row in each group replace everything with NA
         lag_year_sum_vol = replace(lag_year_sum_vol, -1, NA)) %>%
  #Fill with 1st value in group
  tidyr::fill(lag_year_sum_vol)

This returns:

#      ID  year month   vol sum_vol lag_year_sum_vol lag_2_month_vol
#   <int> <int> <int> <int>   <int>            <int>           <int>
# 1     1  2000     1     1       6               NA              NA
# 2     1  2000     2     2       6               NA              NA
# 3     1  2000     3     3       6               NA               1
# 4     1  2001     3     4      25                6              NA
# 5     1  2001     4     5      25                6              NA
# 6     1  2001     5    16      25                6               4
# 7     2  2000     1     7      24               NA              NA
# 8     2  2000     2     8      24               NA              NA
# 9     2  2000     3     9      24               NA               7
#10     2  2001     3    12      35               24              NA
#11     2  2001     4    11      35               24              NA
#12     2  2001     5    12      35               24              12
#13     3  2000     1    13      42               NA              NA
#14     3  2000     2    14      42               NA              NA
#15     3  2000     3    15      42               NA              13
#16     3  2001     6    16      44               42              NA
#17     3  2001     7    10      44               42              NA
#18     3  2001     8    18      44               42              16
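
Note that the pipeline above lags by row position: lag(sum_vol) picks up the previous available year, and lag(vol, 2) picks up the row two positions earlier, whatever its month. That matches the sample data, but since the question says the real data has no pattern, a calendar-based variant may be safer. The following is only a sketch under that assumption; the toy data frame and the names toy, year_lookup, and month_lookup are made up for illustration. It shifts the keys forward and joins them back, so a missing year or month yields NA.

```r
library(dplyr)

# Hypothetical toy data: ID 1 has no rows for 2001 and no month 2 in 2000
toy <- data.frame(
  ID      = c(1L, 1L, 1L, 1L, 2L, 2L),
  year    = c(2000L, 2000L, 2002L, 2000L, 2000L, 2001L),
  month   = c(1L, 3L, 1L, 4L, 1L, 1L),
  vol     = c(1L, 3L, 5L, 4L, 7L, 9L),
  sum_vol = c(8L, 8L, 5L, 8L, 7L, 9L)
)

# One row per (ID, year), with year shifted forward so that it
# matches the rows of the following calendar year
year_lookup <- toy %>%
  distinct(ID, year, sum_vol) %>%
  mutate(year = year + 1L) %>%
  rename(lag_year_sum_vol = sum_vol)

# Each month's vol, with month shifted forward so that it
# matches the row two calendar months later in the same year
month_lookup <- toy %>%
  select(ID, year, month, lag_2_month_vol = vol) %>%
  mutate(month = month + 2L)

res <- toy %>%
  left_join(year_lookup, by = c("ID", "year")) %>%
  left_join(month_lookup, by = c("ID", "year", "month")) %>%
  arrange(ID, year, month)

print(res)
```

In this toy data, ID 1's 2002 row gets NA for lag_year_sum_vol because 2001 is absent, and its month 4 row gets NA for lag_2_month_vol because month 2 is absent.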

Data

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2000L, 2000L, 
2001L, 2001L, 2001L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 
2000L, 2000L, 2000L, 2001L, 2001L, 2001L), month = c(1L, 2L, 
3L, 3L, 4L, 5L, 1L, 2L, 3L, 3L, 4L, 5L, 1L, 2L, 3L, 6L, 7L, 8L
), vol = c(1L, 2L, 3L, 4L, 5L, 16L, 7L, 8L, 9L, 12L, 11L, 12L, 
13L, 14L, 15L, 16L, 10L, 18L), sum_vol = c(6L, 6L, 6L, 25L, 25L, 
25L, 24L, 24L, 24L, 35L, 35L, 35L, 42L, 42L, 42L, 44L, 44L, 44L
)), class = "data.frame", row.names = c(NA, -18L))

Answer 2

Here is an option using data.table:

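library(data.table)

# assumption: dt is the sample data (df from the Data section above) as a data.table
dt <- as.data.table(df)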

# column 1
dt[dt[, .(ID, year = year +1, sum_vol)], on = .(ID, year), 
   lag_year_sum_vol := i.sum_vol]

# column 2
dt[dt[, .(ID, year, month = month+2, vol)], on = .(ID, year, month),
   lag_2_month_vol := i.vol]

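As you can see, in both cases I build a temporarily modified copy of the data, join it back, and update the original data by reference. Of course, there are other ways to do this with data.table.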
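
To make the update-join idea easier to follow, here is a small sketch on deliberately unsorted data. The toy table and the names toy, prev_year, prev_month, prev_sum, and prev_vol are made up for illustration. Deduplicating the yearly lookup with unique() is an optional addition, not part of the answer above.

```r
library(data.table)

# Hypothetical toy data, deliberately unsorted
toy <- data.table(
  ID      = c(1L, 1L, 1L, 1L, 1L),
  year    = c(2001L, 2000L, 2000L, 2001L, 2000L),
  month   = c(5L, 1L, 3L, 3L, 2L),
  vol     = c(16L, 1L, 3L, 4L, 2L),
  sum_vol = c(20L, 6L, 6L, 20L, 6L)
)

# Lookup table: one row per (ID, year), keyed to the following year
prev_year <- unique(toy[, .(ID, year = year + 1L, prev_sum = sum_vol)])

# Update join: where the keys match, add the column to toy by reference
toy[prev_year, on = .(ID, year), lag_year_sum_vol := i.prev_sum]

# Same idea for the 2-month lag, keyed to the month two later
prev_month <- toy[, .(ID, year, month = month + 2L, prev_vol = vol)]
toy[prev_month, on = .(ID, year, month), lag_2_month_vol := i.prev_vol]

# Sort only for display; the joins above do not depend on row order
setorder(toy, ID, year, month)
print(toy)
```

Because the match is on key values rather than row position, the input order does not matter, and rows without a matching earlier year or month are left as NA.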

The result is:

    ID year month vol sum_vol lag_year_sum_vol lag_2_month_vol
 1:  1 2000     1   1       6               NA              NA
 2:  1 2000     2   2       6               NA              NA
 3:  1 2000     3   3       6               NA               1
 4:  1 2001     3   4      25                6              NA
 5:  1 2001     4   5      25                6              NA
 6:  1 2001     5  16      25                6               4
 7:  2 2000     1   7      24               NA              NA
 8:  2 2000     2   8      24               NA              NA
 9:  2 2000     3   9      24               NA               7
10:  2 2001     3  12      35               24              NA
11:  2 2001     4  11      35               24              NA
12:  2 2001     5  12      35               24              12
13:  3 2000     1  13      42               NA              NA
14:  3 2000     2  14      42               NA              NA
15:  3 2000     3  15      42               NA              13
16:  3 2001     6  16      44               42              NA
17:  3 2001     7  10      44               42              NA
18:  3 2001     8  18      44               42              16
