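I'm working with a spreadsheet of conflict events in the United States. Each row represents one event and includes geographic and temporal information. Conflict events tend to occur in "waves" (relatively tight groupings in time). I have generated an identifier variable for each wave, and I want to create a variable that measures the geographic spread of the conflict events over the course of each wave.
I wanted to do this in Excel, but unfortunately I don't have dynamic array formulas available. Before upgrading to a newer version of Excel, I'd like to see whether it can be done in R. The data are already sorted by region, date, and wave.
The dataset is structured as follows: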
Country  Region     Date       Event    Wave
-------  ---------  ---------  -------  ------
USA      Vermont    5/1/2017   Strike   Wave 1
USA      Vermont    5/2/2017   Strike   Wave 1
USA      New Hamp.  5/3/2017   Strike   Wave 1
USA      Vermont    5/3/2017   Strike   Wave 1
USA      Maine      5/4/2017   Strike   Wave 1
USA      Washingt.  8/16/2018  Riot     Wave 2
USA      Washingt.  8/18/2018  Riot     Wave 2
USA      Oregon     8/18/2018  Protest  Wave 2
USA      Californ.  8/19/2018  Riot     Wave 2
USA      Nevada     8/20/2018  Protest  Wave 2
USA      Idaho      8/20/2018  Riot     Wave 2
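I want to create a variable ("geo_disp") that records the number of regions in which conflict has occurred so far within a given wave. Over the course of a wave I expect the number of regions to increase, and I want geo_disp to record that.
You'll notice that when two events occur on the same day but in different locations, both events record the total number of regions reached by that day.
This is what I'd like the data to look like: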
Country  Region     Date       Event    Wave    geo_disp
-------  ---------  ---------  -------  ------  --------
USA      Vermont    5/1/2017   Strike   Wave 1  1
USA      Vermont    5/2/2017   Strike   Wave 1  1
USA      New Hamp.  5/3/2017   Strike   Wave 1  2
USA      Vermont    5/3/2017   Strike   Wave 1  2
USA      Maine      5/4/2017   Strike   Wave 1  3
USA      Washingt.  8/16/2018  Riot     Wave 2  1
USA      Washingt.  8/18/2018  Riot     Wave 2  2
USA      Oregon     8/18/2018  Protest  Wave 2  2
USA      Californ.  8/19/2018  Riot     Wave 2  3
USA      Nevada     8/20/2018  Protest  Wave 2  5
USA      Idaho      8/20/2018  Riot     Wave 2  5
How can I create the geo_disp variable in R?
Thanks in advance, much appreciated.
If you don't mind dropping repeated regions within the same wave, you can try this approach with the tidyverse:
library(tidyverse)
df <- tribble(
~Country, ~Region, ~Date, ~Event, ~Wave,
'USA', 'Vermont', '5/1/2017', 'Strike', 'Wave 1',
'USA', 'Vermont', '5/2/2017', 'Strike', 'Wave 1',
'USA', 'New Hamp.', '5/3/2017', 'Strike', 'Wave 1',
'USA', 'Vermont', '5/3/2017', 'Strike', 'Wave 1',
'USA', 'Maine', '5/4/2017', 'Strike', 'Wave 1',
'USA', 'Washingt.', '8/16/2018', 'Riot', 'Wave 2',
'USA', 'Washingt.', '8/18/2018', 'Riot', 'Wave 2',
'USA', 'Oregon', '8/18/2018', 'Protest', 'Wave 2',
'USA', 'Californ.', '8/19/2018', 'Riot', 'Wave 2',
'USA', 'Nevada', '8/20/2018', 'Protest', 'Wave 2',
'USA', 'Idaho', '8/20/2018', 'Riot', 'Wave 2'
)
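# Keep the first event for each region within each wave, then number the rows in order.
df %>%
  distinct(Wave, Region, .keep_all = TRUE) %>%
  group_by(Wave) %>%
  mutate(geo_disp = row_number()) %>%
  ungroup()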
Note that dput() is a good way to make data easy to share in R.
> dput(df)
structure(list(Country = c("USA", "USA", "USA", "USA", "USA",
"USA", "USA", "USA", "USA", "USA", "USA"), Region = c("Vermont",
"Vermont", "New Hamp.", "Vermont", "Maine", "Washingt.", "Washingt.",
"Oregon", "Californ.", "Nevada", "Idaho"), Date = c("5/1/2017",
"5/2/2017", "5/3/2017", "5/3/2017", "5/4/2017", "8/16/2018",
"8/18/2018", "8/18/2018", "8/19/2018", "8/20/2018", "8/20/2018"
), Event = c("Strike", "Strike", "Strike", "Strike", "Strike",
"Riot", "Riot", "Protest", "Riot", "Protest", "Riot"), Wave = c("Wave 1",
"Wave 1", "Wave 1", "Wave 1", "Wave 1", "Wave 2", "Wave 2", "Wave 2",
"Wave 2", "Wave 2", "Wave 2")), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"))
A dplyr solution that keeps the whole dataset:
library(dplyr)

df %>%
  group_by(Wave) %>%
  # Number each region by its order of first appearance within the wave.
  mutate(geo_disp = as.numeric(factor(Region, levels = unique(Region))))
#> # A tibble: 11 x 6
#> # Groups:   Wave [2]
#>    Country Region    Date      Event   Wave   geo_disp
#>    <chr>   <chr>     <chr>     <chr>   <chr>     <dbl>
#>  1 USA     Vermont   5/1/2017  Strike  Wave_1        1
#>  2 USA     Vermont   5/2/2017  Strike  Wave_1        1
#>  3 USA     New_Hamp. 5/3/2017  Strike  Wave_1        2
#>  4 USA     Vermont   5/3/2017  Strike  Wave_1        1
#>  5 USA     Maine     5/4/2017  Strike  Wave_1        3
#>  6 USA     Washingt. 8/16/2018 Riot    Wave_2        1
#>  7 USA     Washingt. 8/18/2018 Riot    Wave_2        1
#>  8 USA     Oregon    8/18/2018 Protest Wave_2        2
#>  9 USA     Californ. 8/19/2018 Riot    Wave_2        3
#> 10 USA     Nevada    8/20/2018 Protest Wave_2        4
#> 11 USA     Idaho     8/20/2018 Riot    Wave_2        5
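The output above numbers each region by its order of first appearance, so rows 4, 7, and 10 show 1, 1, and 4 where the table in the question asks for 2, 2, and 5. One possible way to get the running count with same-day ties is sketched below in base R. It is only a sketch: it assumes dates are in month/day/year format, the helper name `regions_so_far` is made up for illustration, and the per-wave comparison is quadratic in the number of events, which should be fine for modest data but may be slow for very long waves.

```r
# Sample data copied from the question (only the columns needed here).
df <- data.frame(
  Region = c("Vermont", "Vermont", "New Hamp.", "Vermont", "Maine",
             "Washingt.", "Washingt.", "Oregon", "Californ.", "Nevada", "Idaho"),
  Date = c("5/1/2017", "5/2/2017", "5/3/2017", "5/3/2017", "5/4/2017",
           "8/16/2018", "8/18/2018", "8/18/2018", "8/19/2018",
           "8/20/2018", "8/20/2018"),
  Wave = rep(c("Wave 1", "Wave 2"), times = c(5, 6)),
  stringsAsFactors = FALSE
)

# Assumption: dates are month/day/year strings; parse them so they compare correctly.
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

# Hypothetical helper: for each event in one wave, count the distinct
# regions with an event on or before that event's date.
regions_so_far <- function(region, date) {
  vapply(seq_along(date),
         function(i) length(unique(region[date <= date[i]])),
         integer(1))
}

# Apply the helper wave by wave, writing results back in the original row order.
df$geo_disp <- NA_integer_
for (idx in split(seq_len(nrow(df)), df$Wave)) {
  df$geo_disp[idx] <- regions_so_far(df$Region[idx], df$Date[idx])
}

print(df)
```

Because each row is compared against dates rather than row positions, this version does not depend on the rows being sorted, and events that share a date within a wave always receive the same value.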
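We can use match with data.table:
library(data.table)

# Within each wave, number each region by its order of first appearance.
setDT(df)[, geo_disp := match(Region, unique(Region)), Wave]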
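To see what the match() call above computes, here is a small base R illustration using the Wave 1 regions from the question. The variable names are chosen just for this example.

```r
# Regions of "Wave 1" in event order, copied from the question's sample data.
region <- c("Vermont", "Vermont", "New Hamp.", "Vermont", "Maine")

# unique() keeps regions in order of first appearance;
# match() returns each event's position in that de-duplicated vector.
first_seen_rank <- match(region, unique(region))
print(first_seen_rank)
# [1] 1 1 2 1 3

# The factor-based approach gives the same numbering.
factor_rank <- as.integer(factor(region, levels = unique(region)))
print(identical(first_seen_rank, factor_rank))
# [1] TRUE
```

Like the factor-based version, this labels each region by first appearance rather than counting the regions reached by each date, so a region that recurs later in a wave keeps its original number.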