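I have a data frame with four columns: the first holds the county name, the second the period of measurement, the third the actual measured value (IPC class) and the fourth the predicted value (Forecast). Both the actual and the predicted values range from 1 to 5. These are the first 32 rows of the data frame, sorted by county: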
structure(list(County = c("Baringo", "Baringo", "Baringo", "Baringo",
"Baringo", "Baringo", "Baringo", "Baringo", "Baringo", "Baringo",
"Baringo", "Baringo", "Baringo", "Baringo", "Baringo", "Baringo",
"Baringo", "Baringo", "Baringo", "Baringo", "Baringo", "Baringo",
"Baringo", "Baringo", "Baringo", "Baringo", "Baringo", "Baringo",
"Baringo", "Baringo", "Baringo", "Baringo"), `Period of measurement Kenya` = c("2011-01",
"2011-04", "2011-07", "2011-10", "2012-01", "2012-04", "2012-07",
"2012-10", "2013-01", "2013-04", "2013-07", "2013-10", "2014-01",
"2014-04", "2014-07", "2014-10", "2015-01", "2015-04", "2015-07",
"2015-10", "2016-02", "2016-06", "2016-10", "2017-02", "2017-06",
"2017-10", "2018-02", "2018-06", "2018-10", "2018-12", "2019-02",
"2019-06"), `IPC class` = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 3, 2, 1, 1, 1, 1, 1, 2
), Forecast = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 3, 1, 1, 1, 1, 2, 1)), row.names = c(1L,
48L, 95L, 142L, 189L, 236L, 283L, 330L, 377L, 424L, 471L, 518L,
565L, 612L, 659L, 706L, 753L, 800L, 847L, 894L, 941L, 988L, 1035L,
1082L, 1129L, 1176L, 1223L, 1270L, 1317L, 1364L, 1411L, 1458L
), class = "data.frame")
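For my report, I need to know how many crisis transitions occurred during the period I am studying, and how many of those crisis transitions were mispredicted. A crisis transition means that the value in the IPC class column goes from 1 or 2 to 3, 4 or 5. In the excerpt above, you can see that Baringo county had 1 crisis transition. To count the transitions, I used the following code: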
SUB_count_cristrans_KE <- long.SUB_dfCSKE_tot %>% mutate(crisis = ifelse(`IPC class` %in% 3:5, 1, 0)) %>%
arrange(County, `Period of measurement Kenya`) %>%
group_by(County) %>%
summarize(SUB_crisis_trans_count = sum(diff(crisis) > 0))
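To make the counting rule concrete, here is a minimal base R sketch of what sum(diff(crisis) > 0) does for a single county. The vector ipc is a hypothetical series made up for illustration; it is not taken from the data above.

```r
# Hypothetical IPC series for one county (illustrative values only)
ipc <- c(2, 2, 3, 3, 2, 1, 4, 5, 2)

# 1 when the period is in crisis (IPC class 3, 4 or 5), otherwise 0
crisis <- ifelse(ipc %in% 3:5, 1, 0)

# diff() returns crisis[i] - crisis[i - 1]; a value of 1 marks a 0 -> 1 step,
# i.e. a move from a non-crisis class into a crisis class
steps <- diff(crisis)

# Count only the upward steps; staying in crisis (0) or leaving it (-1) is ignored
n_transitions <- sum(steps > 0)
```

Consecutive crisis periods (3 followed by 3, or 4 followed by 5) give a difference of 0, so only the first period of each crisis spell is counted.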
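A mispredicted crisis transition is one where, in the period in which the crisis transition occurs, the Forecast column does not agree with the IPC class column. As you can see in the excerpt above, the crisis transition in Baringo was mispredicted, because the value in the Forecast column is not 3, 4 or 5. So my question is: what is the right condition in the ifelse function to count the mispredicted crisis transitions per county? In other words: it first has to check whether a period is a crisis transition, i.e. whether the IPC class goes from 1 or 2 to 3, 4 or 5. If so, it has to check whether the value in the Forecast column is 3, 4 or 5. If it is not, that is a mispredicted crisis transition. The code I have now is:
SUB_count_crismiss_KE <- long.SUB_dfCSKE_tot %>% mutate(crisis_miss = ifelse(`IPC class` %in% 3:5 & (!Forecast %in% 3:5), 1, 0)) %>%
arrange(County, `Period of measurement Kenya`) %>%
group_by(County) %>%
summarize(SUB_crisis_miss_count_KE = sum(diff(crisis_miss) > 0))
Let me know if I need to add or clarify anything! Thanks in advance.
Below I have highlighted Garissa county to make clearer what I am trying to solve. ;)
Between 2011-04 and 2011-07 a crisis transition occurred: the IPC value went from 2 to 3. Between 2011-07 and 2011-10, however, there was no crisis transition, because the IPC value stayed at 3. Now for the misprediction part. The crisis transition between 2011-04 and 2011-07 was predicted correctly, since the forecast value was 3, 4 or 5. The forecast for 2011-10 was wrong, but because there was no crisis transition in that period it should not be counted. So how can I skip the forecast values of periods in which no crisis transition occurs? I hope this makes it clearer.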
The Garissa county subset, followed by its dput:
> subset(sorted_long.SUB_dfCSKE_tot, County=="Garissa")
County Period of measurement Kenya IPC class Forecast
7 Garissa 2011-01 2 3
54 Garissa 2011-04 2 2
101 Garissa 2011-07 3 3
148 Garissa 2011-10 3 2
195 Garissa 2012-01 2 2
242 Garissa 2012-04 2 2
289 Garissa 2012-07 3 3
336 Garissa 2012-10 3 2
383 Garissa 2013-01 2 2
430 Garissa 2013-04 2 2
477 Garissa 2013-07 2 2
524 Garissa 2013-10 2 2
571 Garissa 2014-01 2 2
618 Garissa 2014-04 2 2
665 Garissa 2014-07 2 2
712 Garissa 2014-10 3 2
759 Garissa 2015-01 3 2
806 Garissa 2015-04 3 2
853 Garissa 2015-07 2 2
900 Garissa 2015-10 2 2
947 Garissa 2016-02 2 2
994 Garissa 2016-06 2 2
1041 Garissa 2016-10 2 2
1088 Garissa 2017-02 3 2
1135 Garissa 2017-06 3 3
1182 Garissa 2017-10 2 3
1229 Garissa 2018-02 3 2
1276 Garissa 2018-06 1 3
1323 Garissa 2018-10 1 1
1370 Garissa 2018-12 2 1
1417 Garissa 2019-02 2 2
1464 Garissa 2019-06 2 2
> copied_sorted_long <- dput(sorted_long.SUB_dfCSKE_tot[193:224,])
structure(list(County = c("Garissa", "Garissa", "Garissa", "Garissa",
"Garissa", "Garissa", "Garissa", "Garissa", "Garissa", "Garissa",
"Garissa", "Garissa", "Garissa", "Garissa", "Garissa", "Garissa",
"Garissa", "Garissa", "Garissa", "Garissa", "Garissa", "Garissa",
"Garissa", "Garissa", "Garissa", "Garissa", "Garissa", "Garissa",
"Garissa", "Garissa", "Garissa", "Garissa"), `Period of measurement Kenya` = c("2011-01",
"2011-04", "2011-07", "2011-10", "2012-01", "2012-04", "2012-07",
"2012-10", "2013-01", "2013-04", "2013-07", "2013-10", "2014-01",
"2014-04", "2014-07", "2014-10", "2015-01", "2015-04", "2015-07",
"2015-10", "2016-02", "2016-06", "2016-10", "2017-02", "2017-06",
"2017-10", "2018-02", "2018-06", "2018-10", "2018-12", "2019-02",
"2019-06"), `IPC class` = c(2, 2, 3, 3, 2, 2, 3, 3, 2, 2, 2,
2, 2, 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 2
), Forecast = c(3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2)), row.names = c(7L,
54L, 101L, 148L, 195L, 242L, 289L, 336L, 383L, 430L, 477L, 524L,
571L, 618L, 665L, 712L, 759L, 806L, 853L, 900L, 947L, 994L, 1041L,
1088L, 1135L, 1182L, 1229L, 1276L, 1323L, 1370L, 1417L, 1464L
), class = "data.frame")
I have now created a variable named data that contains the Garissa data (to keep the names simple). If I understand you correctly, you want to count a misforecast only when an actual transition occurs. If there is no transition then, by definition, there is no misforecast (or we do not care about those cases). In that case, I think the code below does what you need (the intermediate data1 step and the summary step can of course be combined into one long pipe). For clarity, the data frame called data below is the same as the Garissa subset you provided through dput.
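library(dplyr)

data1 <- data %>%
  # 1 when the observed / forecast class is a crisis class (3, 4 or 5), else 0
  mutate(crisis = ifelse(`IPC class` %in% 3:5, 1, 0),
         crisis_f = ifelse(Forecast %in% 3:5, 1, 0)) %>%
  arrange(County, `Period of measurement Kenya`) %>%
  group_by(County) %>%
  # a transition is a 0 -> 1 step from the previous period (NA in the first row);
  # the forecast is compared on the crisis indicator, so a rise from 1 to 2 does not count
  mutate(crisis_trans = (crisis - lag(crisis)) > 0,
         crisis_trans_f = (crisis_f - lag(crisis_f)) > 0,
         misforecast = case_when(
           crisis_trans & crisis_trans_f ~ FALSE,   # transition that was forecast
           crisis_trans & !crisis_trans_f ~ TRUE,   # transition that was missed
           TRUE ~ FALSE                             # no transition, or first row
         ))
summary <- data1 %>%
  group_by(County) %>%
  summarise(n_transitions = sum(crisis_trans, na.rm = TRUE),
            n_misforecast = sum(misforecast))
The logic is as follows: we first flag the actual transitions and the forecast transitions. Then, if and only if there is an actual transition, we classify it as a misforecast when the forecast does not predict that transition. All other cases are classified as "no misforecast". You do not strictly need case_when here, but I like it because it makes it clear what is happening.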
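The answer's code treats a transition as forecast when the forecast itself steps up. The question states a slightly different rule: a transition counts as correctly forecast when the Forecast value in the transition period is 3, 4 or 5. A minimal sketch of that reading follows, offered as one possible interpretation rather than the answerer's method. The IPC and Forecast values are copied from the Baringo and Garissa dput output in the question; the column names ipc and period are stand-ins introduced here to replace `IPC class` and `Period of measurement Kenya`.

```r
library(dplyr)

# Values copied from the question's dput output for the two counties
ipc_baringo <- c(rep(2, 20), 1, 1, 1, 2, 3, 2, 1, 1, 1, 1, 1, 2)
fc_baringo  <- c(rep(2, 16), 1, 1, 2, 2, 1, 1, 2, 1, 2, 3, 1, 1, 1, 1, 2, 1)
ipc_garissa <- c(2, 2, 3, 3, 2, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
                 2, 2, 2, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 2)
fc_garissa  <- c(3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2)

# `period` is a hypothetical integer index standing in for the period strings
demo <- data.frame(
  County   = rep(c("Baringo", "Garissa"), each = 32),
  period   = rep(1:32, times = 2),
  ipc      = c(ipc_baringo, ipc_garissa),
  Forecast = c(fc_baringo, fc_garissa)
)

result <- demo %>%
  arrange(County, period) %>%
  group_by(County) %>%
  mutate(
    crisis = ipc %in% 3:5,
    # The first period has no predecessor; default = TRUE makes it a non-transition
    crisis_trans = crisis & !lag(crisis, default = TRUE),
    # Missed only if this period is a transition AND the forecast is not a crisis class
    missed = crisis_trans & !(Forecast %in% 3:5)
  ) %>%
  summarise(n_transitions = sum(crisis_trans),
            n_missed = sum(missed))
```

On these two counties the sketch should report one transition with one miss for Baringo, and five transitions with three misses for Garissa. The two readings differ when the forecast was already 3 or higher in the period before the transition: the step-based rule counts that as a miss, while this level-based rule counts it as a correct forecast.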