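I have several separate csv files, one per country pair, with their trade volume from 1870 to 2020 (from the COW trade dataset; here the smoothtotrade variable). Unfortunately, the dataset is only available through 2014, so all later values are set to NA.
After trying several ways to impute or predict the missing data, I decided it is best to simply carry forward the last available value (i.e. smoothtotrade from 2014). However, I can't get it to work. I have been using the imputeTS package, with the na_locf function. Can someone help me?
The list of data frames is called data_frames. My current code: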
library(imputeTS)
# Imputation function: carry the last non-missing value forward (LOCF)
impute_smoothtotrade <- function(ts_data) {
  ts_data_imputed <- na.locf(ts_data, option = "locf")
  return(ts_data_imputed)
}
# Loop through each data frame (time series) in the list
for (i in seq_along(data_frames)) {
  data_frames[[i]]$smoothtotrade <- impute_smoothtotrade(data_frames[[i]]$smoothtotrade)
}
Here is the result for a random country pair, which clearly shows that the 2014 value was not carried forward as expected.
51 AUT CMR 2010 11.484859
52 AUT CMR 2011 10.393110
53 AUT CMR 2012 6.902980
54 AUT CMR 2013 4.058900
55 AUT CMR 2014 9.018300
89 AUT CMR 2015 2.582298
90 AUT CMR 2016 2.582298
91 AUT CMR 2017 2.582298
92 AUT CMR 2018 2.582298
93 AUT CMR 2019 2.582298
94 AUT CMR 2020 2.582298
Two (of many) options:
Sample data
# Sample dataframes and data_frame list
df1 <- data.frame(country = c(rep("AAA", 11)), year = 2010:2020,
smoothtotrade = c(11.484859, 10.393110, 6.902980, 4.058900, 9.018300, rep(NA, 6)))
df2 <- data.frame(country = c(rep("BBB", 11)), year = 2010:2020,
smoothtotrade = c(12.484859, 1.393110, 3.902980, 8.058900, 5.018300, rep(NA, 6)))
df3 <- data.frame(country = c(rep("CCC", 11)), year = 2010:2020,
smoothtotrade = c(8.484859, 9.393110, 10.902980, 9.058900, 8.018300, rep(NA, 6)))
data_frames <- list(df1, df2, df3)
Option 1: using the dplyr and tidyr packages
library(dplyr)
library(tidyr)
# Single df with all dataframes
df4 <- bind_rows(data_frames, .id = "column_label")
result <- df4 %>%
group_by(country) %>%
fill(smoothtotrade, .direction = c("down")) %>%
ungroup()
result
# A tibble: 33 × 4
column_label country year smoothtotrade
<chr> <chr> <int> <dbl>
1 1 AAA 2010 11.5
2 1 AAA 2011 10.4
3 1 AAA 2012 6.90
4 1 AAA 2013 4.06
5 1 AAA 2014 9.02
6 1 AAA 2015 9.02
7 1 AAA 2016 9.02
8 1 AAA 2017 9.02
9 1 AAA 2018 9.02
10 1 AAA 2019 9.02
# ℹ 23 more rows
# ℹ Use `print(n = ...)` to see more rows
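One thing worth checking: fill() carries values down in the current row order, and in the output shown in the question the row numbers jump from 55 to 89, which may mean the rows are not sorted by year. Assuming that is the cause, a minimal sketch is to sort within each country before filling. The data frame df_unsorted below is a made-up example with rows deliberately out of order, not the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical example: rows deliberately shuffled out of year order
df_unsorted <- data.frame(
  country = "AAA",
  year = c(2015, 2010, 2014, 2016, 2013),
  smoothtotrade = c(NA, 11.484859, 9.018300, NA, 4.058900)
)

result_sorted <- df_unsorted %>%
  arrange(country, year) %>%                    # put years in order first
  group_by(country) %>%
  fill(smoothtotrade, .direction = "down") %>%  # then carry the last value forward
  ungroup()

result_sorted
```

With the rows sorted, 2015 and 2016 take the 2014 value instead of whatever non-missing value happened to sit above them.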
Option 2: using your original approach
for (i in seq_along(data_frames)) {
data_frames[[i]]$smoothtotrade <-
ifelse(is.na(data_frames[[i]]$smoothtotrade),
data_frames[[i]]$smoothtotrade[max(which(!is.na(data_frames[[i]]$smoothtotrade)))],
data_frames[[i]]$smoothtotrade)
}
data_frames
[[1]]
country year smoothtotrade
1 AAA 2010 11.48486
2 AAA 2011 10.39311
3 AAA 2012 6.90298
4 AAA 2013 4.05890
5 AAA 2014 9.01830
6 AAA 2015 9.01830
7 AAA 2016 9.01830
8 AAA 2017 9.01830
9 AAA 2018 9.01830
10 AAA 2019 9.01830
11 AAA 2020 9.01830
[[2]]
country year smoothtotrade
1 BBB 2010 12.48486
2 BBB 2011 1.39311
3 BBB 2012 3.90298
4 BBB 2013 8.05890
5 BBB 2014 5.01830
6 BBB 2015 5.01830
7 BBB 2016 5.01830
8 BBB 2017 5.01830
9 BBB 2018 5.01830
10 BBB 2019 5.01830
11 BBB 2020 5.01830
[[3]]
country year smoothtotrade
1 CCC 2010 8.484859
2 CCC 2011 9.393110
3 CCC 2012 10.902980
4 CCC 2013 9.058900
5 CCC 2014 8.018300
6 CCC 2015 8.018300
7 CCC 2016 8.018300
8 CCC 2017 8.018300
9 CCC 2018 8.018300
10 CCC 2019 8.018300
11 CCC 2020 8.018300
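Note that the loop above picks the last non-missing value by row position and uses it for every NA in the column, so it relies on the rows being in year order and on the NAs all being at the end. If either is in doubt, a base R sketch that keys on the year column instead might look like this. The helper carry_last_forward and the df_shuffled example are hypothetical names made up for illustration.

```r
# Hypothetical helper (not from any package): fill only the NAs that come
# after the last observed year, using the value from that year.
carry_last_forward <- function(df, value_col = "smoothtotrade", year_col = "year") {
  observed <- !is.na(df[[value_col]])
  if (!any(observed)) return(df)  # nothing to carry forward

  last_year  <- max(df[[year_col]][observed])
  last_value <- df[[value_col]][observed & df[[year_col]] == last_year][1]

  to_fill <- is.na(df[[value_col]]) & df[[year_col]] > last_year
  df[[value_col]][to_fill] <- last_value
  df
}

# Made-up example with rows out of year order
df_shuffled <- data.frame(
  country = "AAA",
  year = c(2016, 2010, 2014, 2015, 2013),
  smoothtotrade = c(NA, 11.484859, 9.018300, NA, 4.058900)
)

filled <- carry_last_forward(df_shuffled)
filled
```

For the list in the question this could be applied with data_frames <- lapply(data_frames, carry_last_forward). Missing values before the last observed year are left as NA on purpose.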