I have a dataset that looks like this:
set.seed(2)
origin <- rep(c("DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR"), 2)
year <- rep(c(1998,1998,1998,1998,1998,1998,1998,1998,1998,1998,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000), 2)
value <- sample(1:10000, size=length(origin), replace=TRUE)
test.df <- as.data.frame(cbind(origin, year, value))
rm(origin, year, value)
Then I have two lists.
The first is a list of countries by region, built with the ISOcodes package, as follows:
library("ISOcodes")
list.continent <- list(asia = c("Central Asia", "Eastern Asia", "South-eastern Asia", "Southern Asia", "Western Asia"),
africa = c("Northern Africa", "Sub-Saharan Africa", "Eastern Africa", "Middle Africa", "Southern Africa", "Western Africa"),
europe = c("Eastern Europe", "Northern Europe", "Channel Islands", "Southern Europe", "Western Europe"),
oceania = c("Australia and New Zealand", "Melanesia", "Micronesia", "Polynesia"),
northamerica = c("Northern America"),
latinamerica = c("South America", "Central America", "Caribbean"))
country.list.continent <- sapply(list.continent, function(item) {
region <- subset(UN_M.49_Regions, Name %in% item)
sub <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
return(sub$ISO_Alpha_3)
}, simplify = FALSE)
rm(list.continent)
The other is a list of years:
year.list <- levels(as.factor(unique(test.df$year)))
I want to fill a matrix with computed figures, each corresponding to a given region in a given year. The matrix is set up as follows:
ncol <- length(year.list)
nrow <- length(country.list.continent)
matrix.extraction <- matrix(, nrow = nrow, ncol = ncol)
rownames(matrix.extraction) <- names(country.list.continent)
colnames(matrix.extraction) <- year.list
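To do my computations, I use a loop, because it lets me work on subsets of a dataset that would otherwise be too large. The loop runs over the years (that is, over colnames(matrix.extraction)). The idea is to compute the value for each country/region in each year (in %). The calculation part is simple enough and works well. My problem arises when I need to assign the values to each row.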
for(i in 1:length(colnames(matrix.extraction))){
### I subset and compute what I want
table.temp <- test.df %>%
subset(year == colnames(matrix.extraction)[i]) %>%
group_by(origin) %>%
summarise(value = sum(value, na.rm = TRUE))
table.temp$percent <- prop.table(table.temp$value)
### then I need to attribute the wanted values
matrix.extraction["ROWNAME",i] <- table.temp %>%
subset(origin %in% country.list.continent$"ROWNAME") %>%
summarise(., sum = sum(percent))
}
I really don't know how I could do this.
The expected result is a matrix like:
1998 2000
asia here NA
africa NA NA
europe NA NA
oceania NA NA
northamerica NA NA
latinamerica NA NA
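Here, "here" in cell [1,1] stands for the sum of the values, for the year given by the column name, over all countries in the region given by the row name.
Any help would be much appreciated.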
Using a double sapply, we can loop over all combinations of year.list and country.list.continent and compute the sum of value for each combination.
sapply(year.list, function(x) sapply(names(country.list.continent), function(y) {
with(test.df, sum(value[origin %in% country.list.continent[[y]] & year == x]))
}))
# 1998 2000
#asia 21759 20059
#africa 0 0
#europe 39700 35981
#oceania 0 0
#northamerica 21347 17324
#latinamerica 10847 8672
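The question asked for shares (in %) rather than raw sums. A minimal sketch of one way to get there, assuming a small made-up matrix `sums` that stands in for the sapply() result above (the numbers are hypothetical, not taken from test.df): prop.table() with margin = 2 divides each column by its own total.

```r
# Hypothetical sums standing in for the sapply() result (regions x years)
sums <- matrix(c(20, 30, 50, 10, 10, 80), nrow = 3,
               dimnames = list(c("asia", "europe", "northamerica"),
                               c("1998", "2000")))

# margin = 2 divides every column by its own total, giving per-year shares
shares <- prop.table(sums, margin = 2)
shares

# Sanity check: every year's shares add up to 1
colSums(shares)
```

Multiplying `shares` by 100 gives percentages. The result keeps the row and column names, so it already has the shape of matrix.extraction.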
If we are interested in a tidyverse solution:
library(tidyverse)
crossing(x = year.list, y = names(country.list.continent)) %>%
mutate(sum = map2_dbl(x, y, ~
test.df %>%
filter(year == .x & origin %in% country.list.continent[[.y]]) %>%
summarise(total = sum(value)) %>%
pull(total)))
# x y sum
# <chr> <chr> <dbl>
# 1 1998 africa 0
# 2 1998 asia 21759
# 3 1998 europe 39700
# 4 1998 latinamerica 10847
# 5 1998 northamerica 21347
# 6 1998 oceania 0
# 7 2000 africa 0
# 8 2000 asia 20059
# 9 2000 europe 35981
#10 2000 latinamerica 8672
#11 2000 northamerica 17324
#12 2000 oceania 0
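The tidyverse result above is in long format, while the question asks for a region-by-year matrix. A possible way to reshape it in base R, sketched on a hypothetical data frame `long` that only mimics the column names x, y and sum shown above:

```r
# Hypothetical long-format result with the same column names as above
long <- data.frame(x   = c("1998", "1998", "2000", "2000"),
                   y   = c("asia", "europe", "asia", "europe"),
                   sum = c(30, 30, 50, 40))

# xtabs() sums the left-hand side over each (region, year) combination
wide <- xtabs(sum ~ y + x, data = long)
wide
```

Since each (x, y) pair occurs only once in the long table, xtabs() simply rearranges the values into rows (regions) and columns (years).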
You have stored the numbers as factors in test.df, so we need to convert them to actual numbers. Run the following before applying the methods above.
test.df[-1] <- lapply(test.df[-1], function(x) as.numeric(as.character(x)))
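The detour through as.character() matters because calling as.numeric() directly on a factor returns the internal level codes, not the labels. A small illustration on a made-up factor (not taken from the question's data):

```r
# Hypothetical toy factor whose labels look like numbers
f <- factor(c("1998", "2000", "1998"))

codes  <- as.numeric(f)                # level codes: 1 2 1
labels <- as.numeric(as.character(f))  # labels as numbers: 1998 2000 1998

codes
labels
```

As far as I know, from R 4.0 onward as.data.frame() no longer turns strings into factors by default, so the columns would be character rather than factor there. The same conversion line works in both cases.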
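We can do this in tidyverse. Convert the named list to a two-column dataset (enframe or stack), then do a full_join with 'test.df' after filtering it to the 'year' values contained in 'year.list', group by 'name' and 'year', get the sum of 'value', and spread it to 'wide' format.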
library(tidyverse)
enframe(country.list.continent, value = "origin") %>%
unnest(origin) %>%
full_join(test.df %>%
filter(year %in% year.list)) %>%
group_by(name, year) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
spread(year, value, fill = 0) %>%
select(-4) # drop the <NA> year column produced by countries absent from test.df
# A tibble: 6 x 3
# Groups: name [6]
# name `1998` `2000`
# <chr> <dbl> <dbl>
#1 africa 0 0
#2 asia 33038 18485
#3 europe 36658 35874
#4 latinamerica 14323 14808
#5 northamerica 15697 27405
#6 oceania 0 0
Or in base R, this can be done by stacking the list into a two-column data.frame, merging it with 'test.df' after subsetting, and creating a table with xtabs.
xtabs(value ~ ind + year, merge(stack(country.list.continent),
subset(test.df, year %in% year.list), by.x = "values", by.y = "origin"))
# year
#ind 1998 2000
# asia 33038 18485
# africa 0 0
# europe 36658 35874
# oceania 0 0
# northamerica 15697 27405
# latinamerica 14323 14808
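To see what stack() and merge() each contribute without installing ISOcodes, here is a small sketch on made-up data. The `regions` list and the `toy` data frame are hypothetical stand-ins for country.list.continent and test.df.

```r
# Hypothetical stand-ins for country.list.continent and test.df
regions <- list(asia = c("CHN", "JPN"), europe = c("DEU", "GBR"))
toy <- data.frame(origin = c("CHN", "JPN", "DEU", "GBR", "CHN"),
                  year   = c(1998, 1998, 1998, 2000, 2000),
                  value  = c(10, 20, 30, 40, 50))

# stack() turns the named list into two columns:
# values (the country code) and ind (the region it came from)
lookup <- stack(regions)

# merge() attaches a region to every row of toy;
# xtabs() then sums value for each region/year pair
tab <- xtabs(value ~ ind + year,
             merge(lookup, toy, by.x = "values", by.y = "origin"))
tab
```

Because ind is a factor whose levels follow the order of the list, regions with no matching rows would still appear in the table with a 0, which is why africa and oceania show up in the output above.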
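Data: here test.df is built with data.frame() instead of as.data.frame(cbind(...)), so that year and value stay numeric (run this before the rm() call in the question).
test.df <- data.frame(origin, year, value)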