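I have multiple raster layers (one per hour, covering the contiguous US) and one polygon vector layer. I use terra::extract to get the mean value of each polygon from the raster layers. This works well for a limited number of polygons and raster layers, but I want to scale it up to several months of data with 1000+ polygons, so I am wondering whether there is a more efficient way to do this.
The code snippet below shows a minimal reproducible example along with the desired output. I have thought of two possible routes, but have not implemented either so far:
1. I considered parallel processing, but I am not sure how to implement it without running into serialization issues.
2. Instead of calling extract(..., fun = mean) on every raster, I considered doing something like extract(mrast[[1]], mpoly, cells = TRUE, exact = TRUE) once, to get the cell numbers and the exact coverage fraction of each cell for each polygon. From there, I could probably subset those cells for each hour (all raster layers share the same cells), multiply their values by their exact fractions, and compute the mean that way.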
#setwd("C:/temp")
library(terra)
library(R.utils)
murl <- "https://mtarchive.geol.iastate.edu/2023/06/27/mrms/ncep/MultiSensor_QPE_01H_Pass2/"
mfiles <- c("MultiSensor_QPE_01H_Pass2_00.00_20230627-070000.grib2.gz",
"MultiSensor_QPE_01H_Pass2_00.00_20230627-080000.grib2.gz")
## download, unzip, read the raster files to R
for (f in mfiles) download.file(paste0(murl, f), f, mode = "wb")
for (f in mfiles) gunzip(f, sub("\\.gz$", "", f), remove = TRUE, overwrite = TRUE)
mrast <- lapply(list.files(pattern = "\\.grib2$"), rast)
## name each single-layer raster after its timestamp
mrast <- lapply(mrast, \(r) { names(r) <- time(r); r })
## create the polygon layer (a SpatVector)
mpoly <- vect(dptply, "polygon") ## get dptply at the bottom of the post
crs(mpoly) <- "EPSG:3857"
mpoly <- project(mpoly, "+proj=longlat +datum=WGS84")
## extracting the averages (this part needs optimization)
startTime <- Sys.time()
mavg <- lapply(mrast, \(r) {
  op <- extract(r, mpoly, weights = TRUE, fun = mean, na.rm = TRUE)
  message(time(r))
  op
})
endTime <- Sys.time()
print(endTime - startTime)
#> 2023-06-27 07:00:00
#> 2023-06-27 08:00:00
#> Time difference of 10.87896 secs
## just for demonstration
cbind(mavg[[1]][1], do.call(cbind, lapply(mavg, "[", 2)))
#> ID 2023-06-27 07:00:00 2023-06-27 08:00:00
#> 1 1 0.000000 10.405161
#> 2 2 0.000000 8.961424
#> 3 3 0.000000 6.300000
#> 4 4 0.000000 5.902914
#> 5 5 0.000000 5.462630
#> 6 6 0.000000 4.100000
#> 7 7 0.000000 3.566511
#> 8 8 0.000000 13.275161
#> 9 9 2.600000 11.200000
#> 10 10 4.633752 16.301403
dptply <- structure(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8,
8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, -7945530.9815, -7938378.0621, -7938605.1389,
-7945190.3663, -7945530.9815, -7938264.5237, -7936674.9861, -7930543.9123,
-7932360.5267, -7936674.9861, -7938264.5237, -7938264.5237, -7934698.4679,
-7933886.647, -7933878.7652, -7934666.9409, -7934698.4679, -7933735.8469,
-7932622.0601, -7932634.9736, -7933745.532, -7933735.8469, -7934646.4483,
-7933925.1099, -7933904.9326, -7934631.3153, -7934646.4483, -7933551.8299,
-7932850.6689, -7932865.8018, -7933496.3424, -7933551.8299, -7931645.765,
-7931670.3955, -7932384.6797, -7932470.8864, -7931645.765, -7940928.6868,
-7941271.2217, -7941328.3108, -7948978.2562, -7949434.9693, -7940928.6868,
-7998760.8958, -7999220.4225, -7998966.0077, -7998636.765, -7998760.8958,
-8001395.2248, -8001358.6877, -8003002.8551, -8003149.0033, -8001395.2248,
5262480.6428, 5262635.0714, 5251676.934, 5251831.1854, 5262480.6428,
5262789.5025, 5268041.6687, 5261862.9539, 5252293.955, 5248438.2361,
5248900.8391, 5262789.5025, 5269309.6397, 5269309.6397, 5269052.1707,
5269041.443, 5269309.6397, 5269446.29, 5269441.8957, 5267934.7856,
5267925.9984, 5269446.29, 5271218.9923, 5271198.3903, 5270580.3524,
5270607.8199, 5271218.9923, 5272036.2399, 5272049.9757, 5271507.4245,
5271239.5943, 5272036.2399, 5272266.345, 5270589.7139, 5270556.1843,
5272216.0417, 5272266.345, 5246883.428, 5237972.8, 5237895.3531,
5238282.5936, 5247503.6111, 5246883.428, 5227512.1703, 5227597.3282,
5228266.5923, 5228286.8738, 5227512.1703, 5232252.738, 5229875.3018,
5229528.6423, 5232104.1307, 5232252.738, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0),
dim = c(53L, 5L),
dimnames = list(NULL, c("geom", "part", "x", "y", "hole")))
You have pretty much answered this yourself already; here is how I would approach it using your option 2.
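First, let's find out which cells (the cell column) fall within which polygons (ID), and what fraction of each cell is covered (fraction). This takes a while, but it only has to be done once.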
cells <- terra::extract(mrast[[1]], mpoly, cells=TRUE, exact=TRUE)
head(cells)
# ID 2023-06-27 08:00:00 cell fraction
# 1 1 9.8 8629863 0.2353906
# 2 1 9.6 8629864 0.4162643
# 3 1 9.8 8629865 0.4321375
# 4 1 10.2 8629866 0.4480107
# 5 1 10.3 8629867 0.4638838
# 6 1 10.1 8629868 0.4797569
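With this information we can extract the data for the polygons without touching their geometries again, simply by using the cell numbers from above. Here is a function that takes a raster (rast) and the data from above (cell_data) as arguments and computes the weighted (exactly!) mean for each polygon: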
polygon_means <- function(rast, cell_data) {
  # values of the raster at the precomputed cell numbers
  ex <- terra::extract(rast, cell_data$cell)
  # fraction-weighted mean per polygon: sum(value * fraction) / sum(fraction)
  tapply(ex[, 1] * cell_data$fraction, cell_data$ID, sum) /
    tapply(cell_data$fraction, cell_data$ID, sum)
}
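To make the arithmetic inside the function concrete, here is a tiny worked example with made-up numbers and no raster involved. The data frame toy and its values are purely illustrative; the value column stands in for what extract(rast, cell_data$cell) would return. The second half is an assumption on my part about how you might reproduce the na.rm = TRUE behaviour of the original call, since the function above lets a single NA cell turn the whole polygon into NA.

```r
# Made-up cell table: two polygons with three and two cells (illustrative only)
toy <- data.frame(
  ID       = c(1, 1, 1, 2, 2),
  value    = c(10, 20, 30, 4, 8),
  fraction = c(0.5, 1.0, 0.5, 0.25, 0.75)
)

# Same arithmetic as in polygon_means():
# sum(value * fraction) / sum(fraction), computed per polygon ID
wmeans <- tapply(toy$value * toy$fraction, toy$ID, sum) /
  tapply(toy$fraction, toy$ID, sum)

# Cross-check against base R's weighted.mean()
check <- sapply(split(toy, toy$ID), \(d) weighted.mean(d$value, d$fraction))

# Assumed NA handling: drop NA cells from BOTH the numerator and the
# denominator, so the remaining fractions are renormalised
toy_na <- toy
toy_na$value[2] <- NA
ok <- !is.na(toy_na$value)
wmeans_narm <- tapply((toy_na$value * toy_na$fraction)[ok], toy_na$ID[ok], sum) /
  tapply(toy_na$fraction[ok], toy_na$ID[ok], sum)
```

Here polygon 1 comes out as (5 + 20 + 15) / 2 = 20 and polygon 2 as (1 + 6) / 1 = 7.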
Now we can run the computation:
names(mrast) <- sapply(mrast, names)
mavg <- sapply(mrast, polygon_means, cells)
mavg
# 2023-06-27 07:00:00 2023-06-27 08:00:00
# 1 0.000000 10.405166
# 2 0.000000 8.959888
# 3 0.000000 6.300000
# 4 0.000000 5.902433
# 5 0.000000 5.462595
# 6 0.000000 4.100000
# 7 0.000000 3.566484
# 8 0.000000 13.275182
# 9 2.600000 11.200000
# 10 4.633801 16.301409
On my machine this takes about 3 seconds, roughly 10 times faster than the original code (yes, that took 30 seconds on my laptop).
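For several months of hourly layers, the per-raster tapply calls could probably be replaced by one matrix operation. The sketch below is not something I have timed on the real data. It assumes you already have a cells-by-hours matrix vals, for example from as.matrix(terra::extract(<multi-layer SpatRaster>, cells$cell)); here it is filled with made-up numbers, and the names vals, h07 and h08 are hypothetical.

```r
# Made-up inputs: ID and fraction as in the `cells` table, and a
# cells x hours matrix of raster values (assumed to come from extracting
# a multi-layer raster by cell number)
ID       <- c(1, 1, 1, 2, 2)
fraction <- c(0.5, 1.0, 0.5, 0.25, 0.75)
vals <- cbind(h07 = c(10, 20, 30, 4, 8),
              h08 = c(0, 0, 0, 2, 2))

# rowsum() sums matrix rows within each group, for all columns at once
num <- rowsum(vals * fraction, ID)  # weighted sums, polygons x hours
den <- rowsum(fraction, ID)         # total fraction per polygon (one column)

# Divide each polygon's row by its total fraction
wmeans_all <- num / den[, 1]
```

The result has one row per polygon and one column per hour, which is the same layout as the mavg matrix above.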
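As for your option 1, parallelization: how useful it is probably depends on your computing environment. Since the SpatRaster data is not actually held in RAM and has to be read from disk on every access, on a single-disk machine that is likely the main bottleneck, and distributing the computation across multiple processor cores will not help with it.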
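If you still want to try it, the serialization issue can be sidestepped by never sending a SpatRaster to the workers: as far as I know, these objects wrap pointers to C++ data and do not survive being passed to another R process as they are. The sketch below sends only file paths and the plain cells data frame, and lets each worker open its own raster. It is untested on your data; worker_means and parallel_polygon_means are hypothetical names, and it assumes one single-layer file per hour, as in your example.

```r
library(parallel)

# Runs on a worker: opens the raster from its file path and computes the
# fraction-weighted mean per polygon (same arithmetic as polygon_means())
worker_means <- function(f, cd) {
  r  <- terra::rast(f)
  ex <- terra::extract(r, cd$cell)
  tapply(ex[, 1] * cd$fraction, cd$ID, sum) / tapply(cd$fraction, cd$ID, sum)
}

# Hypothetical wrapper: `files` are raster paths, `cell_data` is the table
# returned by extract(..., cells = TRUE, exact = TRUE)
parallel_polygon_means <- function(files, cell_data, n_cores = 2L) {
  cl <- makeCluster(n_cores)
  on.exit(stopCluster(cl), add = TRUE)
  # Only a character path and a data frame cross the process boundary
  res <- parLapply(cl, files, worker_means, cd = cell_data)
  names(res) <- basename(files)
  do.call(cbind, res)
}

# Possible usage, assuming the files and `cells` from above exist:
# mavg <- parallel_polygon_means(list.files(pattern = "\\.grib2$"), cells)
```

Whether this beats the serial version depends on the disk bottleneck described above, so it is worth timing on a small subset first.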