Optimizing the extraction of raster values for each polygon


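I have multiple raster layers (one per hour, covering the contiguous United States) and a polygon vector layer. I use terra::extract to extract the mean value for each polygon from the raster layers. This works well for a limited number of polygons and raster layers, but since I want to scale it up to several months of data with more than 1,000 polygons, I would like to know whether there is a more efficient way to approach this.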

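The snippet below shows a minimal reproducible example along with the desired output. I have thought of two possible routes, but have not implemented either so far:

  1. I considered parallel processing, but I am not sure how to do that without running into serialization problems.

  2. Instead of calling extract(..., fun = mean) on every raster, I considered something like extract(mrast[[1]], mpoly, cells = TRUE, exact = TRUE) to get the cell numbers and the exact coverage fraction for each polygon. From there I could subset those cells for every hour (all raster layers share the same cells), multiply the cell values by their exact fractions, and compute the means that way.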

#setwd("C:/temp")
library(terra)
library(R.utils)

murl <- "https://mtarchive.geol.iastate.edu/2023/06/27/mrms/ncep/MultiSensor_QPE_01H_Pass2/"
mfiles <- c("MultiSensor_QPE_01H_Pass2_00.00_20230627-070000.grib2.gz", 
           "MultiSensor_QPE_01H_Pass2_00.00_20230627-080000.grib2.gz")

## download (binary mode), unzip, and read the raster files into R
for (f in mfiles) download.file(paste0(murl, f), f, mode = "wb")
for (f in mfiles) gunzip(f, sub("\\.gz$", "", f), remove = TRUE, overwrite = TRUE)

mrast <- lapply(list.files(pattern = "\\.grib2$"), rast)
## name each layer after its timestamp
mrast <- lapply(mrast, \(r) { names(r) <- as.character(time(r)); r })

## creating a polygon shapefile
mpoly <- vect(dptply, "polygon") ## get dptply at the bottom of the post
crs(mpoly) <- "EPSG:3857"
mpoly <- project(mpoly, "+proj=longlat +datum=WGS84")

## extracting the averages (this part needs optimization)
startTime = Sys.time()
mavg <- lapply(mrast, \(r) {op <- extract(r, mpoly, 
                                         weights=TRUE, fun=mean, na.rm = TRUE)
                            message(time(r))
                            return(op)})
endTime = Sys.time()
print(endTime - startTime)

#> 2023-06-27 07:00:00
#> 2023-06-27 08:00:00
#> Time difference of 10.87896 secs

## just for demonstration
cbind(mavg[[1]][1], do.call(cbind, lapply(mavg, "[", 2)))
#>    ID 2023-06-27 07:00:00 2023-06-27 08:00:00
#> 1   1            0.000000           10.405161
#> 2   2            0.000000            8.961424
#> 3   3            0.000000            6.300000
#> 4   4            0.000000            5.902914
#> 5   5            0.000000            5.462630
#> 6   6            0.000000            4.100000
#> 7   7            0.000000            3.566511
#> 8   8            0.000000           13.275161
#> 9   9            2.600000           11.200000
#> 10 10            4.633752           16.301403

Data:

dptply <- structure(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 
                      4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 
                      8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 1, 1, 1, 1, 
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                      1, 1, 1, 1, 1, 1, 1, -7945530.9815, -7938378.0621, -7938605.1389, 
                      -7945190.3663, -7945530.9815, -7938264.5237, -7936674.9861, -7930543.9123, 
                      -7932360.5267, -7936674.9861, -7938264.5237, -7938264.5237, -7934698.4679, 
                      -7933886.647, -7933878.7652, -7934666.9409, -7934698.4679, -7933735.8469, 
                      -7932622.0601, -7932634.9736, -7933745.532, -7933735.8469, -7934646.4483, 
                      -7933925.1099, -7933904.9326, -7934631.3153, -7934646.4483, -7933551.8299, 
                      -7932850.6689, -7932865.8018, -7933496.3424, -7933551.8299, -7931645.765, 
                      -7931670.3955, -7932384.6797, -7932470.8864, -7931645.765, -7940928.6868, 
                      -7941271.2217, -7941328.3108, -7948978.2562, -7949434.9693, -7940928.6868, 
                      -7998760.8958, -7999220.4225, -7998966.0077, -7998636.765, -7998760.8958, 
                      -8001395.2248, -8001358.6877, -8003002.8551, -8003149.0033, -8001395.2248, 
                      5262480.6428, 5262635.0714, 5251676.934, 5251831.1854, 5262480.6428, 
                      5262789.5025, 5268041.6687, 5261862.9539, 5252293.955, 5248438.2361, 
                      5248900.8391, 5262789.5025, 5269309.6397, 5269309.6397, 5269052.1707, 
                      5269041.443, 5269309.6397, 5269446.29, 5269441.8957, 5267934.7856, 
                      5267925.9984, 5269446.29, 5271218.9923, 5271198.3903, 5270580.3524, 
                      5270607.8199, 5271218.9923, 5272036.2399, 5272049.9757, 5271507.4245, 
                      5271239.5943, 5272036.2399, 5272266.345, 5270589.7139, 5270556.1843, 
                      5272216.0417, 5272266.345, 5246883.428, 5237972.8, 5237895.3531, 
                      5238282.5936, 5247503.6111, 5246883.428, 5227512.1703, 5227597.3282, 
                      5228266.5923, 5228286.8738, 5227512.1703, 5232252.738, 5229875.3018, 
                      5229528.6423, 5232104.1307, 5232252.738, 0, 0, 0, 0, 0, 0, 0, 
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                      0, 0, 0, 0), 
                    dim = c(53L, 5L), 
                    dimnames = list(NULL, c("geom", "part", "x", "y", "hole")))
1 Answer

You have pretty much answered this yourself; here is how I would approach it using option 2.

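First, let's find out which cells (the cell column) fall in which polygons (ID) and what fraction of each cell is covered (fraction). This takes some time, but it only has to be done once.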

cells <- terra::extract(mrast[[1]], mpoly, cells=TRUE, exact=TRUE)
head(cells)
#   ID 2023-06-27 08:00:00    cell  fraction
# 1  1                 9.8 8629863 0.2353906
# 2  1                 9.6 8629864 0.4162643
# 3  1                 9.8 8629865 0.4321375
# 4  1                10.2 8629866 0.4480107
# 5  1                10.3 8629867 0.4638838
# 6  1                10.1 8629868 0.4797569

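With this information we can extract the data for the polygons without touching their geometry again, simply by using the cell numbers above. Here is a function that takes a raster (rast) and the data above (cell_data) as arguments and computes the weighted (exact, that is!) mean for each polygon: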

polygon_means <- function(rast, cell_data) {
  # values for cells
  ex <- terra::extract(rast, cell_data$cell)
  # weighted means by polygons
  tapply(ex[, 1] * cell_data$fraction, cell_data$ID, sum) / 
    tapply(cell_data$fraction, cell_data$ID, sum)
}
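
To make the arithmetic inside polygon_means concrete, here is a small base-R check on made-up numbers. The toy data frame and its values are purely illustrative and are not taken from the MRMS rasters; the point is only that the two tapply calls amount to a per-polygon weighted mean.

```r
# Toy numbers (hypothetical, not from the MRMS data): two polygons, two cells each
toy <- data.frame(ID       = c(1, 1, 2, 2),
                  value    = c(10, 20, 30, 40),
                  fraction = c(0.5, 1.0, 0.25, 0.75))

# Same arithmetic as polygon_means(): sum(value * fraction) / sum(fraction) per ID
toy_means <- tapply(toy$value * toy$fraction, toy$ID, sum) /
  tapply(toy$fraction, toy$ID, sum)

# Cross-check against stats::weighted.mean, one polygon at a time
check <- sapply(split(toy, toy$ID), \(d) weighted.mean(d$value, d$fraction))

print(toy_means)
print(check)
```

Note that this arithmetic does not skip missing values. If some cells can be NA, both the numerator and the denominator would have to leave those cells out to mimic na.rm = TRUE.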

Now we can run the computation:

names(mrast) <- sapply(mrast, names)
mavg <- sapply(mrast, polygon_means, cells)
mavg
#    2023-06-27 07:00:00 2023-06-27 08:00:00
# 1             0.000000           10.405166
# 2             0.000000            8.959888
# 3             0.000000            6.300000
# 4             0.000000            5.902433
# 5             0.000000            5.462595
# 6             0.000000            4.100000
# 7             0.000000            3.566484
# 8             0.000000           13.275182
# 9             2.600000           11.200000
# 10            4.633801           16.301409

On my machine this takes about 3 seconds, roughly 10 times faster than the original code (yes, that takes 30 seconds on my laptop).
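
A possible variation, sketched here on synthetic data rather than the MRMS files: because all hourly rasters share the same grid, they can be combined into one multi-layer SpatRaster so that a single extract call by cell number returns one column per hour, and rowsum then does the per-polygon sums for all hours at once. The names r1, r2, mpoly_demo, cells_demo and the layer names h07/h08 are made up for this illustration, and whether this is faster than the sapply loop has not been measured.

```r
library(terra)

# Synthetic stand-ins for two hourly rasters on the same 10 x 10 grid
r1 <- rast(nrows = 10, ncols = 10, xmin = 0, xmax = 10, ymin = 0, ymax = 10,
           vals = 1:100)
r2 <- r1 * 2

# Two rectangular demo polygons, built from a geometry matrix as in the question
geom <- cbind(geom = rep(1:2, each = 5), part = 1,
              x = c(1.5, 4.2, 4.2, 1.5, 1.5,  6, 9, 9, 6, 6),
              y = c(1.5, 1.5, 4.7, 4.7, 1.5,  6, 6, 9, 9, 6),
              hole = 0)
mpoly_demo <- vect(geom, "polygon", crs = crs(r1))

# One-off geometry step: cell numbers and coverage fractions per polygon
cells_demo <- extract(r1, mpoly_demo, cells = TRUE, exact = TRUE)

# Combine the layers and pull all hours in one call (one column per layer)
stack <- rast(list(r1, r2))
names(stack) <- c("h07", "h08")
vals <- as.matrix(extract(stack, cells_demo$cell))

# Weighted sums per polygon for every layer, divided by the summed weights
# (no NA handling here; missing cells would need to be dropped first)
w <- cells_demo$fraction
mavg_demo <- rowsum(vals * w, cells_demo$ID) / as.vector(rowsum(w, cells_demo$ID))
print(mavg_demo)
```

The result is a matrix with one row per polygon ID and one column per layer, the same shape as mavg above.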

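As for your option 1, parallelization, its usefulness probably depends on your computing environment. Because the SpatRaster data are not actually held in RAM and have to be read from disk on every access, disk I/O may well be the main bottleneck on a single-disk machine, and spreading the computation across several processor cores does not help with that.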
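
If you still want to experiment with option 1, one pattern that sidesteps the serialization problem is to send only file paths and the plain cell table to the workers, and to open each raster inside the worker. A SpatRaster holds a pointer to a C++ object, which does not survive being shipped to another R process, whereas character vectors and data frames do. The sketch below uses small synthetic GeoTIFFs in place of the GRIB2 files; file_means is a hypothetical helper, cell_data is a hand-made stand-in for cells, and the two-worker cluster is an arbitrary choice.

```r
library(terra)
library(parallel)

# Two small synthetic rasters written to disk stand in for the hourly files;
# raster i has the constant value i in every cell
files <- vapply(1:2, function(i) {
  r <- rast(nrows = 10, ncols = 10, xmin = 0, xmax = 10, ymin = 0, ymax = 10,
            vals = i)
  f <- tempfile(fileext = ".tif")
  writeRaster(r, f)
  f
}, character(1))

# Hand-made stand-in for the `cells` table (polygon ID, cell number, fraction)
cell_data <- data.frame(ID = c(1, 1, 2), cell = c(1, 2, 50),
                        fraction = c(0.5, 1, 0.25))

# Hypothetical helper: opens the file inside the worker, so no SpatRaster
# ever has to be serialized between processes
file_means <- function(f, cell_data) {
  v <- terra::extract(terra::rast(f), cell_data$cell)[, 1]
  tapply(v * cell_data$fraction, cell_data$ID, sum) /
    tapply(cell_data$fraction, cell_data$ID, sum)
}

cl <- makeCluster(2)
res <- parLapply(cl, files, file_means, cell_data = cell_data)
stopCluster(cl)
print(res)
```

Each list element holds the per-polygon means for one file. Whether this beats the sequential version depends on how quickly the disk can serve several readers at once, so it is worth timing on a handful of files before scaling up.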
