I have a folder of .tif files that I want to merge into a datacube. By "datacube" I mean a netcdf or zarr file; the goal is that when I open this datacube in Python, I can access a 3D array representing the stack of tif files.
Context: the files never overlap spatially, but stitched together they cover a large area. They are named by date, and some tifs share the same date (in which case they are merged with xr.concat(dim='x')).
Goal: I want to merge all these tifs into a single dataset and save it. I want to do this with dask so that I don't run into memory problems.
Question: how can I use dask to concatenate the datasets and save them as zarr or netcdf without overloading memory? Code example:
import xarray as xr
import pandas as pd
from collections import Counter
# List of dates, 3 files share the same date
list_dates = [
pd.Timestamp('2014-10-08 00:00:00'),
pd.Timestamp('2014-10-13 00:00:00'),
pd.Timestamp('2014-10-15 00:00:00'),
pd.Timestamp('2014-10-15 00:00:00'),
pd.Timestamp('2014-10-15 00:00:00')
]
# In the list of files, the last 3 files have the same date
list_files = [
'2014-10-08_0.tif',
'2014-10-13_0.tif',
'2014-10-15_0.tif',
'2014-10-15_1.tif',
'2014-10-15_2.tif'
]
# Count how many times each date appears
date_counts = Counter(list_dates)
# Initialize the list hosting the DataArrays
data_arrays = []
# Initialize the counter
i = 0
# Loop that appends the DataArrays and concatenates them if the files have the same date
while i < len(list_files):
    # If the date is unique, append the DataArray to the list
    if date_counts[list_dates[i]] == 1:
        # Open the file as a DataArray (index [0] because it is opened as a 3D array even though it is 2D)
        data_array = xr.open_dataarray(list_files[i])[0]
        # Add the timestamp as a dimension
        data_array = data_array.expand_dims({'time': [list_dates[i]]})
        data_arrays.append(data_array)
        i += 1  # Update the iterator
    # If the date has multiple occurrences, concatenate spatially the files sharing the same date
    else:
        # Open all the files with the same date as DataArrays and concatenate them along 'x'
        ds_temp = xr.concat(
            [xr.open_dataarray(list_files[j])[0] for j in range(i, i + date_counts[list_dates[i]])],
            dim='x'
        )
        # Add a time dimension
        ds_temp = ds_temp.expand_dims({'time': [list_dates[i]]})
        data_arrays.append(ds_temp)
        i += date_counts[list_dates[i]]  # Advance past all the files used in the concatenation
# Concatenate all the DataArrays in the final list along the 'time' dimension
final_dataset = xr.concat(data_arrays, dim='time')
# Print the final dataset
print(final_dataset)
I don't know of a way to make xarray dynamically resize, or to open a new empty array that you then fill in - which is effectively what you are doing here. I doubt it is possible at all.
However, here are two ways to achieve what you are after.
kerchunk should allow you to create a global dataset view over all of your files, where each file is either a single chunk, or kerchunk resolves the file's internal chunking scheme (this works well with COGs; see also tifffile). In this case you don't copy any data at all, but you can still decide to convert the result to zarr as a separate step. This route is somewhat more involved to figure out.
zarr directly allows you to create arrays of any shape and fill them in. In your current workflow, you would need to determine the expected shape of the data first:
g = zarr.open_group("mydir", mode="w")
time = g.create_dataset("time", data=list_dates, dtype="M8[ms]")
x = g.create_dataset("x", data=range-of-x)
y = g.create_dataset("y", data=range-of-y)
data = g.create_dataset("data", dtype="float64", shape=(5, <y size>, <concatenated x size>), chunks=(1, <y size>, <single x size>))
To make all of this loadable by xarray, you need to add "_ARRAY_DIMENSIONS" to the .attrs of each array:
time.attrs["_ARRAY_DIMENSIONS"] = ["time"]
x.attrs["_ARRAY_DIMENSIONS"] = ["x"]
y.attrs["_ARRAY_DIMENSIONS"] = ["y"]
data.attrs["_ARRAY_DIMENSIONS"] = ["time", "y", "x"]
Then fill in the data:
for file ...:
    chunk = load(...)  # with xarray, probably
    data[timepoint, yslice, xslice] = chunk