How to rescale a dask array using dask.map_blocks and scipy's zoom function, and save the result as zarr or hdf5?

Problem description

Problem: I have a large Dask array representing a tensor, and I want to rescale it with the zoom function from the SciPy package. After rescaling, I want to save the resulting Dask array to disk using dask.array.to_zarr or dask.array.to_hdf5. Below is a simple example to illustrate.

Example: Suppose I have a Dask array data representing a 2D matrix, created as follows:

   import dask.array as da
   import numpy as np
   from scipy.ndimage import zoom

   data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))

   data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16)

   data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')

However, I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[467], line 9
      4 disp_ds = da.random.randint(0, 100, (100,100), chunks=(20,20))
      5 disp_org = da.map_blocks(lambda x: zoom(x,2), disp_ds, dtype = np.uint16)
----> 9 disp_org.to_hdf5('myfile.hdf5', '/up_sampled')

File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:1811, in Array.to_hdf5(self, filename, datapath, **kwargs)
   1797 def to_hdf5(self, filename, datapath, **kwargs):
   1798     """Store array in HDF5 file
   1799
   1800     >>> x.to_hdf5('myfile.hdf5', '/x')  # doctest: +SKIP
   (...)
   1809     h5py.File.create_dataset
   1810     """
-> 1811     return to_hdf5(filename, datapath, self, **kwargs)

File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:5387, in to_hdf5(filename, chunks, *args, **kwargs)
   5376 with h5py.File(filename, mode="a") as f:
   5377     dsets = [
   5378         f.require_dataset(
   5379             dp,
   (...)
   5385         for dp, x in data.items()
   5386     ]
...
    267     # All dimensions from target_shape should either have been popped
    268     # to match the selection shape, or be 1.
    269     raise TypeError("Can't broadcast %s -> %s" % (source_shape, self.array_shape))  # array shape

TypeError: Can't broadcast (40, 40) -> (20, 20)

From the example above, I understand that the zoom function changes the block size, but I can't find a way to handle this in an optimized way.

I would appreciate any help or suggestions on how to perform the rescaling and saving efficiently with Dask. Thanks!

dask
1 Answer

You just need to tell Dask the final chunk shape in the map_blocks call:

data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))
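
If the zoom factor is not an integer, or the chunks are not uniform, the matching output chunks can be computed from data.chunks rather than hard-coded. A minimal sketch, assuming SciPy's convention of rounding each output axis to round(size * factor); the factor 1.5 and the name data_zoomed are just for illustration:

zoom_factor = 1.5

# scipy.ndimage.zoom emits round(size * factor) samples per axis, so mirror
# that rounding when declaring the per-block output sizes to Dask.
new_chunks = tuple(
    tuple(int(round(c * zoom_factor)) for c in dim_chunks)
    for dim_chunks in data.chunks
)

data_zoomed = da.map_blocks(
    lambda x: zoom(x, zoom_factor), data, dtype=data.dtype, chunks=new_chunks
)

Keep in mind that map_blocks interpolates each block independently, so with interpolation orders above 0 the values near block edges are computed without the neighboring blocks; dask.array.map_overlap may help there if that matters.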

Full working code:

import dask.array as da
import numpy as np
import h5py
from scipy.ndimage import zoom

data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))

# chunks=(40, 40) declares that zoom(x, 2) doubles each (20, 20) block,
# so the graph metadata matches the shapes the blocks actually have.
data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))

data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')
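
The same rescaled array can also be written to zarr, as the question's title asks. A minimal sketch, assuming the zarr package is installed; the store path 'myfile.zarr' and the component name 'up_sampled' are placeholders:

import dask.array as da
import numpy as np
from scipy.ndimage import zoom

data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))
data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))

# Each Dask block becomes one chunk in the zarr store on disk.
data_upsampled.to_zarr('myfile.zarr', component='up_sampled', overwrite=True)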