Question: I have a large Dask array representing a tensor, and I want to rescale it using the zoom function from the SciPy package. After rescaling, I want to save the resulting Dask array to disk with dask.array.to_zarr or dask.array.to_hdf5. Below is a simple example for better understanding.
Example: suppose I have a Dask array representing a 2D matrix, built with the following code:
import numpy as np
import dask.array as da
from scipy.ndimage import zoom

data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))
data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16)
data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')
However, I get this error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[467], line 9
      4 disp_ds = da.random.randint(0, 100, (100,100), chunks=(20,20))
      5 disp_org = da.map_blocks(lambda x: zoom(x,2), disp_ds, dtype = np.uint16)
----> 9 disp_org.to_hdf5('myfile.hdf5', '/up_sampled')

File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:1811, in Array.to_hdf5(self, filename, datapath, **kwargs)
   1797 def to_hdf5(self, filename, datapath, **kwargs):
   1798     """Store array in HDF5 file
   1799
   1800     >>> x.to_hdf5('myfile.hdf5', '/x')  # doctest: +SKIP
   (...)
   1809     h5py.File.create_dataset
   1810     """
-> 1811 return to_hdf5(filename, datapath, self, **kwargs)

File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:5387, in to_hdf5(filename, chunks, *args, **kwargs)
   5376 with h5py.File(filename, mode="a") as f:
   5377     dsets = [
   5378         f.require_dataset(
   5379             dp,
   (...)
   5385         for dp, x in data.items()
   5386     ]
...
    267 # All dimensions from target_shape should either have been popped
    268 # to match the selection shape, or be 1.
    269 raise TypeError("Can't broadcast %s -> %s" % (source_shape, self.array_shape))

TypeError: Can't broadcast (40, 40) -> (20, 20)
From the example above, I understand that the zoom function changes the chunk size, but I can't find a way to solve this in an optimized way.
I'd appreciate any help or suggestions on how to perform the rescale-and-save operation efficiently with Dask. Thanks!
You just need to tell Dask the final shape of the blocks in the map_blocks call:
data_upsampled = da.map_blocks(lambda x: zoom(x,2), data, dtype = np.uint16, chunks=(40,40))
Full working code:
import dask.array as da
import numpy as np
import h5py
from scipy.ndimage import zoom
data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))
# Declare the output chunk shape so it matches what zoom(x, 2) actually returns:
# each (20, 20) input block becomes a (40, 40) output block.
data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))
data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')
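As a quick sanity check before writing to disk (a minimal sketch, assuming dask, numpy, and scipy are installed), you can confirm that the declared chunks now line up with what zoom actually produces:

```python
import numpy as np
import dask.array as da
from scipy.ndimage import zoom

data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))

# Without chunks=, Dask assumes the output blocks keep the input shape (20, 20),
# which conflicts with the (40, 40) blocks that zoom(x, 2) really returns.
upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))

print(upsampled.shape)   # (200, 200)
print(upsampled.chunks)  # ((40, 40, 40, 40, 40), (40, 40, 40, 40, 40))
```

With the metadata consistent, the same array can also be written with data_upsampled.to_zarr('myfile.zarr'), the other output format mentioned in the question.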