Can I convert a directory path into something that can be fed into a Python HDF5 data table?


I'd like to know how to convert a string or path into something that can be fed into an HDF5 table. For example, my PyTorch dataloader returns a numpy image array, a label, and the image's path, where the path looks like this:

('mults/train/0/5678.ndpi/40x/40x-236247-16634-80384-8704.png',)

I basically want to feed it into an HDF5 table like this:

hdf5_file = h5py.File(path, mode='w')
hdf5_file.create_dataset(str(phase) + '_img_paths', (len(dataloaders_dict[phase]),))
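# (note: with no dtype argument, create_dataset defaults to float32, which cannot store strings)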

I'm not sure whether what I'm trying to do is possible. Maybe feeding data like this into a table is the wrong approach.

I tried this:

hdf5_file.create_dataset(str(phase) + '_img_paths', (len(dataloaders_dict[phase]),),dtype="S10")

But I get this error:

 hdf5_file[str(phase) + '_img_paths'][i] = str(paths40x)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/anaconda3/lib/python3.6/site-packages/h5py/_hl/dataset.py", line 708, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
  File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U64')
1 Answer

You can create a standard dataset in h5py or PyTables and use an arbitrarily large string size. This is the simplest approach, but it carries the risk that your arbitrarily large string still isn't large enough. :)
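As a minimal sketch of this first option (the file name, dataset name, and the 'S64' size below are my own illustrative choices): a fixed-width byte-string dataset requires you to encode each Python string to bytes yourself. This is also the cause of the TypeError above: h5py will not implicitly convert a numpy unicode string such as '<U64' to a fixed-width byte type like 'S10'.

import h5py

paths = ['mults/train/0/5678.ndpi/40x/40x-236247-16634-80384-8704.png']

with h5py.File('fixed_width_demo.h5', mode='w') as h5f:
    # 'S64' holds up to 64 bytes per entry; longer values are silently truncated
    ds = h5f.create_dataset('train_img_paths', (len(paths),), dtype='S64')
    for i, p in enumerate(paths):
        ds[i] = p.encode('utf-8')  # must encode: h5py won't convert str to 'S64' for you

with h5py.File('fixed_width_demo.h5', mode='r') as h5f:
    print(h5f['train_img_paths'][0].decode('utf-8'))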

Alternatively, you can create a variable-length dataset. PyTables calls this dataset type a VLArray, and it uses the class VLStringAtom(). h5py uses a standard dataset, but with a dtype of special_dtype(vlen=str) (note: if you are using h5py 2.10, you can use string_dtype() instead).

I created an example that shows how to do this for both PyTables and h5py. It is built around the process you referenced in your comment. I didn't copy all of that code, just what is needed to retrieve the filenames and shuffle them. Also, the Kaggle dataset I found has a different directory structure, so I modified cat_dog_train_path to match.
from random import shuffle
import glob

shuffle_data = True  # shuffle the addresses before saving
cat_dog_train_path = r'.\PetImages\*\*.jpg'

# read addresses and labels from the 'train' folder
addrs = glob.glob(cat_dog_train_path, recursive=True)
print(len(addrs))
labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog

# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)

# Divide the data into 10% train only, no validation or test
train_addrs = addrs[0:int(0.1*len(addrs))]
train_labels = labels[0:int(0.1*len(labels))]

print('Check glob list data:')
print(train_addrs[0])
print(train_addrs[-1])

import tables as tb

# Create an HDF5 file with PyTables and create a VLArray of variable-length strings
hdf5_path = 'PetImages_data_1.h5'  # filename to save the hdf5 file
with tb.open_file(hdf5_path, mode='w') as h5f:
    train_files_ds = h5f.create_vlarray('/', 'train_files', atom=tb.VLStringAtom())
    # loop over train addresses
    for i in range(len(train_addrs)):
        # print progress every 500 images
        if i % 500 == 0 and i > 1:
            print('Train data: {}/{}'.format(i, len(train_addrs)))
        train_files_ds.append(train_addrs[i].encode('utf-8'))

with tb.open_file(hdf5_path, mode='r') as h5f:
    train_files_ds = h5f.root.train_files
    print('Check PyTables data:')
    print(train_files_ds[0].decode('utf-8'))
    print(train_files_ds[-1].decode('utf-8'))

import h5py

# Create an HDF5 file with h5py and create a variable-length string dataset
hdf5_path = 'PetImages_data_2.h5'  # filename to save the hdf5 file
with h5py.File(hdf5_path, mode='w') as h5f:
    dt = h5py.special_dtype(vlen=str)  # can use string_dtype() with h5py 2.10
    train_files_ds = h5f.create_dataset('/train_files', (len(train_addrs),), dtype=dt)
    # loop over train addresses
    for i in range(len(train_addrs)):
        # print progress every 500 images
        if i % 500 == 0 and i > 1:
            print('Train data: {}/{}'.format(i, len(train_addrs)))
        train_files_ds[i] = train_addrs[i]

with h5py.File(hdf5_path, mode='r') as h5f:
    train_files_ds = h5f['train_files']
    print('Check h5py data:')
    print(train_files_ds[0])
    print(train_files_ds[-1])
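For completeness, here is a minimal sketch of the same variable-length idea using the string_dtype() API mentioned above (requires h5py 2.10 or later; the file and dataset names are my own), applied to a path shaped like the one in the question:

import h5py

paths = ['mults/train/0/5678.ndpi/40x/40x-236247-16634-80384-8704.png']

with h5py.File('vlen_demo.h5', mode='w') as h5f:
    dt = h5py.string_dtype(encoding='utf-8')  # variable-length UTF-8 strings
    ds = h5f.create_dataset('train_img_paths', (len(paths),), dtype=dt)
    ds[:] = paths  # plain Python strings can be written directly

with h5py.File('vlen_demo.h5', mode='r') as h5f:
    ds = h5f['train_img_paths']
    # h5py 2.x reads these back as str; h5py 3.x returns bytes (use ds.asstr() for str)
    print(ds[0])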