Multiple errors during HDF5 to CSV conversion


I have a huge h5 file and need to extract every dataset into a separate csv file. The schema is like /Genotypes/GroupN/SubGroupN/calls, with 'N' groups and 'N' subgroups. I created a sample h5 file with the same structure as the main file and tested the code, which works fine, but when I apply the code to the main h5 file it runs into various errors. Schema of the HDF5 file:

/Genotypes
    /genotype a
        /genotype a_1 #one subgroup for each genotype group
            /calls #data that I need to extract to csv file
            depth #data
    /genotype b
        /genotype b_1 #one subgroup for each genotype group
            /calls #data
            depth #data
    .
    .
    .
    /genotype n #1500 genotypes are listed as groups
        /genotype n_1
            /calls 
            depth

/Positions
    /allel #data 
    chromo #data
/Taxa 
    /genotype a
        /genotype a_1
    /genotype b
        /genotype b_1 #one subgroup for each genotype group
    .
    .
    .
    /genotype n #1500 genotypes are listed as groups
        /genotype n_1

/_Data-Types_
    Enum_Boolean
    String_VariableLength

Here is the code used to create the sample h5 file:

import h5py  
import numpy as np  
ngrps = 2    # number of groups
nsgrps = 3   # number of subgroups per group
nds = 4      # number of datasets per subgroup
nrows = 10   # rows per dataset
ncols = 2    # columns per dataset

i_arr_dtype = ( [ ('col1', int), ('col2', int) ] )
with h5py.File('d:/Path/sample_file.h5', 'w') as h5w :
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_'+str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_'+str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1,100, (nrows,ncols) )
                ds = grp2.create_dataset('calls_'+str(dcnt), data=i_arr)

I used h5py and numpy as follows:

import h5py
import numpy as np

def dump_calls2csv(name, node):    

    if isinstance(node, h5py.Dataset) and 'calls' in node.name :
       print ('visiting object:', node.name, ', exporting data to CSV')
       csvfname = node.name[1:].replace('/','_') +'.csv'
       arr = node[:]
       np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################    

with h5py.File('d:/Path/sample_file.h5', 'r') as h5r :        
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!

I also used PyTables, as follows:

import tables as tb
import numpy as np

with tb.File('sample_file.h5', 'r') as h5r :     
    for node in h5r.walk_nodes('/',classname='Leaf') :         
       print ('visiting object:', node._v_pathname, 'export data to CSV')
       csvfname = node._v_pathname[1:].replace('/','_') +'.csv'
       np.savetxt(csvfname, node.read(), fmt='%5d', delimiter=',')

But I get the errors below with each approach:

 C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
visiting object: /Genotypes/Genotype a/genotye a_1/calls , exporting data to CSV
.
.
.
some of the datasets
.
.
.
Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 31, in <module>
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 564, in proxy
    return func(name, self[name])
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 10, in dump_calls2csv
    np.savetxt(csv_name, arr, fmt='%5d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_Genotype_Name-Genotype_Name2_calls.csv'

Process finished with exit code 1

And the error from the second code is:

C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'locked' in node 'Genotypes'. Offending HDF5 class: 8
  value = self._g_getattr(self._v_node, name)
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'retainRareAlleles' in node 'Genotypes'. Offending HDF5 class: 8
  value = self._g_getattr(self._v_node, name)
visiting object: /Genotypes/AlleleStates export data to CSV
Traceback (most recent call last):
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1447, in savetxt
    v = format % tuple(row) + newline
TypeError: %d format: a number is required, not numpy.bytes_

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 40, in <module>
    np.savetxt(csvfname, node.read(), fmt= '%d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1451, in savetxt
    % (str(X.dtype), format))
TypeError: Mismatch between array dtype ('|S1') and format specifier ('%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d')

Process finished with exit code 1

Can anyone help me solve this? Please mention the exact changes I need to apply to the code and provide the complete code, since my coding background is limited; annotated code and any further explanation would be very helpful.

numpy hdf5 h5py pytables
2 Answers
0 votes

This is not a complete answer yet. I'm using it to format my questions from the comments above. Do you have spaces in your group/dataset names? If so, I think that is the problem with my simple example. I create each CSV file name from the group/dataset name path, replacing every '/' with '_'. You need to do the same for spaces (add .replace(' ', '-') to replace every ' ' with '-'). Print the csvfname variable to confirm it works as expected (and creates a valid file name).
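
A minimal sketch of that change, using the same dump_calls2csv() from the question (h5py and numpy imports as in the question; only the extra .replace(' ', '-') is new):

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        # build the CSV file name from the HDF5 path:
        # drop the leading '/', then replace '/' and ' ' with file-safe characters
        csvfname = node.name[1:].replace('/', '_').replace(' ', '-') + '.csv'
        print('visiting object:', node.name, ', exporting data to CSV as', csvfname)
        np.savetxt(csvfname, node[:], fmt='%5d', delimiter=',')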

If that isn't enough to solve your problem, keep reading. As I understand it, /Genotypes/genotype a/genotype a-1/calls are the datasets you want to write to CSV (one for each genotype x/genotype x-i/calls dataset). If so, you may have a mismatch between the data in the dataset and the format used to write it. First, print the dtype inside dump_calls2csv(), like this: print(arr.dtype). Comment out the np.savetxt() line until that works. From the error message, I suspect you will get "|S1" instead of integers, which is a problem because my example writes with an integer format: fmt='%d'. Ideally you read the dtype of the dataset/array and then build the fmt= string to match.
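
As a rough sketch of that idea; the '%s' fallback for non-integer data is my assumption, not something prescribed above:

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        arr = node[:]
        print(node.name, 'has dtype:', arr.dtype)   # check what the data really is
        if np.issubdtype(arr.dtype, np.integer):
            fmt = '%d'    # integer data, as in the sample file
        else:
            fmt = '%s'    # e.g. '|S1' byte strings; written as text instead of failing
        csvfname = node.name[1:].replace('/', '_').replace(' ', '-') + '.csv'
        np.savetxt(csvfname, arr, fmt=fmt, delimiter=',')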

Hope that helps. If not, please update your question with the new information.


0 votes

I downloaded the sample from your comment. Here is a new answer based on what I found. If all of the calls datasets hold integer data, the fmt='%d' format should work. The only problem I found was invalid characters in the file names created from the group/dataset paths. For example, some group names use ':' and '?'. I modified dump_calls2csv() to replace ':' with '-' and '?' with '#'. Run this and you should get the first 61 datasets written to CSV files. See the new code below:

def dump_calls2csv(name, node):         
    if isinstance(node, h5py.Dataset) and 'calls' in node.name :
       csvfname = node.name[1:] +'.csv'
       csvfname = csvfname.replace('/','_') # create csv file name from path
       csvfname = csvfname.replace(':','-') # modify invalid character
       csvfname = csvfname.replace('?','#') # modify invalid character
       print ('export data to CSV:', csvfname)
       np.savetxt(csvfname, node[:], fmt='%d', delimiter=',')
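
For completeness, the modified function is driven the same way as in the question; the imports and file path below are simply repeated from there:

import h5py
import numpy as np

with h5py.File('d:/Path/sample_file.h5', 'r') as h5r:
    h5r.visititems(dump_calls2csv)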

I print csvfname to confirm the character substitutions work as expected. Also, if a name is still wrong, it helps identify the problem dataset.

Hope that helps. Be patient when you run this: when I tested it, about half of the CSV files had been written after 45 minutes. At this point I think the only issue is the characters in the file names; it has nothing to do with HDF5, h5py, or np.savetxt(). For the general case (with arbitrary group/dataset names), you should test for any invalid file-name characters.
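
One way to handle the general case, sketched here as an assumption rather than part of the answer above: whitelist the characters you allow instead of replacing offenders one by one.

import re

def safe_csv_name(h5path):
    # keep letters, digits, '_', '.', and '-'; replace anything else with '_'
    return re.sub(r'[^A-Za-z0-9_.-]', '_', h5path.lstrip('/')) + '.csv'

# e.g. safe_csv_name('/Genotypes/genotype a:1/calls') -> 'Genotypes_genotype_a_1_calls.csv'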
