Why does reading an HDF5 file as a Dask DataFrame fail?


1. My problem

I get the following error when trying to read an HDF5 file with Dask, and I don't know why:

>>> dd.read_hdf("test.h5", key="/RECORDS/STATES")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/dask/dataframe/io/hdf.py", line 514, in read_hdf
    for path in paths
  File "/usr/local/lib/python3.7/site-packages/dask/dataframe/io/hdf.py", line 514, in <listcomp>
    for path in paths
  File "/usr/local/lib/python3.7/site-packages/dask/dataframe/io/hdf.py", line 382, in _read_single_hdf
    for k, s, d in zip(keys, stops, divisions)
  File "/usr/local/lib/python3.7/site-packages/dask/dataframe/multi.py", line 1071, in concat
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

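2. The HDF5 file

The file I am trying to read with Dask was generated by me using the HDF5 C API. In case you ask: I generate the HDF5 file in C rather than in Python (numpy, pandas) for performance, because I need to parse many GB of raw data in ASCII format. The data is stored in the file as HDF5 Tables (https://portal.hdfgroup.org/display/HDF5/Tables). The header of my file looks like this: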

HDF5 "rhoPimpleExtrae10TimeSteps.00.1iter.h5" {
GROUP "/" {
   ATTRIBUTE "hdf5_metadata_apps" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "hdf5_metadata_date" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "hdf5_metadata_hwcpu" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 48 ) / ( 48 ) }
   }
   ATTRIBUTE "hdf5_metadata_hwnodes" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
   ATTRIBUTE "hdf5_metadata_name" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "hdf5_metadata_nodes" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
   ATTRIBUTE "hdf5_metadata_path" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "hdf5_metadata_threads" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 48 ) / ( 48 ) }
   }
   ATTRIBUTE "hdf5_metadata_time" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SCALAR
   }
   ATTRIBUTE "hdf5_metadata_type" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   GROUP "RECORDS" {
      DATASET "COMMUNICATIONS" {
         DATATYPE  H5T_COMPOUND {
            H5T_STD_U32LE "CPU Send ID";
            H5T_STD_U32LE "Phy. Task Send ID";
            H5T_STD_U32LE "Log. Task Send ID";
            H5T_STD_U32LE "Thread Send ID";
            H5T_STD_U64LE "Log. Send Time";
            H5T_STD_U64LE "Phy. Send Time";
            H5T_STD_U32LE "CPU Receive ID";
            H5T_STD_U32LE "Phy. Task Receive ID";
            H5T_STD_U32LE "Log. Task Receive ID";
            H5T_STD_U32LE "Thread Receive ID";
            H5T_STD_U64LE "Log. Receive Time";
            H5T_STD_U64LE "Phy. Receive Time";
            H5T_STD_U64LE "Size";
            H5T_STD_U64LE "Tag";
         }
         DATASPACE  SIMPLE { ( 67574 ) / ( H5S_UNLIMITED ) }
         ATTRIBUTE "CLASS" {
            DATATYPE  H5T_STRING {
               STRSIZE 6;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_0_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 12;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_10_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 18;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_11_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 18;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_12_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 5;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_13_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 4;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_1_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 18;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_2_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 18;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_3_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 15;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_4_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 15;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_5_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 15;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_6_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 15;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_7_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 21;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_8_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 21;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_9_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 18;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "TITLE" {
            DATATYPE  H5T_STRING {
               STRSIZE 22;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "VERSION" {
            DATATYPE  H5T_STRING {
               STRSIZE 4;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
      }
      DATASET "EVENTS" {
         DATATYPE  H5T_COMPOUND {
            H5T_STD_U32LE "CPU ID";
            H5T_STD_U16LE "APP ID";
            H5T_STD_U32LE "Task ID";
            H5T_STD_U32LE "Thread ID";
            H5T_STD_U64LE "Time";
            H5T_STD_U64LE "Event Type";
            H5T_STD_U64LE "Event Value";
         }
         DATASPACE  SIMPLE { ( 3643006 ) / ( H5S_UNLIMITED ) }
         ATTRIBUTE "CLASS" {
            DATATYPE  H5T_STRING {
               STRSIZE 6;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_0_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 7;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_1_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 7;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_2_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 8;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_3_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 10;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_4_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 5;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_5_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 11;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_6_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 12;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "TITLE" {
            DATATYPE  H5T_STRING {
               STRSIZE 14;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "VERSION" {
            DATATYPE  H5T_STRING {
               STRSIZE 4;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
      }
      DATASET "STATES" {
         DATATYPE  H5T_COMPOUND {
            H5T_STD_U32LE "CPU ID";
            H5T_STD_U16LE "APP ID";
            H5T_STD_U32LE "Task ID";
            H5T_STD_U32LE "Thread ID";
            H5T_STD_U64LE "Time ini";
            H5T_STD_U64LE "Time fi";
            H5T_STD_U16LE "State";
         }
         DATASPACE  SIMPLE { ( 301496 ) / ( H5S_UNLIMITED ) }
         ATTRIBUTE "CLASS" {
            DATATYPE  H5T_STRING {
               STRSIZE 6;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_0_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 7;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_1_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 7;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_2_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 8;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_3_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 10;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_4_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 9;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_5_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 8;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "FIELD_6_NAME" {
            DATATYPE  H5T_STRING {
               STRSIZE 6;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "TITLE" {
            DATATYPE  H5T_STRING {
               STRSIZE 14;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         ATTRIBUTE "VERSION" {
            DATATYPE  H5T_STRING {
               STRSIZE 4;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
      }
   }
}
}

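I basically have 3 datasets under /RECORDS (STATES, EVENTS and COMMUNICATIONS). I don't think there is anything unusual about my HDF5 file. I tried loading these datasets with Pandas and with Dask arrays, and that works fine.

3. What I want to know

What is wrong with my HDF5 file that prevents Dask from reading it as a DataFrame?

I tried to find in the Dask documentation which requirements an HDF5 file must meet, but found nothing on this topic. If I at least knew what is wrong with my file, I could fix it.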

dataframe dask hdf5
1 Answer

PR https://github.com/pandas-dev/pandas/pull/32723 was recently merged into dask master, which hopefully fixes this issue for you.
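If upgrading is not an option, one possible workaround builds on what the question already reports as working: reading the datasets without going through dd.read_hdf. The sketch below is only an illustration, assuming h5py, pandas and dask are installed. The helper names row_ranges, read_chunk and read_table_as_dask are hypothetical, and the chunk size of 100,000 rows is an arbitrary choice. It splits the table into row ranges, reads each range lazily, and assembles the pieces with dd.from_delayed.

```python
def row_ranges(n_rows, chunk_rows):
    """Split [0, n_rows) into consecutive (start, stop) pairs."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    return [(start, min(start + chunk_rows, n_rows))
            for start in range(0, n_rows, chunk_rows)]


def read_chunk(path, key, start, stop):
    """Read rows [start, stop) of a compound dataset into a pandas DataFrame."""
    # Imported here so the pure helper above works without these packages.
    import h5py
    import pandas as pd

    with h5py.File(path, "r") as f:
        # Slicing a compound dataset yields a NumPy structured array.
        records = f[key][start:stop]
    return pd.DataFrame.from_records(records)


def read_table_as_dask(path, key, chunk_rows=100_000):
    """Build a Dask DataFrame with one partition per row range."""
    import dask
    import dask.dataframe as dd
    import h5py

    with h5py.File(path, "r") as f:
        n_rows = f[key].shape[0]

    parts = [dask.delayed(read_chunk)(path, key, start, stop)
             for start, stop in row_ranges(n_rows, chunk_rows)]
    # Without an explicit meta, Dask computes the first partition
    # to infer column names and dtypes.
    return dd.from_delayed(parts)
```

With the file from the question, a call such as read_table_as_dask("test.h5", "/RECORDS/STATES") would produce four partitions for the 301496 rows of STATES. Each partition keeps its own default index starting at 0, so set an explicit index afterwards if row positions matter.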
