I use Azure Stream Analytics to convert some JSON documents into Parquet files.
For most of the files, I can read them afterwards, but for some of them I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 677, in scan_contents
    return self.reader.scan_contents(column_indices,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_parquet.pyx", line 1389, in pyarrow._parquet.ParquetReader.scan_contents
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Malformed levels. min: 0 max: 4 out of range. Max Level: 3
The Parquet file schema is generated by pyarrow itself. When I print it, I get this:
<pyarrow._parquet.ParquetSchema object at 0x12cc86d00>
required group field_id=-1 root {
  optional group field_id=-1 hierarchy {
    optional binary field_id=-1 content (String);
    optional group field_id=-1 reference {
      optional binary field_id=-1 type (String);
      optional binary field_id=-1 value (String);
    }
    optional int64 field_id=-1 quantity;
    optional group field_id=-1 children (List) {
      repeated group field_id=-1 list {
        optional group field_id=-1 {
          optional binary field_id=-1 content (String);
          optional group field_id=-1 reference {
            optional binary field_id=-1 type (String);
            optional binary field_id=-1 value (String);
          }
          optional int64 field_id=-1 quantity;
          optional group field_id=-1 children (List) {
            repeated group field_id=-1 list {
              optional group field_id=-1 {
                optional binary field_id=-1 content (String);
                optional group field_id=-1 reference {
                  optional binary field_id=-1 type (String);
                  optional binary field_id=-1 value (String);
                }
                optional int64 field_id=-1 quantity;
                repeated binary field_id=-1 children (String);
                optional group field_id=-1 assets (List) {
                  repeated group field_id=-1 list {
                    optional group field_id=-1 {
                      optional binary field_id=-1 content (String);
                      optional binary field_id=-1 type (String);
                    }
                  }
                }
                repeated binary field_id=-1 sharing_units (String);
                optional group field_id=-1 specific_data (List) {
                  repeated group field_id=-1 list {
                    optional group field_id=-1 {
                      optional group field_id=-1 target_organization {
                        optional binary field_id=-1 id (String);
                      }
                      optional binary field_id=-1 content (String);
                    }
                  }
                }
                repeated binary field_id=-1 validation_rules (String);
                repeated binary field_id=-1 metadata (String);
              }
            }
          }
          repeated binary field_id=-1 assets (String);
          repeated binary field_id=-1 sharing_units (String);
          repeated binary field_id=-1 specific_data (String);
          repeated binary field_id=-1 validation_rules (String);
          repeated binary field_id=-1 metadata (String);
        }
      }
    }
    repeated binary field_id=-1 assets (String);
    repeated binary field_id=-1 sharing_units (String);
    repeated binary field_id=-1 specific_data (String);
    repeated binary field_id=-1 validation_rules (String);
    optional group field_id=-1 metadata (List) {
      repeated group field_id=-1 list {
        optional group field_id=-1 {
          optional binary field_id=-1 id (String);
          optional int64 field_id=-1 role;
          optional binary field_id=-1 type (String);
        }
      }
    }
  }
}
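I don't understand why pyarrow cannot read files produced by another tool, and I have not found any detailed information about this error.
Do you have any ideas?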
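This indicates that the schema provided does not match how the data is actually encoded in the file. The error message could certainly use more detail, but you can narrow down the offending column by trying to read each column on its own until you can reproduce the error. Then you can check whether this is a bug in the code that wrote the file (judged against the schema) or a bug in the reader.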
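For more context, Parquet uses repetition and definition levels to encode nested data. When the data is read, the maximum level for each column is computed from the schema. Then, before any decoding is done, Parquet validates that all repetition/definition levels of a page are within the expected range. In this case, a level with the value 4 was found when the maximum expected level was 3.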