Parquet / pyarrow: Malformed levels

Question (votes: 0, answers: 1)

I use Azure Stream Analytics to convert some JSON documents into Parquet files.

I can read most of the resulting files afterwards, but for some of them I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 677, in scan_contents
    return self.reader.scan_contents(column_indices,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_parquet.pyx", line 1389, in pyarrow._parquet.ParquetReader.scan_contents
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Malformed levels. min: 0 max: 4 out of range.  Max Level: 3

The Parquet file schema is generated by pyarrow itself. When I print it, I get this:

<pyarrow._parquet.ParquetSchema object at 0x12cc86d00>
required group field_id=-1 root {
  optional group field_id=-1 hierarchy {
    optional binary field_id=-1 content (String);
    optional group field_id=-1 reference {
      optional binary field_id=-1 type (String);
      optional binary field_id=-1 value (String);
    }
    optional int64 field_id=-1 quantity;
    optional group field_id=-1 children (List) {
      repeated group field_id=-1 list {
        optional group field_id=-1  {
          optional binary field_id=-1 content (String);
          optional group field_id=-1 reference {
            optional binary field_id=-1 type (String);
            optional binary field_id=-1 value (String);
          }
          optional int64 field_id=-1 quantity;
          optional group field_id=-1 children (List) {
            repeated group field_id=-1 list {
              optional group field_id=-1  {
                optional binary field_id=-1 content (String);
                optional group field_id=-1 reference {
                  optional binary field_id=-1 type (String);
                  optional binary field_id=-1 value (String);
                }
                optional int64 field_id=-1 quantity;
                repeated binary field_id=-1 children (String);
                optional group field_id=-1 assets (List) {
                  repeated group field_id=-1 list {
                    optional group field_id=-1  {
                      optional binary field_id=-1 content (String);
                      optional binary field_id=-1 type (String);
                    }
                  }
                }
                repeated binary field_id=-1 sharing_units (String);
                optional group field_id=-1 specific_data (List) {
                  repeated group field_id=-1 list {
                    optional group field_id=-1  {
                      optional group field_id=-1 target_organization {
                        optional binary field_id=-1 id (String);
                      }
                      optional binary field_id=-1 content (String);
                    }
                  }
                }
                repeated binary field_id=-1 validation_rules (String);
                repeated binary field_id=-1 metadata (String);
              }
            }
          }
          repeated binary field_id=-1 assets (String);
          repeated binary field_id=-1 sharing_units (String);
          repeated binary field_id=-1 specific_data (String);
          repeated binary field_id=-1 validation_rules (String);
          repeated binary field_id=-1 metadata (String);
        }
      }
    }
    repeated binary field_id=-1 assets (String);
    repeated binary field_id=-1 sharing_units (String);
    repeated binary field_id=-1 specific_data (String);
    repeated binary field_id=-1 validation_rules (String);
    optional group field_id=-1 metadata (List) {
      repeated group field_id=-1 list {
        optional group field_id=-1  {
          optional binary field_id=-1 id (String);
          optional int64 field_id=-1 role;
          optional binary field_id=-1 type (String);
        }
      }
    }
  }
}

I don't understand why pyarrow can't read a file generated by another tool, and I haven't found any details about this error.

Do you have any ideas?

python parquet pyarrow
1 Answer (0 votes)

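This suggests that the schema provided doesn't match how the data in the file is encoded. The error message could certainly use more detail, but you can narrow down the offending column by trying to read each column individually until you can reproduce the error. Then you can see whether this is a bug in the code that wrote the file (judged against the schema) or a bug in the reader.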

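For more context, Parquet uses repetition and definition levels to encode nested data. When the data is read, the maximum level is computed from the schema. Then, before any decoding is done, the Parquet reader validates that all repetition/definition levels in a page are within the expected range. In this case, a level with the value 4 was found when the maximum expected level was 3.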
