"Type LIST not supported" when querying AWS Athena on a table generated by the AWS Glue Catalog

Question

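I wrote an ETL job that converts a bunch of JSON files into time-partitioned Parquet files (objects) stored on S3.

Instead of creating the table manually in AWS Athena and using the Athena data catalog, I decided to use the AWS Glue data store, which crawls the converted Parquet files and generates a schema that looks correct. Here it is: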

CREATE EXTERNAL TABLE `table_fd2f388f79ee6`(
  `field1` string, 
  `field2` string, 
  `data` struct<attrib1:string,gpId:string,attrib2:boolean,attrib3:array<string>,attrib4:struct<f1:int,f2:int>>)
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://path'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='crawlername', 
  'averageRecordSize'='17', 
  'classification'='parquet', 
  'compressionType'='none', 
  'objectCount'='2', 
  'recordCount'='726', 
  'sizeKey'='287', 
  'typeOfData'='file')

However, even for a simple select * query I get this error:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket/year=2018/month=07/day=03/part-00258-e1bcec61-f24e-40a2-8fac-fdd017054c2a.c000.snappy.parquet (offset=0, length=5356): Column data.attrib type LIST not supported

Is this a bug, a limitation, or something I need to correct?

Answer 1

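The fields of your Athena table must be declared in exactly the same order as they appear in the Parquet schema, otherwise the query will fail!

If your Parquet schema is: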

id: integer (nullable = false)
rating: struct (nullable = true)
  related_to: struct (nullable = true)
       category: integer (nullable = false)
       name: float (nullable = true)
       type: string (nullable = false)
  rating_results: array (nullable = true)
       element: struct (containsNull = true)
            toto: integer (nullable = false)
            tata: float (nullable = true)
            titi: string (nullable = true)
other: string (nullable = true)

then your Athena table needs to be:

`id` INT,
`rating` STRUCT<
                 `related_to`: STRUCT<
                         `category`: INT,
                         `name`: FLOAT,
                         `type`: STRING
                 >,
                 rating_results : ARRAY<
                            STRUCT<
                            toto: INT,
                            tata: FLOAT,
                            titi: STRING>
                            >
                 >,
`other` STRING
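
One way to keep the declaration order aligned with the Parquet schema is to generate the column list from a single ordered description of the schema instead of typing it by hand. The following is only a minimal sketch: the helper names `hive_type` and `column_definitions` and the dict/list encoding of the schema are hypothetical conventions invented for this illustration, not part of Athena, Glue, or Spark.

```python
def hive_type(node):
    """Render a nested schema node as a Hive/Athena type string.

    Hypothetical encoding used only in this sketch:
      - str  -> primitive type name, e.g. "int"
      - list -> array with exactly one element type
      - dict -> struct; key order is kept as the field order
    """
    if isinstance(node, str):
        return node.upper()
    if isinstance(node, list):
        (element,) = node  # an array node must hold exactly one element type
        return f"ARRAY<{hive_type(element)}>"
    if isinstance(node, dict):
        fields = ",".join(f"{name}:{hive_type(child)}" for name, child in node.items())
        return f"STRUCT<{fields}>"
    raise TypeError(f"unsupported schema node: {node!r}")


def column_definitions(schema):
    """Build the column list of a CREATE EXTERNAL TABLE statement, in schema order."""
    return ",\n".join(f"`{name}` {hive_type(node)}" for name, node in schema.items())


# The example Parquet schema from this answer, written top to bottom in file order.
schema = {
    "id": "int",
    "rating": {
        "related_to": {"category": "int", "name": "float", "type": "string"},
        "rating_results": [{"toto": "int", "tata": "float", "titi": "string"}],
    },
    "other": "string",
}

ddl = column_definitions(schema)
print(ddl)
```

Python dictionaries preserve insertion order (3.7 and later), so the generated columns and struct fields come out in the order they were written in `schema`.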

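Apparently AWS Athena does not set this SerDe option by default:

'hive.parquet.use-column-names' = 'true'

and it does not apply the option even when you set it in WITH SERDEPROPERTIES, so columns appear to be matched by position rather than by name, which is why the declaration order matters.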

Also be careful: if you export the Parquet files with Spark, take a look at this option:

"spark.sql.parquet.writeLegacyFormat", true

More details: Query results difference between EMR-Presto and Athena
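
As a rough illustration of where that Spark option would go, here is a minimal PySpark sketch. It assumes the ETL job is written in PySpark and that the DataFrame already contains `year`, `month` and `day` columns; the function names, the application name and the path arguments are hypothetical. According to the Spark documentation, setting `spark.sql.parquet.writeLegacyFormat` to true makes Spark write Parquet the way Spark 1.4 and earlier did.

```python
# Spark setting named in the answer; Spark configuration values are passed as strings.
LEGACY_PARQUET_CONF = {"spark.sql.parquet.writeLegacyFormat": "true"}


def build_session(app_name="json-to-parquet"):
    """Create a SparkSession with the legacy Parquet writer enabled.

    The app name is a placeholder chosen for this sketch.
    """
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in LEGACY_PARQUET_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def convert(spark, source_path, target_path):
    """Read JSON files and write them as Parquet partitioned by year/month/day.

    Assumes the input records already carry year, month and day columns.
    """
    df = spark.read.json(source_path)
    df.write.partitionBy("year", "month", "day").parquet(target_path)
```

Files written before the setting was changed keep their original encoding, so they would have to be rewritten for the option to have any effect on them.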

One last piece of advice: be careful with decimal types (this is fixed in Presto but not in Athena): https://github.com/prestodb/presto/issues/7232
