"Type LIST not supported" when querying AWS Athena on a table generated by the AWS Glue Catalog

Question (votes: 2, answers: 1)

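I wrote an ETL job that converts a bunch of JSON files into time-partitioned Parquet files (objects) stored on S3.

Instead of creating the table manually in AWS Athena and using the Athena data catalog, I decided to use the AWS Glue data catalog, which crawls the converted Parquet files and generates what seems to be the correct schema. Here it is: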

CREATE EXTERNAL TABLE `table_fd2f388f79ee6`(
  `field1` string, 
  `field2` string, 
  `data` struct<attrib1:string,gpId:string,attrib2:boolean,attrib3:array<string>,attrib4:struct<f1:int,f2:int>>)
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://path'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='crawlername', 
  'averageRecordSize'='17', 
  'classification'='parquet', 
  'compressionType'='none', 
  'objectCount'='2', 
  'recordCount'='726', 
  'sizeKey'='287', 
  'typeOfData'='file')

However, even a simple select * query fails with this error:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket/year=2018/month=07/day=03/part-00258-e1bcec61-f24e-40a2-8fac-fdd017054c2a.c000.snappy.parquet (offset=0, length=5356): type LIST not supported for column data.attrib

Is this a bug, a limitation, or something I need to fix on my side?

amazon-web-services amazon-athena aws-glue
1 Answer
0 votes

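The fields of your Athena table must be declared in exactly the same order as in the Parquet schema, nested fields included, otherwise the query fails!

If your Parquet schema is: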

id: integer (nullable = false)
rating: struct (nullable = true)
  related_to: struct (nullable = true)
       category: integer (nullable = false)
       name: float (nullable = true)
       type: string (nullable = false)
  rating_results: array (nullable = true)
       element: struct (containsNull = true)
            toto: integer (nullable = false)
            tata: float (nullable = true)
            titi: string (nullable = true)
other: string (nullable = true)

then your Athena table needs to be:

`id` INT,
`rating` STRUCT<
                 `related_to`: STRUCT<
                         `category`: INT,
                         `name`: FLOAT,
                         `type`: STRING
                 >,
                 rating_results : ARRAY<
                            STRUCT<
                            toto: INT,
                            tata: FLOAT,
                            titi: STRING>
                            >
                 >,
`other` STRING
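
To keep the declared order in sync with the file, the column definitions can be generated from the schema rather than typed by hand. The following is only a minimal sketch: the nested tuple representation of the schema, the `PRIMITIVES` mapping, and the helper names `hive_type` and `athena_columns` are all made up for illustration and are not part of Athena, Glue, or Spark.

```python
# Hypothetical schema representation (an assumption for this sketch):
# a field is (name, type); a type is either a primitive name (str),
# ("struct", [fields...]) or ("array", element_type).

# Spark-style primitive names mapped to Hive/Athena type names.
PRIMITIVES = {
    "integer": "INT", "long": "BIGINT", "float": "FLOAT",
    "double": "DOUBLE", "string": "STRING", "boolean": "BOOLEAN",
}


def hive_type(t):
    """Render one type, keeping nested fields in their original order."""
    if isinstance(t, str):
        return PRIMITIVES[t]
    kind, payload = t
    if kind == "struct":
        inner = ",".join(f"{name}:{hive_type(ft)}" for name, ft in payload)
        return f"STRUCT<{inner}>"
    if kind == "array":
        return f"ARRAY<{hive_type(payload)}>"
    raise ValueError(f"unsupported type: {kind}")


def athena_columns(fields):
    """Render the top-level column definitions, one per line, in schema order."""
    return ",\n".join(f"`{name}` {hive_type(ft)}" for name, ft in fields)


# The Parquet schema from this answer, written in file order.
schema = [
    ("id", "integer"),
    ("rating", ("struct", [
        ("related_to", ("struct", [
            ("category", "integer"),
            ("name", "float"),
            ("type", "string"),
        ])),
        ("rating_results", ("array", ("struct", [
            ("toto", "integer"),
            ("tata", "float"),
            ("titi", "string"),
        ]))),
    ])),
    ("other", "string"),
]

print(athena_columns(schema))
```

The printed lines can be pasted between the parentheses of a CREATE EXTERNAL TABLE statement. Because lists are used everywhere, the field order of the input is preserved at every nesting level.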

Apparently AWS Athena does not set this SerDe option by default:

'hive.parquet.use-column-names' = 'true'

and it does not apply it even when you set it in WITH SERDEPROPERTIES.


Also be careful if you export the Parquet files with Spark; have a look at this option:

"spark.sql.parquet.writeLegacyFormat", true

More details: Query results difference between EMR-Presto and Athena
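
For reference, here is one way the Spark option above might be set from PySpark. This is a sketch under assumptions: `LEGACY_PARQUET_CONF`, `build_session`, and the application name are hypothetical, and whether the legacy layout actually fixes the LIST error depends on how the files were written. `spark.sql.parquet.writeLegacyFormat` itself is a real Spark SQL setting (default `false`) that makes Spark write Parquet in the older, Hive-compatible layout.

```python
# Settings to apply before writing Parquet (names other than the config key
# are hypothetical and chosen only for this sketch).
LEGACY_PARQUET_CONF = {
    "spark.sql.parquet.writeLegacyFormat": "true",
}


def build_session(app_name="json-to-parquet"):
    """Create a SparkSession with the legacy Parquet writer enabled."""
    # Imported lazily so the module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in LEGACY_PARQUET_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The option only affects files written after it is set, so the existing Parquet objects on S3 would have to be regenerated by the ETL job before querying them again.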

One last piece of advice: be careful with decimal types (this is fixed in Presto but not in Athena): https://github.com/prestodb/presto/issues/7232
