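I've written an ETL job that transforms a bunch of JSON files into time-partitioned Parquet files (objects) stored on S3.
Instead of manually creating a table in AWS Athena and using the Athena data catalog, I decided to use the AWS Glue data store, which crawls the already transformed Parquet files and generates a seemingly correct schema. Here it is: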
CREATE EXTERNAL TABLE `table_fd2f388f79ee6`(
`field1` string,
`field2` string,
`data` struct<attrib1:string,gpId:string,attrib2:boolean,attrib3:array<string>,attrib4:struct<f1:int,f2:int>>)
PARTITIONED BY (
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://path'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='crawlername',
'averageRecordSize'='17',
'classification'='parquet',
'compressionType'='none',
'objectCount'='2',
'recordCount'='726',
'sizeKey'='287',
'typeOfData'='file')
However, even for a simple select * query I get this error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket/year=2018/month=07/day=03/part-00258-e1bcec61-f24e-40a2-8fac-fdd017054c2a.c000.snappy.parquet (offset=0, length=5356): Column data.attrib type LIST not supported
Is this a bug, a limitation, or something I need to correct?
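Your Athena table fields need to be declared in exactly the same order as in the Parquet schema, otherwise the query will fail!
If your Parquet schema is: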
id: integer (nullable = false)
rating: struct (nullable = true)
    related_to: struct (nullable = true)
        category: integer (nullable = false)
        name: float (nullable = true)
        type: string (nullable = false)
    rating_results: array (nullable = true)
        element: struct (containsNull = true)
            toto: integer (nullable = false)
            tata: float (nullable = true)
            titi: string (nullable = true)
other: string (nullable = true)
then your Athena table needs to be:
`id` INT,
`rating` STRUCT<
    `related_to`: STRUCT<
        `category`: INT,
        `name`: FLOAT,
        `type`: STRING
    >,
    rating_results: ARRAY<
        STRUCT<
            toto: INT,
            tata: FLOAT,
            titi: STRING>
    >
>,
`other` STRING
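Since the order has to match at every nesting level, it can help to generate the column list from one ordered description of the schema rather than typing it by hand. Here is a minimal sketch in plain Python. The helper names `athena_type` and `athena_columns` and the tuple-based schema notation are made up for illustration and are not part of Athena, Glue or Spark. It only builds the type strings for the example schema above.

```python
# Hypothetical helpers: plain string building, no AWS or Spark API involved.

def athena_type(node):
    """Render a schema node as a Hive/Athena type string, keeping field order."""
    if isinstance(node, str):              # primitive type, e.g. "INT"
        return node
    kind, payload = node
    if kind == "struct":                   # payload: ordered list of (name, node)
        fields = ",".join(f"{name}:{athena_type(child)}" for name, child in payload)
        return f"STRUCT<{fields}>"
    if kind == "array":                    # payload: the element node
        return f"ARRAY<{athena_type(payload)}>"
    raise ValueError(f"unknown node kind: {kind}")


def athena_columns(schema):
    """Render the top-level columns in exactly the order they are given."""
    return ",\n".join(f"`{name}` {athena_type(node)}" for name, node in schema)


# The example schema from this answer, listed in Parquet order.
schema = [
    ("id", "INT"),
    ("rating", ("struct", [
        ("related_to", ("struct", [
            ("category", "INT"),
            ("name", "FLOAT"),
            ("type", "STRING"),
        ])),
        ("rating_results", ("array", ("struct", [
            ("toto", "INT"),
            ("tata", "FLOAT"),
            ("titi", "STRING"),
        ]))),
    ])),
    ("other", "STRING"),
]

print(athena_columns(schema))
```

The printed text can be pasted between the parentheses of a CREATE EXTERNAL TABLE statement. Lists are used instead of dicts for struct fields so that the declared order is explicit.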
Apparently AWS Athena does not set this SerDe option by default:
'hive.parquet.use-column-names' = 'true'
and it does not apply it either when you set it in WITH SERDEPROPERTIES.
Also be careful: if you export the Parquet files with Spark, have a look at this option:
"spark.sql.parquet.writeLegacyFormat", true
More details: Query results difference between EMR-Presto and Athena
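For reference, here is a rough sketch of how the Spark option above could be set from PySpark before writing. The function name `write_parquet_for_athena` and its arguments are hypothetical. `spark` is assumed to be an existing SparkSession and `df` a DataFrame that already has `year`, `month` and `day` columns. Whether this setting removes the LIST error in your setup is something you would have to verify yourself.

```python
def write_parquet_for_athena(spark, df, path):
    """Write df as year/month/day-partitioned Parquet using Spark's legacy layout.

    spark: an existing SparkSession (assumed)
    df:    a DataFrame containing year, month and day columns (assumed)
    path:  target location, e.g. an s3:// prefix (hypothetical)
    """
    # The option named in the answer; runtime SQL confs are set as strings.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

    # Partition columns taken from the table definition in the question.
    df.write.partitionBy("year", "month", "day").parquet(path)
```

After rewriting the files you would need to re-run the crawler (or reload the partitions) so that the table points at the new objects.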
And one last piece of advice: be careful with the decimal type (it is fixed in Presto but not in Athena): https://github.com/prestodb/presto/issues/7232