How to query a partitioned table in AWS Athena


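Summary of the problem

When I try to query a partitioned table using a WHERE clause, Athena returns an error.

My logs table has 4 partition keys:

  • year (string)
  • month (string)
  • day (string)
  • hour (string)

I tried to run a SELECT query against the partitioned table, but I got the following error message.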

Error message

GENERIC_INTERNAL_ERROR: No value present
This query ran against the "default" database, unless qualified by the query.

SELECT queries I tried

SELECT * FROM logs WHERE year='2020' AND month='10' AND day ='05';

and

SELECT * FROM "default"."logs" WHERE year='2020' AND month='10' AND day ='05';

Since the error message says No value present, I checked the partitions.

SHOW PARTITIONS logs;

Result

year=2020/month=10/day=05/hour=17
year=2020/month=10/day=05/hour=11
year=2020/month=10/day=05/hour=19
year=2020/month=10/day=05/hour=04
year=2020/month=10/day=05/hour=18
year=2020/month=10/day=05/hour=15
year=2020/month=10/day=05/hour=14
year=2020/month=10/day=05/hour=16
year=2020/month=10/day=05/hour=13
year=2020/month=10/day=05/hour=21
year=2020/month=10/day=05/hour=05
year=2020/month=10/day=05/hour=08
year=2020/month=10/day=05/hour=20
year=2020/month=10/day=05/hour=12
year=2020/month=10/day=05/hour=03
year=2020/month=10/day=05/hour=01
year=2020/month=10/day=05/hour=10
year=2020/month=10/day=05/hour=02
year=2020/month=10/day=05/hour=09
year=2020/month=10/day=05/hour=22
year=2020/month=10/day=05/hour=23
year=2020/month=10/day=05/hour=06
year=2020/month=10/day=05/hour=07
year=2020/month=10/day=05/hour=00
year=2020/month=10/day=04/hour=00

I would really appreciate your help.

More information

我使用的命令

创建表

CREATE TABLE


amazon-web-services amazon-athena partition
2个回答
11
投票

分区投影配置必须与表的分区键完全匹配。在您的情况下,表有四个分区键,分区投影配置提到了五个。除了四个类型不对之外,没有

CREATE EXTERNAL TABLE `logs`( `date` date, `time` string, `location` string, `bytes` bigint, `request_ip` string, `method` string, `host` string, `uri` string, `status` int, `referrer` string, `user_agent` string, `query_string` string, `cookie` string, `result_type` string, `request_id` string, `host_header` string, `request_protocol` string, `request_bytes` bigint, `time_taken` float, `xforwarded_for` string, `ssl_protocol` string, `ssl_cipher` string, `response_result_type` string, `http_version` string, `fle_status` string, `fle_encrypted_fields` int) PARTITIONED BY ( `year` string, `month` string, `day` string, `hour` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' WITH SERDEPROPERTIES ( 'input.regex'='^(?!#)([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)$') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://mybucket/path' TBLPROPERTIES ( 'projection.date.format'='yyyy/MM/dd', 'projection.date.interval'='1', 'projection.date.interval.unit'='DAYS', 'projection.date.range'='2019/11/27, NOW-1DAYS', 'projection.date.type'='date', 'projection.day.type'='string', 'projection.enabled'='true', 'projection.hour.type'='string', 'projection.month.type'='string', 'projection.year.type'='string', 'skip.header.line.count'='2', 'storage.location.template'='s3://mybucket/path/distributionID/${year}/${month}/${day}/${hour}/', 'transient_lastDdlTime'='1575005094')

分区投影类型。
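The mismatch described above can be checked mechanically. The following is a minimal sketch in Python, not anything Athena or an AWS SDK provides: `projection_mismatches` is a hypothetical helper, and the set of valid types (enum, integer, date, injected) is taken from the Athena partition projection documentation. It compares the partition keys with the `projection.<column>.type` properties copied from the table definition in the question.

```python
# Hypothetical helper for illustration; not part of Athena or any AWS SDK.
def projection_mismatches(partition_keys, tbl_properties):
    """Return a list of problems between partition keys and projection.* properties."""
    # Projection types listed in the Athena partition projection documentation.
    valid_types = {"enum", "integer", "date", "injected"}

    # Collect every property shaped like projection.<column>.type.
    projected = {}
    for name, value in tbl_properties.items():
        parts = name.split(".")
        if len(parts) == 3 and parts[0] == "projection" and parts[2] == "type":
            projected[parts[1]] = value

    problems = []
    for key in partition_keys:
        if key not in projected:
            problems.append(f"partition key '{key}' has no projection type")
    for column, type_name in projected.items():
        if column not in partition_keys:
            problems.append(f"projection configured for '{column}', which is not a partition key")
        if type_name not in valid_types:
            problems.append(f"'{type_name}' is not a projection type (column '{column}')")
    return problems


# Values copied from the table definition in the question.
keys = ["year", "month", "day", "hour"]
props = {
    "projection.enabled": "true",
    "projection.date.type": "date",
    "projection.year.type": "string",
    "projection.month.type": "string",
    "projection.day.type": "string",
    "projection.hour.type": "string",
}
for problem in projection_mismatches(keys, props):
    print(problem)
```

For the question's table this reports the extra date projection and the four string types. A configuration with partition keys date and hour and projection types date and integer produces no problems.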

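You can fix the problem by making two changes. First, change the partition keys like this:

PARTITIONED BY ( `date` string, `hour` string )

This removes the year, month, and day partition keys and replaces them with a single date key. There is no need for separate partition keys for the date components just because they are separate "directories" in S3, and a single date key makes queries easier to write.

Then change the table properties to:

TBLPROPERTIES (
  'projection.date.format' = 'yyyy/MM/dd',
  'projection.date.interval' = '1',
  'projection.date.interval.unit' = 'DAYS',
  'projection.date.range' = '2019/11/27, NOW-1DAYS',
  'projection.date.type' = 'date',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'projection.enabled' = 'true',
  'storage.location.template'='s3://mybucket/path/distributionID/${date}/${hour}/',
  'skip.header.line.count' = '2'
)

This tells Athena that the date partition key is of type date and that it is formatted as YYYY/MM/DD (which corresponds to the format in the S3 URI; this is important). It also tells Athena that the hour partition key is an integer with a range of 0 to 23, formatted with two digits (i.e. zero-padded). Finally, it specifies how these partition keys map to the partition locations on S3. When the date in a query is '2020/10/06', that string is inserted verbatim into the location template.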
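To make the mapping from partition keys to S3 locations concrete, here is a rough Python sketch of the substitution described above. It is only an illustration: Athena computes these locations internally, and `projected_locations` is a hypothetical function. The template is the one from the table properties; the date and hour formatting is assumed to mirror 'projection.date.format' and 'projection.hour.digits'.

```python
from datetime import date, timedelta


# Hypothetical illustration; Athena performs this expansion internally.
def projected_locations(template, start, end, hours=range(24)):
    """Expand ${date} and ${hour} for every day in [start, end] and every given hour."""
    locations = []
    day = start
    while day <= end:
        # Mirrors 'projection.date.format' = 'yyyy/MM/dd'.
        date_text = day.strftime("%Y/%m/%d")
        for hour in hours:
            # Mirrors 'projection.hour.digits' = '2' (zero-padded to two digits).
            hour_text = f"{hour:02d}"
            locations.append(
                template.replace("${date}", date_text).replace("${hour}", hour_text)
            )
        day += timedelta(days=1)
    return locations


template = "s3://mybucket/path/distributionID/${date}/${hour}/"
paths = projected_locations(template, date(2020, 10, 6), date(2020, 10, 6), hours=range(9, 11))
for path in paths:
    print(path)
```

The two printed locations end in 2020/10/06/09/ and 2020/10/06/10/, which is why the date format and the zero padding have to match the actual S3 layout.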
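With these changes you should be able to run queries like the following (date is a reserved word and must be quoted when it is used as a column name):

SELECT *
FROM logs
WHERE "date" = '2020/10/06'

SELECT *
FROM logs
WHERE "date" BETWEEN '2020/10/01' AND '2020/10/06'
  AND hour BETWEEN 9 AND 21

Note that the date format must be exactly the same as in the partition projection configuration, i.e. YYYY/MM/DD.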
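Because the date literal in the query has to match the projection format character for character, it can help to build it from a single format constant when queries are generated by a script. A small Python sketch follows; `date_filter` and `PROJECTION_DATE_FORMAT` are hypothetical names, and the format is assumed to mirror 'projection.date.format' = 'yyyy/MM/dd'.

```python
from datetime import date

# Assumed to mirror 'projection.date.format' = 'yyyy/MM/dd' from the table properties.
PROJECTION_DATE_FORMAT = "%Y/%m/%d"


def date_filter(day):
    """Build a WHERE fragment on the quoted "date" partition key (hypothetical helper)."""
    return f"\"date\" = '{day.strftime(PROJECTION_DATE_FORMAT)}'"


print(date_filter(date(2020, 10, 6)))
```

Python's default string form of the same date uses hyphens (2020-10-06), which would not match the slash-separated format configured for the projection.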

    


2 votes

'projection.hour.digits' = '2' was the key to making it work here. Thanks @theo.

In my case, for Parquet files, I'm using the following:
