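I have some JSON data in an S3 bucket that is partitioned into multiple folders. Each folder represents one partition, and its name corresponds to the date and time the data was added to S3. The structure looks like this: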
bucket:
--- 2023-10-18-10-08/ (Folder containing data that was created at 10:08 on October 18th)
--- 2023-10-18-10-42/
--- 2023-10-18-11-10/
--- 2023-10-18-11-42/ (Folder containing data that was created at 11:42 on October 18th)
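A Glue crawler runs against the bucket every hour, and the folders represent the partitions in the generated table. I want to query only the last hour of data: if current_time is 12:05, I only want to query the data in partitions 2023-10-18-11-10 and 2023-10-18-11-42.
How can I achieve this for a given time?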
If you are querying the data with Athena, something like this should work:
WITH dataset AS (
SELECT '2023-10-16-10-08' AS partition_column UNION ALL
SELECT '2023-10-18-09-42' UNION ALL
SELECT '2023-10-18-10-42' UNION ALL
SELECT '2023-10-18-11-10' UNION ALL
SELECT '2023-10-19-11-42'
)
SELECT *
FROM dataset
WHERE
    CAST( -- Step 2: cast the constructed string to the timestamp type
        CONCAT( -- Step 1: convert 'YYYY-MM-DD-HH-MI' to a timestamp string 'YYYY-MM-DD HH:MI:SS'
            SUBSTRING(partition_column, 1, 10), -- gets 'YYYY-MM-DD'
            ' ',
            SUBSTRING(partition_column, 12, 2), -- gets 'HH' (hour)
            ':',
            SUBSTRING(partition_column, 15, 2), -- gets 'MI' (minute)
            ':00' -- adds seconds
        ) AS timestamp
    ) BETWEEN
        (date_trunc('hour', current_timestamp) - INTERVAL '1' HOUR) AND -- start of the previous hour
        date_trunc('hour', current_timestamp) -- start of the current hour
;
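This query uses current_timestamp, which returns the current time in UTC, and truncates it to the hour with date_trunc. If the current time is 2023-01-01 03:03:03, the truncated value becomes 2023-01-01 03:00:00, and subtracting one hour gives 2023-01-01 02:00:00. So the BETWEEN clause effectively becomes:
WHERE partition_column BETWEEN 2023-01-01 02:00:00 AND 2023-01-01 03:00:00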
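To sanity-check the window arithmetic outside Athena, here is a minimal Python sketch of the same logic. It is not part of the Athena solution: the function names and the PARTITION_FORMAT constant are made up for illustration, and the "current time" is hard-coded to the 12:05 example from the question rather than read from a clock.

```python
from datetime import datetime, timedelta

# Assumed folder-name layout, e.g. 2023-10-18-11-42 (year-month-day-hour-minute).
PARTITION_FORMAT = "%Y-%m-%d-%H-%M"


def previous_hour_window(now):
    """Return (start, end): start of the previous hour and start of the current hour."""
    end = now.replace(minute=0, second=0, microsecond=0)  # like date_trunc('hour', now)
    return end - timedelta(hours=1), end


def partitions_in_previous_hour(partitions, now):
    """Keep names whose parsed timestamp falls in the window (both ends inclusive, like BETWEEN)."""
    start, end = previous_hour_window(now)
    return [p for p in partitions if start <= datetime.strptime(p, PARTITION_FORMAT) <= end]


folders = ["2023-10-18-10-08", "2023-10-18-10-42", "2023-10-18-11-10", "2023-10-18-11-42"]
now = datetime(2023, 10, 18, 12, 5)  # hypothetical current time taken from the question
print(partitions_in_previous_hour(folders, now))
# prints ['2023-10-18-11-10', '2023-10-18-11-42']
```

Because both bounds are inclusive, a folder stamped exactly on the hour (for example 2023-10-18-12-00) would match two consecutive hourly windows. If that matters, a half-open comparison (>= start and < end) avoids the overlap.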