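I'm looking for a way to do a non-equi join in SQL, joining on whether col x in table A falls within a date range given by col y in table B. However, table B holds multiple possible ranges per ID, and the table is in long format. For example: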
# Table A
# | person_id | breakfast_date | fruit_eaten_for_breakfast |
# |-----------|----------------|---------------------------|
# | 1 | 2023-03-12 | banana |
# | 1 | 2023-03-25 | apple |
# | 1 | 2023-04-01 | orange |
# | 1 | 2023-04-05 | kiwi |
# | 1 | 2023-04-22 | grapefruit |
# | 2 | 2024-12-15 | strawberry |
# | 2 | 2024-01-11 | blueberry |
# | 2 | 2024-02-12 | mango |
# | 2 | 2024-02-29 | watermelon |
# | 2 | 2024-03-10 | pear |
# Table B
# | person_id | period_start_and_end | period |
# |-----------|----------------------|--------|
# | 1 | 2023-03-15 | 1 | first period for user_id = 1 started
# | 1 | 2023-03-30 | 1 | on March 15 and ended on March 30.
# | 1 | 2023-04-02 | 2 |
# | 1 | 2023-04-10 | 2 |
# | 1 | 2023-04-12 | 3 |
# | 1 | 2023-04-20 | 3 |
# | 2 | 2024-01-01 | 1 |
# | 2 | 2024-01-05 | 1 |
# | 2 | 2024-02-10 | 2 | second period for user_id = 2 started
# | 2 | 2024-02-13 | 2 | on Feb 10 and ended on Feb 13.
Note on Table B 👆: each person_id can have one or more periods, and when writing the SQL query we have no way of knowing how many periods each person has.
# Desired output
# | person_id | breakfast_date | fruit_eaten_for_breakfast | period |
# |-----------|----------------|---------------------------|--------|
# | 1 | 2023-03-25 | apple | 1 |
# | 1 | 2023-04-05 | kiwi | 2 |
# | 2 | 2024-02-12 | mango | 2 |
I'm using AWS Athena, which is based on Trino SQL. The sample data as CTEs:
WITH
table_a AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-12'), 'banana'),
(1, DATE('2023-03-25'), 'apple'),
(1, DATE('2023-04-01'), 'orange'),
(1, DATE('2023-04-05'), 'kiwi'),
(1, DATE('2023-04-22'), 'grapefruit'),
(2, DATE('2024-12-15'), 'strawberry'),
(2, DATE('2024-01-11'), 'blueberry'),
(2, DATE('2024-02-12'), 'mango'),
(2, DATE('2024-02-29'), 'watermelon'),
(2, DATE('2024-03-10'), 'pear')
) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),
table_b AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-15'), 1),
(1, DATE('2023-03-30'), 1),
(1, DATE('2023-04-02'), 2),
(1, DATE('2023-04-10'), 2),
(1, DATE('2023-04-12'), 3),
(1, DATE('2023-04-20'), 3),
(2, DATE('2024-01-01'), 1),
(2, DATE('2024-01-05'), 1),
(2, DATE('2024-02-10'), 2),
(2, DATE('2024-02-13'), 2)
) AS t(person_id, period_start_and_end, period)
)
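As for what I've tried so far: well, not much. The usual non-equi join I'm familiar with has the start date in one column and the end date in another. In the current case, though, each person_id has multiple periods, and the bigger problem is that we don't know how many. So even if I pivoted the table to wide format, I still wouldn't know how to handle the multiple, unknown number of periods for each person_id.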
You can pivot table_b to get one date range per period, and then join those ranges to table_a:
WITH
table_a AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-12'), 'banana'),
(1, DATE('2023-03-25'), 'apple'),
(1, DATE('2023-04-01'), 'orange'),
(1, DATE('2023-04-05'), 'kiwi'),
(1, DATE('2023-04-22'), 'grapefruit'),
(2, DATE('2024-12-15'), 'strawberry'),
(2, DATE('2024-01-11'), 'blueberry'),
(2, DATE('2024-02-12'), 'mango'),
(2, DATE('2024-02-29'), 'watermelon'),
(2, DATE('2024-03-10'), 'pear')
) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),
table_b AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-15'), 1),
(1, DATE('2023-03-30'), 1),
(1, DATE('2023-04-02'), 2),
(1, DATE('2023-04-10'), 2),
(1, DATE('2023-04-12'), 3),
(1, DATE('2023-04-20'), 3),
(2, DATE('2024-01-01'), 1),
(2, DATE('2024-01-05'), 1),
(2, DATE('2024-02-10'), 2),
(2, DATE('2024-02-13'), 2)
) AS t(person_id, period_start_and_end, period)
),
-- Collapse the start and end rows of each (person_id, period) into one range
table_b_pivot AS (
SELECT
person_id,
period,
MIN(period_start_and_end) AS from_date,
MAX(period_start_and_end) AS to_date
FROM table_b
GROUP BY person_id, period
)
-- Equality on person_id plus a range condition on the date
SELECT table_a.person_id, breakfast_date, fruit_eaten_for_breakfast, period
FROM table_a
JOIN table_b_pivot
ON table_a.person_id = table_b_pivot.person_id
AND table_a.breakfast_date BETWEEN table_b_pivot.from_date AND table_b_pivot.to_date
ORDER BY table_a.person_id, period
person_id | breakfast_date | fruit_eaten_for_breakfast | period |
---|---|---|---|
1 | 2023-03-25 | apple | 1 |
1 | 2023-04-05 | kiwi | 2 |
2 | 2024-02-12 | mango | 2 |
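
To see why the number of periods per person does not need to be known in advance, it may help to run the table_b_pivot step on its own. The sketch below assumes, as in the sample data, that every (person_id, period) pair has exactly two rows in table_b (a start and an end); under that assumption MIN and MAX recover the two boundaries, and GROUP BY yields one row per period however many periods exist.

```sql
-- Intermediate step only: one row per (person_id, period) with its date range.
-- Assumes each period has exactly two rows in table_b: its start and its end.
WITH table_b AS (
    SELECT * FROM (VALUES
        (1, DATE('2023-03-15'), 1),
        (1, DATE('2023-03-30'), 1),
        (1, DATE('2023-04-02'), 2),
        (1, DATE('2023-04-10'), 2),
        (1, DATE('2023-04-12'), 3),
        (1, DATE('2023-04-20'), 3),
        (2, DATE('2024-01-01'), 1),
        (2, DATE('2024-01-05'), 1),
        (2, DATE('2024-02-10'), 2),
        (2, DATE('2024-02-13'), 2)
    ) AS t(person_id, period_start_and_end, period)
)
SELECT
    person_id,
    period,
    MIN(period_start_and_end) AS from_date,  -- earlier of the two rows
    MAX(period_start_and_end) AS to_date     -- later of the two rows
FROM table_b
GROUP BY person_id, period
ORDER BY person_id, period
```

For the sample data this gives five ranges: 2023-03-15 to 2023-03-30, 2023-04-02 to 2023-04-10 and 2023-04-12 to 2023-04-20 for person 1, and 2024-01-01 to 2024-01-05 and 2024-02-10 to 2024-02-13 for person 2. Note that BETWEEN is inclusive on both ends, so a breakfast on a period's first or last day would also match.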