How do I perform a non-equi join in SQL against multiple time ranges per group?


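I'm looking for a way to perform a non-equi join in SQL, joining on whether col x in table A falls within a date range given by col y in table B. However, table B has multiple possible ranges per ID, and the table is in long format. For example: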

# Table A
# | person_id | breakfast_date | fruit_eaten_for_breakfast |
# |-----------|----------------|---------------------------|
# | 1         | 2023-03-12     | banana                    |
# | 1         | 2023-03-25     | apple                     |
# | 1         | 2023-04-01     | orange                    |
# | 1         | 2023-04-05     | kiwi                      |
# | 1         | 2023-04-22     | grapefruit                |
# | 2         | 2024-12-15     | strawberry                |
# | 2         | 2024-01-11     | blueberry                 |
# | 2         | 2024-02-12     | mango                     |
# | 2         | 2024-02-29     | watermelon                |
# | 2         | 2024-03-10     | pear                      |


# Table B
# | person_id | period_start_and_end | period |
# |-----------|----------------------|--------|
# | 1         | 2023-03-15           | 1      | first period for person_id = 1 started
# | 1         | 2023-03-30           | 1      | on March 15 and ended on March 30.
# | 1         | 2023-04-02           | 2      |
# | 1         | 2023-04-10           | 2      |
# | 1         | 2023-04-12           | 3      |
# | 1         | 2023-04-20           | 3      |
# | 2         | 2024-01-01           | 1      |
# | 2         | 2024-01-05           | 1      |
# | 2         | 2024-02-10           | 2      | second period for person_id = 2 started
# | 2         | 2024-02-13           | 2      | on Feb 10 and ended on Feb 13.

A note about table B 👆: each person_id can have one or more period values, and when writing the SQL query we cannot know in advance how many periods each person has.

Expected output

# | person_id | breakfast_date | fruit_eaten_for_breakfast | period |
# |-----------|----------------|---------------------------|--------|
# | 1         | 2023-03-25     | apple                     | 1      |
# | 1         | 2023-04-05     | kiwi                      | 2      |
# | 2         | 2024-02-12     | mango                     | 2      |

SQL dialect

I'm using AWS Athena, which is based on Trino SQL.

Reproducible data

WITH

table_a AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-12'), 'banana'),
    (1, DATE('2023-03-25'), 'apple'),
    (1, DATE('2023-04-01'), 'orange'),
    (1, DATE('2023-04-05'), 'kiwi'),
    (1, DATE('2023-04-22'), 'grapefruit'),
    (2, DATE('2024-12-15'), 'strawberry'),
    (2, DATE('2024-01-11'), 'blueberry'),
    (2, DATE('2024-02-12'), 'mango'),
    (2, DATE('2024-02-29'), 'watermelon'),
    (2, DATE('2024-03-10'), 'pear')
  ) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),

table_b AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-15'), 1),
    (1, DATE('2023-03-30'), 1),
    (1, DATE('2023-04-02'), 2),
    (1, DATE('2023-04-10'), 2),
    (1, DATE('2023-04-12'), 3),
    (1, DATE('2023-04-20'), 3),
    (2, DATE('2024-01-01'), 1),
    (2, DATE('2024-01-05'), 1),
    (2, DATE('2024-02-10'), 2),
    (2, DATE('2024-02-13'), 2)
  ) AS t(person_id, period_start_and_end, period)
)

What I've tried so far

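Well, not much. The usual non-equi join I'm familiar with has the start date in one column and the end date in another. In the current case, though, each person_id has multiple periods and, more problematic, we don't know how many. So even if I "pivot" the table to wide format, I still don't know how to handle the multiple and unknown number of periods for each person_id.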
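
For reference, the familiar case described above might look like the following minimal sketch. The table table_b_wide and its columns period_start and period_end are hypothetical names used only for illustration, and the data is a trimmed subset of the sample rows; this shows the layout where each range already sits on a single row, not the long format in question.

```sql
-- Sketch only: table_b_wide, period_start and period_end are hypothetical
-- names illustrating the "one row per range" layout.
WITH

table_a AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-12'), 'banana'),
    (1, DATE('2023-03-25'), 'apple'),
    (1, DATE('2023-04-05'), 'kiwi')
  ) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),

table_b_wide AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-15'), DATE('2023-03-30'), 1),
    (1, DATE('2023-04-02'), DATE('2023-04-10'), 2)
  ) AS t(person_id, period_start, period_end, period)
)

-- Equality on person_id plus a range condition on the date
SELECT a.person_id, a.breakfast_date, a.fruit_eaten_for_breakfast, b.period
FROM table_a AS a
JOIN table_b_wide AS b
  ON a.person_id = b.person_id
  AND a.breakfast_date BETWEEN b.period_start AND b.period_end
ORDER BY a.person_id, a.breakfast_date
```

BETWEEN is inclusive on both ends, so a breakfast eaten on the first or last day of a period would match that period.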

sql join amazon-athena presto trino
1 Answer

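You can pivot table_b so that each (person_id, period) pair becomes a single row with a start date and an end date, then join table_a to those ranges. Because the pivot is a GROUP BY, it works no matter how many periods each person has: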

WITH

table_a AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-12'), 'banana'),
    (1, DATE('2023-03-25'), 'apple'),
    (1, DATE('2023-04-01'), 'orange'),
    (1, DATE('2023-04-05'), 'kiwi'),
    (1, DATE('2023-04-22'), 'grapefruit'),
    (2, DATE('2024-12-15'), 'strawberry'),
    (2, DATE('2024-01-11'), 'blueberry'),
    (2, DATE('2024-02-12'), 'mango'),
    (2, DATE('2024-02-29'), 'watermelon'),
    (2, DATE('2024-03-10'), 'pear')
  ) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),

table_b AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-15'), 1),
    (1, DATE('2023-03-30'), 1),
    (1, DATE('2023-04-02'), 2),
    (1, DATE('2023-04-10'), 2),
    (1, DATE('2023-04-12'), 3),
    (1, DATE('2023-04-20'), 3),
    (2, DATE('2024-01-01'), 1),
    (2, DATE('2024-01-05'), 1),
    (2, DATE('2024-02-10'), 2),
    (2, DATE('2024-02-13'), 2)
  ) AS t(person_id, period_start_and_end, period)
),

-- Collapse the start and end rows of each (person_id, period) into one row
table_b_pivot AS (
  SELECT
    person_id,
    period,
    MIN(period_start_and_end) AS from_date,
    MAX(period_start_and_end) AS to_date
  FROM table_b
  GROUP BY person_id, period
)

-- Keep only breakfasts whose date falls inside one of that person's periods
SELECT a.person_id, a.breakfast_date, a.fruit_eaten_for_breakfast, p.period
FROM table_a AS a
JOIN table_b_pivot AS p
  ON a.person_id = p.person_id
  AND a.breakfast_date BETWEEN p.from_date AND p.to_date
ORDER BY a.person_id, p.period
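
Result:

# | person_id | breakfast_date | fruit_eaten_for_breakfast | period |
# |-----------|----------------|---------------------------|--------|
# | 1         | 2023-03-25     | apple                     | 1      |
# | 1         | 2023-04-05     | kiwi                      | 2      |
# | 2         | 2024-02-12     | mango                     | 2      |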
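
The MIN/MAX pivot implicitly assumes that every (person_id, period) pair has exactly two rows in table_b, one for the start and one for the end. A minimal sketch of a check for that assumption, run against the same sample data (boundary_rows is just an illustrative alias), could look like this:

```sql
WITH

table_b AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-15'), 1),
    (1, DATE('2023-03-30'), 1),
    (1, DATE('2023-04-02'), 2),
    (1, DATE('2023-04-10'), 2),
    (1, DATE('2023-04-12'), 3),
    (1, DATE('2023-04-20'), 3),
    (2, DATE('2024-01-01'), 1),
    (2, DATE('2024-01-05'), 1),
    (2, DATE('2024-02-10'), 2),
    (2, DATE('2024-02-13'), 2)
  ) AS t(person_id, period_start_and_end, period)
)

-- List any period that does not have exactly one start row and one end row
SELECT person_id, period, COUNT(*) AS boundary_rows
FROM table_b
GROUP BY person_id, period
HAVING COUNT(*) <> 2
```

For the sample data this returns no rows. If a period had only one row, the pivot would produce a range whose start and end are the same day; with more than two rows, it would silently span from the earliest to the latest date.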