SQL - How to optimize a join between two dates?

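I'm currently working on a project that will provide monthly information about active insurance policies. The visualization will be done in a table. The data looks like this: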

Policy  Policy Sequence #    Effective Date    Expiration Date    $ amount     Record Insert Date
 a       0                    Jan-20            May-20             $1,000.00    1/1/2020
 a       0                    Jan-20            May-20             $1,500.00    1/1/2020
 a       1                    Jun-20            Dec-20             $2,000.00    6/1/2020
 a       2                    Jan-21            Feb-21             $2,500.00    1/1/2021

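The table contains millions of active policies per month, and each policy can have duplicate sequence numbers caused by entry errors. For each policy/sequence number combination, I use QUALIFY to take the most recently inserted record.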
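
A minimal sketch of that deduplication step, assuming the table is called Policy_Table and has the PolicyNumber, PolicySequence and InsertDate columns used in the full query further down. When two records share the same InsertDate, as in the sample rows for sequence 0, row_number() picks one of them arbitrarily unless a tie-breaker is added to the ORDER BY.

```sql
-- Keep only the most recently inserted record for each policy/sequence pair.
-- Assumes Policy_Table has PolicyNumber, PolicySequence and InsertDate columns.
select *
from Policy_Table
qualify row_number() over (
    partition by PolicyNumber, PolicySequence
    order by InsertDate desc
) = 1;
```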

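Because I'm trying to visualize monthly data, I need to break up the date ranges so that each monthly date carries the dollar amount of whichever policy version was in effect at that time. The policy/sequence numbers can also be dropped at this point, so everything is summed together. An example of the target dataset is below: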

Month   $ amount
Jan-20  $1,500.00
Feb-20  $1,500.00
Mar-20  $1,500.00
Apr-20  $1,500.00
May-20  $1,500.00
Jun-20  $2,000.00  <- Policy change here
Jul-20  $2,000.00
Aug-20  $2,000.00
Sep-20  $2,000.00
Oct-20  $2,000.00
Nov-20  $2,000.00
Dec-20  $2,000.00
Jan-21  $2,500.00  <- Here
Feb-21  $2,500.00
Mar-21  $2,500.00

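So far I have joined the policy table above to a base date table that contains a single column of month/year dates. The base date table looks like this: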

Month
1/1/2020
2/1/2020
3/1/2020
...

The full SQL is here:

select
B.Date1, 
ContractingFirm,
StateCode,
sum(PolicyCount) as PolicyCount, 
sum(DollarAmount) as DollarAmount

from BaseTable B
    join
    (
    select 
    PolicyNumber,
    PolicySequence,
    InsertDate,
    date_trunc(month, EffectiveDate) as EffectiveDate, 
    date_trunc(month, ExpirationDate) as ExpirationDate, 
    sum(1) as PolicyCount, 
    sum(DollarAmount) as DollarAmount, 
   
    from Policy_Table
    group by 
    PolicyNumber, 
    PolicySequence,
    InsertDate,
    EffectiveDate,
    ExpirationDate

    qualify row_number() over (partition by PolicyNumber, PolicySequence order by InsertDate desc) = 1

    ) PolicyTable on Date1 between PolicyTable.EffectiveDate and PolicyTable.ExpirationDate

where B.Date1 between '2020-01-01' and '2021-03-01'

group by 
B.Date1, 
ContractingFirm,
StateCode

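This query is extremely slow. The join appears to be where it slows down: a single month takes 5-10 minutes to run. The dollar amounts also come out in the trillions across thousands of rows; I'm not sure whether that is a problem. Does anyone have ideas on how to optimize this? I feel there must be a better way to expand the months into separate rows than using that join!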

Thanks for reading :D

sql snowflake-cloud-data-platform etl
1 Answer

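You don't want to aggregate data that falls outside the start and end dates, so don't include those rows. If you use the array_generate_range function, you don't really need the date table either. Something like this (not exact, since you didn't indicate where ContractingFirm and StateCode come from):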

with t0 as (
    -- Reporting window; cast to date so dateadd/datediff operate on dates
    select
        '2020-01-01'::date as start_date,
        '2021-03-01'::date as end_date,
        datediff(month, start_date, end_date) + 1 as month_count
), t1 (bdate1) as (
    -- One row per month in the window, so no date table is needed
    select dateadd(month, a.index, t0.start_date)
    from t0, lateral flatten(input => array_generate_range(0, t0.month_count)) a
), t2 as (
    select
        t1.bdate1,
        p.PolicyNumber,
        p.PolicySequence,
        p.InsertDate,
        -- Assumption: ContractingFirm and StateCode are columns of Policy_Table
        p.ContractingFirm,
        p.StateCode,
        sum(1) as PolicyCount,
        sum(p.DollarAmount) as DollarAmount
    from Policy_Table p
    join t1
        -- Compare month-truncated dates so mid-month policies still match
        on t1.bdate1 between date_trunc(month, p.EffectiveDate)
                         and date_trunc(month, p.ExpirationDate)
    group by
        t1.bdate1,
        p.PolicyNumber,
        p.PolicySequence,
        p.InsertDate,
        p.ContractingFirm,
        p.StateCode
    -- Latest inserted record per policy/sequence within each month
    qualify row_number() over (
        partition by t1.bdate1, p.PolicyNumber, p.PolicySequence
        order by p.InsertDate desc
    ) = 1
)
select
    bdate1,
    ContractingFirm,
    StateCode,
    sum(PolicyCount) as PolicyCount,
    sum(DollarAmount) as DollarAmount
from t2
group by all
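
To sanity-check the month generator on its own before joining it to the policy data, a small standalone query along these lines may help. It is only a sketch: the literal 15 is the number of months from 2020-01-01 through 2021-03-01 inclusive (the same window as in the question), and the alias month_start is made up for illustration.

```sql
-- Generate one row per month from 2020-01-01 through 2021-03-01.
-- array_generate_range(0, 15) yields the offsets 0..14 (the stop value is exclusive).
select dateadd(month, f.value::int, '2020-01-01'::date) as month_start
from table(flatten(input => array_generate_range(0, 15))) f
order by month_start;
```

If this returns the expected first-of-month dates, the same expression can stand in for the base date table.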