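I'm currently working on a project that will provide monthly information about in-force insurance policies. The visualization will be done in a table. The data looks like this: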
Policy  Policy Sequence #  Effective Date  Expiration Date  $ amount   Record Insert Date
a       0                  Jan-20          May-20           $1,000.00  1/1/2020
a       0                  Jan-20          May-20           $1,500.00  1/1/2020
a       1                  Jun-20          Dec-20           $2,000.00  6/1/2020
a       2                  Jan-21          Feb-21           $2,500.00  1/1/2021
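The table contains millions of in-force policies per month, and each policy has sequence numbers that may be duplicated because of data entry errors. For each policy/sequence number combination, I use QUALIFY to keep the most recently inserted record.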
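As a minimal sketch of that deduplication step on its own, assuming Snowflake and a few made-up rows (the CTE name sample_policy and its values are hypothetical, not taken from the real table):

```sql
-- Hypothetical sample rows: two versions of policy a / sequence 0, inserted on different days.
with sample_policy as (
    select
        column1 as PolicyNumber,
        column2 as PolicySequence,
        column3 as DollarAmount,
        to_date(column4) as InsertDate
    from values
        ('a', 0, 1000.00, '2020-01-01'),
        ('a', 0, 1500.00, '2020-01-02'),
        ('a', 1, 2000.00, '2020-06-01')
)
select *
from sample_policy
-- keep only the most recently inserted row per policy/sequence
qualify row_number() over (
    partition by PolicyNumber, PolicySequence
    order by InsertDate desc
) = 1;
```

With these rows, sequence 0 keeps the 1500.00 record and sequence 1 keeps its only record. If two duplicates share the same InsertDate, row_number() picks one of them arbitrarily, so a tie-breaker column in the order by would be needed for a repeatable result.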
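Since I'm trying to visualize monthly data, I need to break up the date ranges so that each month gets one row carrying the dollar amount of whichever policy version was in force at that time. The policy/sequence number can also be dropped at this point, so everything is summed together. An example of the target data set is below: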
Month $ amount
Jan-20 $1,500.00
Feb-20 $1,500.00
Mar-20 $1,500.00
Apr-20 $1,500.00
May-20 $1,500.00
Jun-20 $2,000.00 <- Policy change here
Jul-20 $2,000.00
Aug-20 $2,000.00
Sep-20 $2,000.00
Oct-20 $2,000.00
Nov-20 $2,000.00
Dec-20 $2,000.00
Jan-21 $2,500.00 <- Here
Feb-21 $2,500.00
Mar-21 $2,500.00
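So far, I have joined the policy table above to a base date table that contains a single column of month/year dates. The base date table looks like this: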
Month
1/1/2020
2/1/2020
3/1/2020
...
The full SQL is here:
select
    B.Date1,
    ContractingFirm,
    StateCode,
    sum(PolicyCount) as PolicyCount,
    sum(DollarAmount) as DollarAmount
from BaseTable B
join
(
    select
        PolicyNumber,
        PolicySequence,
        InsertDate,
        date_trunc(month, EffectiveDate) as EffectiveDate,
        date_trunc(month, ExpirationDate) as ExpirationDate,
        sum(1) as PolicyCount,
        sum(DollarAmount) as DollarAmount
    from Policy_Table
    group by
        PolicyNumber,
        PolicySequence,
        InsertDate,
        EffectiveDate,
        ExpirationDate
    qualify row_number() over (partition by PolicyNumber, PolicySequence order by InsertDate desc) = 1
) PolicyTable on B.Date1 between PolicyTable.EffectiveDate and PolicyTable.ExpirationDate
where B.Date1 between '2020-01-01' and '2021-03-01'
group by
    B.Date1,
    ContractingFirm,
    StateCode
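This query is very slow. The join appears to be where it slows down: a single month takes 5-10 minutes to run. The dollar amounts also run into the trillions across thousands of rows; I'm not sure whether that could be a problem. Does anyone have ideas on how to optimize this? I feel there must be a better way to expand the months into separate rows than using that join!
Thanks for reading :D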
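You don't want to aggregate data that falls outside the start and end dates, so don't include those rows in the first place. If you use the array_generate_range function, you don't really need a date table either. Something like this (not exact, because you didn't say where ContractingFirm and StateCode come from):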
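with t0 as (
    -- reporting window; n_months is the number of month rows to generate
    select
        '2020-01-01'::date as start_date,
        '2021-03-01'::date as end_date,
        datediff(month, start_date, end_date) + 1 as n_months
), t1 (bdate1) as (
    -- one row per month in the window, so no date table is needed
    select dateadd(month, a.value::int, t0.start_date)
    from t0, lateral flatten(input => array_generate_range(0, t0.n_months)) a
), latest as (
    -- latest inserted record per policy/sequence, deduplicated before expanding to months
    select
        PolicyNumber,
        PolicySequence,
        InsertDate,
        ContractingFirm,  -- assumed to be a Policy_Table column; the question does not say
        StateCode,        -- same assumption
        date_trunc(month, EffectiveDate) as EffectiveMonth,
        date_trunc(month, ExpirationDate) as ExpirationMonth,
        sum(1) as PolicyCount,
        sum(DollarAmount) as DollarAmount
    from Policy_Table
    group by all
    qualify row_number() over (partition by PolicyNumber, PolicySequence order by InsertDate desc) = 1
), t2 as (
    -- only months inside each policy's effective range survive the join
    select t1.bdate1, latest.ContractingFirm, latest.StateCode, latest.PolicyCount, latest.DollarAmount
    from latest
    join t1 on t1.bdate1 between latest.EffectiveMonth and latest.ExpirationMonth
)
select
    bdate1,
    ContractingFirm,
    StateCode,
    sum(PolicyCount) as PolicyCount,
    sum(DollarAmount) as DollarAmount
from t2
group by all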
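
To sanity-check the month generation by itself before joining it to the policy data, a standalone sketch like the following may help. It assumes Snowflake; the literal 15 is simply the number of months from 2020-01-01 through 2021-03-01 inclusive, and month_start is a made-up column name.

```sql
-- array_generate_range(0, 15) yields the offsets 0 through 14;
-- each offset is added as a number of months to the window start.
select dateadd(month, a.value::int, '2020-01-01'::date) as month_start
from table(flatten(input => array_generate_range(0, 15))) a
order by month_start;
```

This should return one row per month from 2020-01-01 to 2021-03-01, the same shape as the base date table in the question.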