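I have a table containing person IDs and date ranges (a start date and an end date). Each person may have multiple rows, each with its own start and end date.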
create table #DateRanges (
tableID int not null,
personID int not null,
startDate date,
endDate date
);
insert #DateRanges (tableID, personID, startDate, endDate)
values (1, 100, '2011-01-01', '2011-01-31') -- Just January
, (2, 100, '2011-02-01', '2011-02-28') -- Just February
, (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
, (4, 100, '2011-05-01', '2011-05-31') -- May
, (5, 100, '2011-06-01', '2011-12-31') -- June through December
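I need a way to collapse adjacent date ranges (where the end date of one row is exactly one day before the start date of the next row). It has to merge all contiguous ranges and split only where the gap between an end date and the next start date is more than one day. The data above should collapse to: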
+-----------+----------+--------------+------------+
| SomeNewID | PersonID | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
| 1 | 100 | 2011-01-01 | 2011-02-28 |
+-----------+----------+--------------+------------+
| 2 | 100 | 2011-04-01 | 2011-12-31 |
+-----------+----------+--------------+------------+
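Only two rows, because the only missing range is March. If all of March were present, whether as one row or several, the collapse would produce just one row. But if only two days in the middle of March were present, we would get a third row showing those March dates.
I have been working with the LEAD and LAG functions in SQL 2016 to try to do this as a set-based operation, but so far I have come up empty. I would like to do it without loops and RBAR (row-by-agonizing-row processing), but I don't see a solution.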
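As a point of reference outside SQL, the collapsing rule described above can be sketched procedurally in Python. This only illustrates the expected result; it is not the set-based solution being asked for. The function name `collapse_ranges` and the tuple layout `(personID, startDate, endDate)` are assumptions made for the example, and the sketch also merges overlapping ranges, which the question does not mention.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def collapse_ranges(rows):
    """Merge ranges of the same person whose gap is at most one day.

    rows: iterable of (person_id, start_date, end_date) tuples (assumed layout).
    Returns a list of collapsed (person_id, start_date, end_date) tuples.
    """
    collapsed = []
    # Sorting by (person, start, end) puts candidate neighbours next to each other.
    for person_id, start, end in sorted(rows):
        last = collapsed[-1] if collapsed else None
        if last and last[0] == person_id and start <= last[2] + ONE_DAY:
            # Adjacent (or overlapping) range: extend the current island.
            collapsed[-1] = (person_id, last[1], max(last[2], end))
        else:
            # Gap of more than one day, or a new person: start a new island.
            collapsed.append((person_id, start, end))
    return collapsed


# Sample data from the question.
rows = [
    (100, date(2011, 1, 1), date(2011, 1, 31)),
    (100, date(2011, 2, 1), date(2011, 2, 28)),
    (100, date(2011, 4, 1), date(2011, 4, 30)),
    (100, date(2011, 5, 1), date(2011, 5, 31)),
    (100, date(2011, 6, 1), date(2011, 12, 31)),
]

for person_id, start, end in collapse_ranges(rows):
    print(person_id, start, end)
```

For the sample data this yields the two ranges shown in the table above: January through February, and April through December.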
You can use lag to work out the right buckets and then group by them, as follows:
;with cte1 as (
select *, dtdiff = datediff(day, lag(endDate) over (partition by personID order by startDate), startDate) -- Days from the previous row's end date to this row's start date
from #DateRanges
),
cte2 as (
select *, grp = sum(case when dtdiff is null or dtdiff > 1 then 1 else 0 end) over (partition by personID order by startDate) -- Start a new bucket for the first row and whenever the gap exceeds one day
from cte1
)
select SomeNewId = Row_Number() over (order by PersonId, min(startDate)), PersonId, NewStartDate = min(startDate), NewEndDate = max(endDate) -- Min/max per bucket
from cte2 group by PersonId, grp
Output for your data:
+-----------+----------+--------------+------------+
| SomeNewId | Personid | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
| 1 | 100 | 2011-01-01 | 2011-02-28 |
| 2 | 100 | 2011-04-01 | 2011-12-31 |
+-----------+----------+--------------+------------+
My test input:
insert #DateRanges (tableID, personID, startDate, endDate)
values (1, 100, '2011-01-01', '2011-01-31') -- Just January
, (2, 100, '2011-02-01', '2011-02-28') -- Just February
, (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
, (4, 100, '2011-05-01', '2011-05-31') -- May
, (5, 100, '2011-06-01', '2011-06-30') -- More gaps
, (6, 100, '2011-07-01', '2011-07-31') -- More gaps
, (7, 100, '2011-08-01', '2011-08-31') -- More gaps
, (8, 100, '2011-10-01', '2011-10-31') -- More gaps
, (9, 100, '2011-11-01', '2011-11-30') -- More gaps
Output for the test data:
+-----------+----------+--------------+------------+
| SomeNewId | Personid | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
| 1 | 100 | 2011-01-01 | 2011-02-28 |
| 2 | 100 | 2011-04-01 | 2011-08-31 |
| 3 | 100 | 2011-10-01 | 2011-11-30 |
+-----------+----------+--------------+------------+
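After a few days of research, I think I have a solution, which I want to share in case anyone else needs something similar. I use a few CTEs to find the lead, lag, and gap, filter the rows down to only the significant start and end rows, and then use lead once more to find the collapsed start and end dates. There may be a simpler way, but I think this handles day-level resolution well.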
with LeadAndLagAndGap as (
select
tableID,
personID,
startDate,
endDate,
-- Whole days missing between the previous row's end and this row's start (null for the first row)
datediff(day, lag(endDate) over (partition by personID order by startDate), startDate) - 1 as gapBefore,
-- Whole days missing between this row's end and the next row's start (null for the last row)
datediff(day, endDate, lead(startDate) over (partition by personID order by startDate)) - 1 as gapAfter
from
#DateRanges
), OnlyStartAndEndRows as (
select
tableID,
personID,
startDate,
endDate,
gapBefore,
gapAfter
from
LeadAndLagAndGap
where
gapBefore is null -- First record for the person, so the start of a range
or gapBefore > 0 -- Preceded by a gap, so the start of a range
or gapAfter is null -- Last record for the person, so the end of a range
or gapAfter > 0 -- Followed by a gap, so the end of a range
), Collapsed as (
select
tableID,
personID,
startDate,
gapBefore,
case
when gapAfter is null or gapAfter > 0 then endDate -- This row also ends its range
else lead(endDate) over (partition by personID order by startDate) -- Otherwise the next kept row is the matching end row
end as endDate
from
OnlyStartAndEndRows
)
select tableID, personID, startDate, endDate
from Collapsed
where gapBefore is null or gapBefore > 0 -- Keep only the rows that start a range
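Although this question is quite old, I believe it still deserves an answer.
If your data is guaranteed to contain no overlapping ranges, you can collapse the ranges in 3 steps:
1. Flag each row that starts a new range, that is, a row whose start date is not the day after the previous row's end date.
2. Give each row a sequence ID by taking a running count of those flags per person.
3. Group by person and sequence ID, taking the minimum start date and the maximum end date.
Here is a code example for SQLite: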
with RangesWithStart as (
select
*
, coalesce(
date(
lag(endDate) over (
partition by personID
order by startDate
)
, '+1 day'
) <> startDate
, true
) as isRangeStart
from
DateRanges
)
, RangesWithSequences as (
select
*
, count(nullif(isRangeStart,false)) over (
partition by personID
order by startDate
rows unbounded preceding
) as rangeSeqenceID
from
RangesWithStart
)
select
rangeSeqenceID as newID
, personID
, min(startDate) as startDate
, max(endDate) as endDate
from RangesWithSequences
group by 1,2
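To try the SQLite query against the sample data from the question, a small harness can be written with Python's standard `sqlite3` module. This is a minimal sketch under some assumptions: the bundled SQLite library is version 3.25 or newer (window functions and the `true`/`false` keywords need it), the table is named `DateRanges` as in the query, and an `order by` has been appended purely to make the output order deterministic.

```python
import sqlite3

# The query from the answer, with an ORDER BY appended for a stable row order.
QUERY = """
with RangesWithStart as (
    select *
         , coalesce(
               date(lag(endDate) over (partition by personID order by startDate), '+1 day') <> startDate
             , true
           ) as isRangeStart
    from DateRanges
)
, RangesWithSequences as (
    select *
         , count(nullif(isRangeStart, false)) over (
               partition by personID
               order by startDate
               rows unbounded preceding
           ) as rangeSeqenceID
    from RangesWithStart
)
select rangeSeqenceID as newID
     , personID
     , min(startDate) as startDate
     , max(endDate) as endDate
from RangesWithSequences
group by 1, 2
order by personID, newID
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table DateRanges (
    tableID int not null,
    personID int not null,
    startDate date,
    endDate date
);
insert into DateRanges (tableID, personID, startDate, endDate) values
    (1, 100, '2011-01-01', '2011-01-31'),
    (2, 100, '2011-02-01', '2011-02-28'),
    (3, 100, '2011-04-01', '2011-04-30'),
    (4, 100, '2011-05-01', '2011-05-31'),
    (5, 100, '2011-06-01', '2011-12-31');
""")

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

With the question's five rows, the query should return two collapsed ranges for person 100: 2011-01-01 through 2011-02-28 and 2011-04-01 through 2011-12-31.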