SQL Server: collapsing adjacent date ranges

Question (votes: 0, answers: 3)

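I have a table that contains person IDs and date ranges (a start date and an end date). Each person may have multiple rows, with multiple start and end dates.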

create table #DateRanges (
   tableID   int not null,
   personID  int not null,
   startDate date,
   endDate   date
);
insert #DateRanges (tableID, personID, startDate, endDate)
values (1, 100, '2011-01-01', '2011-01-31') -- Just January
     , (2, 100, '2011-02-01', '2011-02-28') -- Just February
     , (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
     , (4, 100, '2011-05-01', '2011-05-31') -- May
     , (5, 100, '2011-06-01', '2011-12-31') -- June through December

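I need a way to collapse adjacent date ranges (where the end date of one row is exactly one day before the start date of the next row). It must merge every run of contiguous ranges and split only where the gap between an end date and the next start date is more than one day. The data above should collapse to: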

+-----------+----------+--------------+------------+
| SomeNewID | PersonID | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|        1  |     100  |   2011-01-01 | 2011-02-28 |
+-----------+----------+--------------+------------+
|        2  |     100  |   2011-04-01 | 2011-12-31 |
+-----------+----------+--------------+------------+

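There are only two rows because the only missing range is March. If all of March were present, whether as one row or several, the collapse would produce a single row. If only two days in the middle of March were present, we would get a third row showing those March dates.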

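I have been trying to do this as a set-based operation with the LEAD and LAG functions in SQL Server 2016, but so far I have come up empty. I would like to do it without loops and RBAR, but I do not see a solution.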

sql-server compression range lag lead
3 Answers

0 votes

You can use lag to work out the correct buckets and then group by them, as follows:

;with cte1 as (
    -- Days from the previous row's end date to this row's start date (null for each person's first row)
    select *, dtdiff = datediff(day, lag(enddate) over (partition by personid order by startdate), startdate)
    from #DateRanges
),
cte2 as (
    -- A row opens a new bucket when it is the person's first row or follows a gap of more than one day;
    -- the running sum of those flags numbers the buckets per person
    select *, grp = sum(case when dtdiff is null or dtdiff > 1 then 1 else 0 end)
                    over (partition by personid order by startdate)
    from cte1
)
-- Collapse each bucket to its earliest start date and latest end date
select SomeNewId = row_number() over (order by personid, min(startdate)),
       Personid, NewStartDate = min(startdate), NewEndDate = max(enddate)
from cte2
group by personid, grp

Output for your data:

+-----------+----------+--------------+------------+
| SomeNewId | Personid | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|         1 |      100 | 2011-01-01   | 2011-02-28 |
|         2 |      100 | 2011-04-01   | 2011-12-31 |
+-----------+----------+--------------+------------+

My test input:

insert #DateRanges (tableID, personID, startDate, endDate)
values (1, 100, '2011-01-01', '2011-01-31') -- Just January
     , (2, 100, '2011-02-01', '2011-02-28') -- Just February
     , (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
     , (4, 100, '2011-05-01', '2011-05-31') -- May
     , (5, 100, '2011-06-01', '2011-06-30') -- More gaps
     , (6, 100, '2011-07-01', '2011-07-31') -- More gaps
     , (7, 100, '2011-08-01', '2011-08-31') -- More gaps
     , (8, 100, '2011-10-01', '2011-10-31') -- More gaps
     , (9, 100, '2011-11-01', '2011-11-30') -- More gaps

Output for the test data:

+-----------+----------+--------------+------------+
| SomeNewId | Personid | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|         1 |      100 | 2011-01-01   | 2011-02-28 |
|         2 |      100 | 2011-04-01   | 2011-08-31 |
|         3 |      100 | 2011-10-01   | 2011-11-30 |
+-----------+----------+--------------+------------+
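
For anyone without a SQL Server instance at hand, the lag-then-running-sum idea can be tried in SQLite through Python's sqlite3 module. The following is only a sketch under assumptions: it compares each start date with the previous row's end date, uses `julianday()` differences in place of `datediff`, reads from a plain table named `DateRanges` instead of the temp table, and needs SQLite 3.25 or later for window functions. The helper name `collapse` is hypothetical.

```python
import sqlite3

# Test rows from this answer: (tableID, personID, startDate, endDate).
ROWS = [
    (1, 100, "2011-01-01", "2011-01-31"),
    (2, 100, "2011-02-01", "2011-02-28"),
    (3, 100, "2011-04-01", "2011-04-30"),
    (4, 100, "2011-05-01", "2011-05-31"),
    (5, 100, "2011-06-01", "2011-06-30"),
    (6, 100, "2011-07-01", "2011-07-31"),
    (7, 100, "2011-08-01", "2011-08-31"),
    (8, 100, "2011-10-01", "2011-10-31"),
    (9, 100, "2011-11-01", "2011-11-30"),
]

QUERY = """
with flagged as (
    select personID, startDate, endDate,
           -- days between the previous row's end and this row's start
           julianday(startDate) - julianday(
               lag(endDate) over (partition by personID order by startDate)
           ) as gapDays
    from DateRanges
),
bucketed as (
    select *,
           -- running count of range starts = bucket number per person
           sum(case when gapDays is null or gapDays > 1 then 1 else 0 end)
               over (partition by personID order by startDate) as grp
    from flagged
)
select personID, min(startDate), max(endDate)
from bucketed
group by personID, grp
order by personID, min(startDate)
"""


def collapse(rows):
    """Load rows into an in-memory SQLite table and return the collapsed ranges."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table DateRanges "
        "(tableID int, personID int, startDate text, endDate text)"
    )
    conn.executemany("insert into DateRanges values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in collapse(ROWS):
        print(row)
```

Dates are stored as ISO-8601 text here, so `min` and `max` on the strings order them chronologically.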

0 votes

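After several days of research, I think I have a solution that I want to share in case anyone else needs something similar. I use a few CTEs to find the lead, lag, and gap values, reduce the rows to only the significant start and end dates, and then use further lead and lag calls to find the collapsed start and end dates. There may be a simpler way, but I think this handles day-level resolution well.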

with LeadAndLagAndGap as (
   select
      tableid,
      personID,
      startDate,
      endDate,
      lag(endDate) over (partition by personID order by startDate) as previousEnd,
      lead(startDate) over (partition by personID order by startDate) as nextStart,
      coalesce(datediff(day,endDate,lead(startDate) over (partition by personID order by startDate))-1,0) as gap
   from
      #DateRanges
), OnlyStartAndEndRows as (
   select
      tableid,
      personID,
      startDate,
      endDate,
      previousEnd,
      nextStart,
      gap
   from
      LeadAndLagAndGap
   where
      previousEnd is null  -- Definitely FIRST record in a range
      or nextStart is null -- Definitely LAST record in a range
      or gap > 0           -- Definitely an end of a range, nextStart is definitely the start of a range.
), PreCollapseReaggregate as (
   select
      tableid,
      personID,
      startDate,
      endDate,
      previousEnd,
      nextStart,
      gap,
      case
         when previousEnd is null then startDate
         when gap > 0 then nextStart
      end as DefiniteStart,
      case
         when nextStart is null then endDate
         when gap > 0 then endDate
      end as DefiniteEnd
   from
      OnlyStartAndEndRows
), Collapsed as (
   select
      tableid,
      personID,
      DefiniteStart as startDate,
      case
         when definiteEnd is null or gap > 0 then lead(definiteEnd) over (partition by personid order by startdate)
         when definiteStart is not null and DefiniteEnd is not null then definiteEnd
      end as endDate
     from PreCollapseReaggregate
)
select * from Collapsed
where enddate is not null
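
Because the query above goes through several CASE branches, it may be worth comparing its results on small inputs with a plain procedural statement of the rule, for example for a person with a single row or a person whose very first range is followed by a gap. The following Python sketch is meant only as such a reference for testing, not as a replacement for the set-based query. The function name `collapse_ranges` is hypothetical, and the sketch assumes rows arrive as `(personID, startDate, endDate)` tuples of `date` objects; it also merges overlapping ranges, which the question does not ask for.

```python
from datetime import date, timedelta


def collapse_ranges(rows):
    """Merge ranges per person when the next start is at most one day after the current end."""
    merged = []
    # Sorting the tuples orders rows by person, then start date, then end date.
    for person, start, end in sorted(rows):
        last = merged[-1] if merged else None
        if last and last[0] == person and start - last[2] <= timedelta(days=1):
            # Adjacent (or overlapping) range: extend the current collapsed range.
            last[2] = max(last[2], end)
        else:
            # First row for this person, or a gap of more than one day: start a new range.
            merged.append([person, start, end])
    return [tuple(m) for m in merged]


if __name__ == "__main__":
    question_rows = [
        (100, date(2011, 1, 1), date(2011, 1, 31)),
        (100, date(2011, 2, 1), date(2011, 2, 28)),
        (100, date(2011, 4, 1), date(2011, 4, 30)),
        (100, date(2011, 5, 1), date(2011, 5, 31)),
        (100, date(2011, 6, 1), date(2011, 12, 31)),
    ]
    for row in collapse_ranges(question_rows):
        print(row)
```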

0 votes

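Although this question is quite dated, I believe it still deserves an answer.

If your data guarantees that the ranges do not overlap, you can collapse them in three steps: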

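  1. Flag the rows that start a new range
  2. Assign every row to its sequence of contiguous ranges
  3. Group the entries of each sequence
  • For (1), use the lag() window function ordered by startDate (plus any partitioning criteria) to flag every row that has no directly preceding range.
  • For (2), use count() as a window function to count how many new range sequences start at or before the current row, and use that count as the sequence ID. The trick is to restrict which rows are counted, which is what the frame `rows unbounded preceding` does.
  • For (3), apply a group by on the sequence ID and the partitioning criteria.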

Here is a code example for SQLite:

with RangesWithStart as (
  select
    *
    , coalesce( 
        date(
          lag(endDate) over (
            partition by personID 
            order by startDate
          )
          , '+1 day'
        ) <> startDate
        , true
      ) as isRangeStart
  from 
    DateRanges
)
, RangesWithSequences as (
  select 
    *
    , count(nullif(isRangeStart,false)) over (
        partition by personID 
        order by startDate 
        rows unbounded preceding
      ) as rangeSequenceID
  from 
    RangesWithStart
)
select 
  rangeSequenceID as newID
  , personID
  , min(startDate) as startDate
  , max(endDate) as endDate
from RangesWithSequences
group by 1,2  
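
To see how the sequence IDs behave with more than one person, the same three steps can be run from Python's sqlite3 module. This is a sketch under assumptions: person 200 and its rows are made up for illustration, the helper name `collapse_per_person` is hypothetical, dates are stored as ISO-8601 text, and SQLite 3.25 or later is required for window functions.

```python
import sqlite3

# (tableID, personID, startDate, endDate); person 200 is hypothetical test data.
ROWS = [
    (1, 100, "2011-01-01", "2011-01-31"),
    (2, 100, "2011-02-01", "2011-02-28"),
    (3, 100, "2011-04-01", "2011-04-30"),
    (4, 200, "2011-03-01", "2011-03-15"),
    (5, 200, "2011-03-16", "2011-03-31"),
]

QUERY = """
with RangesWithStart as (
  select *,
         -- step 1: true when no range ends exactly one day before this start
         coalesce(
           date(lag(endDate) over (partition by personID order by startDate),
                '+1 day') <> startDate,
           true
         ) as isRangeStart
  from DateRanges
),
RangesWithSequences as (
  select *,
         -- step 2: number of range starts up to and including the current row
         count(nullif(isRangeStart, false)) over (
           partition by personID order by startDate rows unbounded preceding
         ) as rangeSequenceID
  from RangesWithStart
)
-- step 3: one output row per person and sequence
select rangeSequenceID, personID, min(startDate), max(endDate)
from RangesWithSequences
group by rangeSequenceID, personID
order by personID, rangeSequenceID
"""


def collapse_per_person(rows):
    """Run the three-step query on an in-memory table and return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table DateRanges "
        "(tableID int, personID int, startDate text, endDate text)"
    )
    conn.executemany("insert into DateRanges values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in collapse_per_person(ROWS):
        print(row)
```

Because the count is partitioned by personID, the sequence ID restarts at 1 for each person, so only the pair (personID, sequence ID) identifies a collapsed range uniquely.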