I have the following dataset:
DROP TABLE IF EXISTS #df
CREATE TABLE #df
(
PTID VARCHAR(10),
HospitalID VARCHAR(5),
Procedure_Dt date,
Check_In_Dt DATE,
);
INSERT INTO #df (PTID, HospitalID, Procedure_Dt, Check_In_Dt)
VALUES
('X0001', 'WY', '2021-07-25', '2021-07-23'),
('X0001', 'WY', '2021-07-25', '2021-10-24'),
('X0001', 'WY', '2021-07-25', '2021-10-27'),
('X0001', 'WY', '2021-07-25', '2021-06-24'),
('X0001', 'WY', '2021-07-25', '2022-06-10'),
('X0002', 'CA', '2022-08-25', '2022-08-26'),
('X0002', 'CA', '2022-08-25', '2022-08-27'),
('X0002', 'CA', '2022-08-25', '2022-08-29'),
('X0002', 'CA', '2022-08-25', '2022-09-22'),
('X0003', 'AL', '2023-02-02', NULL)
--SELECT * FROM #df
DROP TABLE IF EXISTS #df_datediff
;WITH CTE_datediff AS --Using only most recent quarter and year
(
SELECT PTID
, HospitalID
, Procedure_Dt
, Check_In_Dt
FROM #df
)
SELECT DISTINCT a.PTID
, HospitalID
, Procedure_Dt
, Check_In_Dt
, DATEDIFF(dd, CAST(Check_In_Dt AS DATE), Procedure_Dt) AS Date_Diff
INTO #df_datediff
FROM CTE_datediff a
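I would like to select the Check_In_Date closest to the procedure date. However, this gets complicated because some check-in dates fall after the procedure date and some fall before it.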
Ultimately, I want the final dataset below:
DROP TABLE IF EXISTS #df_final
CREATE TABLE #df_final
(
PTID VARCHAR(10),
HospitalID VARCHAR(5),
Procedure_Dt date,
Check_In_Dt DATE,
Date_Diff smallint
);
INSERT INTO #df_final (PTID, HospitalID, Procedure_Dt, Check_In_Dt, Date_Diff)
VALUES
('X0001', 'WY', '2021-07-25', '2021-07-23', 2),
('X0002', 'CA', '2022-08-25', '2022-08-26', -1),
('X0003', 'AL', '2023-02-02', NULL, NULL)
I tried to do this with the following code:
SELECT a.PTID, HospitalID
, Procedure_Dt
, Check_In_Dt
, a.Date_Diff
FROM #df_datediff a
JOIN (SELECT PTID, MIN(Check_In_Dt) AS Check_In_Date FROM #df_datediff GROUP BY PTID) B
ON a.PTID = B.PTID
AND a.Check_In_Dt = B.Check_In_Date
UNION /*Since using MIN in the above query removes Null Facesheets, we use this union to include the null facesheet accesses*/
SELECT a.PTID, HospitalID
, Procedure_Dt
, Check_In_Dt
, a.Date_Diff
FROM #df_datediff a
WHERE Check_In_Dt IS NULL;
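The problem is that this selects the check-in date '2021-06-24' for PTID X0001, while for PTID X0002 it selects the correct smallest negative value, '2022-08-26'. For X0001, '2021-07-23' should be selected.
My goal is to keep check-in dates that fall 0-40 days before the procedure as the numerator. No other check-in dates should be counted in the numerator.
Any tips would be appreciated.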
This is pretty much a top-n-per-group type of query. Try the following:
select Ptid, HospitalId, Procedure_Dt, Check_In_Dt
from (
select * ,
Row_Number()
over(partition by ptid, HospitalId
order by Abs(DateDiff(day, Procedure_Dt, Check_In_Dt))
) rn
from #df
)t
where rn = 1 and Check_In_Dt is not null;
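Note that the filter on Check_In_Dt being not null drops X0003, which the desired #df_final output keeps, and the query does not return Date_Diff. A possible variant is sketched below, assuming the #df temp table from the question. It sorts NULL check-ins last instead of filtering them out, and computes Date_Diff with the question's sign convention (positive when the check-in precedes the procedure). The final tie-break on Check_In_Dt is my own assumption for the case where two check-ins are equally close; the question does not say which one should win.

```sql
-- Sketch: closest check-in per patient, keeping patients that have no check-in at all.
SELECT PTID, HospitalID, Procedure_Dt, Check_In_Dt, Date_Diff
FROM (
    SELECT PTID, HospitalID, Procedure_Dt, Check_In_Dt,
           -- Same sign convention as the question: positive = check-in before the procedure.
           DATEDIFF(day, Check_In_Dt, Procedure_Dt) AS Date_Diff,
           ROW_NUMBER() OVER (
               PARTITION BY PTID, HospitalID
               ORDER BY
                   -- Push NULL check-ins to the end so they only win when nothing else exists.
                   CASE WHEN Check_In_Dt IS NULL THEN 1 ELSE 0 END,
                   -- Smallest absolute distance to the procedure date first.
                   ABS(DATEDIFF(day, Check_In_Dt, Procedure_Dt)),
                   -- Assumed tie-break: the earlier check-in wins on equal distance.
                   Check_In_Dt
           ) AS rn
    FROM #df
) AS t
WHERE rn = 1;
```

Against the sample data this should return one row per patient: '2021-07-23' with Date_Diff 2 for X0001, '2022-08-26' with Date_Diff -1 for X0002, and NULL, NULL for X0003. If only check-ins 0-40 days before the procedure should be candidates, adding a condition such as DATEDIFF(day, Check_In_Dt, Procedure_Dt) BETWEEN 0 AND 40 to the inner query would restrict them, at the cost of dropping patients who have no qualifying check-in.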