I have this table ("colors") in SQL:
CREATE TABLE colors (
    color1 VARCHAR(50),
    color2 VARCHAR(50),
    year INT,
    var1 INT,
    var2 INT,
    var3 INT,
    var4 INT
);
INSERT INTO colors (color1, color2, year, var1, var2, var3, var4) VALUES
('red', 'blue', 2010, 1, 2, 1, 2),
('blue', 'red', 2010, 1, 2, 1, 2),
('red', 'blue', 2011, 1, 2, 5, 3),
('blue', 'red', 2011, 5, 3, 1, 2),
('orange', NULL, 2010, 5, 9, NULL, NULL),
('green', 'white', 2010, 5, 9, 6, 3);
The table looks like this:
color1  color2  year  var1  var2  var3  var4
red     blue    2010  1     2     1     2
blue    red     2010  1     2     1     2
red     blue    2011  1     2     5     3
blue    red     2011  5     3     1     2
orange  NULL    2010  5     9     NULL  NULL
green   white   2010  5     9     6     3
I am trying to do the following:
The final result should look like this:
color1  color2  year  var1  var2  var3  var4
red     blue    2010  1     2     1     2
red     blue    2011  1     2     5     3
blue    red     2011  5     3     1     2
orange  NULL    2010  5     9     NULL  NULL
green   white   2010  5     9     6     3
Here is the SQL code I wrote for this:
First I write a CTE to identify the pairs, then I check the OR conditions:
WITH pairs AS (
    SELECT *,
        CASE
            WHEN color1 < color2 THEN color1 || color2 || CAST(year AS VARCHAR(4))
            ELSE color2 || color1 || CAST(year AS VARCHAR(4))
        END AS pair_id
    FROM colors
),
ranked_pairs AS (
    SELECT *,
        ROW_NUMBER() OVER(PARTITION BY pair_id ORDER BY color1, color2) as row_num
    FROM pairs
)
SELECT color1, color2, year, var1, var2, var3, var4
FROM ranked_pairs
WHERE row_num = 1 OR var1 != var3 OR var2 != var4;
The output looks like this:
color1  color2  year  var1  var2  var3  var4
orange  <NA>    2010  5     9     NA    NA
blue    red     2010  1     2     1     2
blue    red     2011  5     3     1     2
red     blue    2011  1     2     5     3
green   white   2010  5     9     6     3
Am I doing this correctly? The final result looks right, but I am not confident in it; this code might not work for some edge cases.
Thanks!
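The ordering of the colors used for pair_id seems to be wrong when the same pair appears in a different order. Also, you are effectively treating NULLs as equal: a != comparison never evaluates to true when either side is NULL, so a row that differs only by a NULL is not picked up by the OR conditions.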
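To make the NULL point concrete, here is a minimal sketch that runs the two kinds of comparison in SQLite through Python's standard sqlite3 module. The choice of SQLite and the helper name compare are assumptions made for illustration; SQLite spells the null-safe inequality as IS NOT, which plays the role that IS DISTINCT FROM plays in the query in this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def compare(a, b):
    """Return (a != b, a IS NOT b) as evaluated by SQLite.

    `compare` is a hypothetical helper for this illustration only.
    IS NOT is SQLite's null-safe inequality, used here as a stand-in
    for IS DISTINCT FROM.
    """
    return conn.execute("SELECT ? != ?, ? IS NOT ?", (a, b, a, b)).fetchone()

# A value against NULL: != yields NULL (None in Python), IS NOT yields true.
value_vs_null = compare(5, None)

# NULL against NULL: != still yields NULL, IS NOT yields false.
null_vs_null = compare(None, None)

# Concatenating with NULL also yields NULL, which is what happens to
# pair_id when color2 is NULL.
null_key = conn.execute("SELECT ? || ?", ("orange", None)).fetchone()[0]
```

In a WHERE clause a NULL result is not true, so a row whose only difference involves a NULL is filtered out by != but kept by the null-safe form.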
Please check the following version:
WITH pairs AS (
    SELECT
        color1,
        color2,
        year,
        var1,
        var2,
        var3,
        var4,
        CASE
            WHEN color1 < color2 THEN color1 || color2 || CAST(year AS VARCHAR(4))
            ELSE color2 || color1 || CAST(year AS VARCHAR(4))
        END AS pair_id
    FROM colors
),
ranked_pairs AS (
    SELECT
        color1,
        color2,
        year,
        var1,
        var2,
        var3,
        var4,
        ROW_NUMBER() OVER(PARTITION BY pair_id ORDER BY LEAST(color1, color2), GREATEST(color1, color2)) as row_num
    FROM pairs
)
SELECT color1, color2, year, var1, var2, var3, var4
FROM ranked_pairs
WHERE row_num = 1 OR var1 IS DISTINCT FROM var3 OR var2 IS DISTINCT FROM var4;
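One edge case worth checking is a second row with a NULL color2. Every such row gets a NULL pair_id, and PARTITION BY puts all of them in the same group, so only one of them receives row_num = 1. The sketch below is a rough check of that case, not a definitive test. It assumes SQLite via Python's sqlite3 module (with window function support), uses IS NOT as SQLite's counterpart of IS DISTINCT FROM, keeps the question's ORDER BY color1, color2 so the ranking is deterministic, and adds a hypothetical 'purple' row that is not in the original data.

```python
import sqlite3

ROWS = [
    ("red", "blue", 2010, 1, 2, 1, 2),
    ("blue", "red", 2010, 1, 2, 1, 2),
    ("red", "blue", 2011, 1, 2, 5, 3),
    ("blue", "red", 2011, 5, 3, 1, 2),
    ("orange", None, 2010, 5, 9, None, None),
    ("green", "white", 2010, 5, 9, 6, 3),
    # Hypothetical extra row (not in the original data): a second unpaired color.
    ("purple", None, 2012, 1, 1, None, None),
]

# {op} is filled with either "!=" or the null-safe "IS NOT".
QUERY = """
WITH pairs AS (
    SELECT *,
        CASE
            WHEN color1 < color2 THEN color1 || color2 || CAST(year AS TEXT)
            ELSE color2 || color1 || CAST(year AS TEXT)
        END AS pair_id
    FROM colors
),
ranked_pairs AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY pair_id ORDER BY color1, color2) AS row_num
    FROM pairs
)
SELECT color1, color2, year
FROM ranked_pairs
WHERE row_num = 1 OR var1 {op} var3 OR var2 {op} var4
ORDER BY year, color1
"""

def kept_rows(op):
    """Load the sample rows into a fresh in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE colors (color1 TEXT, color2 TEXT, year INT, "
        "var1 INT, var2 INT, var3 INT, var4 INT)"
    )
    conn.executemany("INSERT INTO colors VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY.format(op=op)).fetchall()
    conn.close()
    return rows

plain = kept_rows("!=")          # the question's comparison
null_safe = kept_rows("IS NOT")  # null-safe comparison

for label, rows in (("!=", plain), ("IS NOT", null_safe)):
    print(label, rows)
```

Under these assumptions the 'purple' row shares the NULL partition with 'orange', gets row_num = 2, and survives only when the comparison is null-safe. A row whose var1 through var4 are all NULL would still be dropped in that situation, so a pair key that cannot be NULL may be the more robust fix.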