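Is there "better code" than what I did with CTEs?
For each id, I want to select the most frequent value between M and F, with the following rule: when M and F are tied, the result is 'inconclusive'.
Here is an example:
Dataset
id | sex |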
---|---|
1 | M |
1 | F |
2 | M |
2 | M |
2 | x |
2 | F |
3 | F |
3 | 0 |
4 | f |
4 | |
5 | d |
Expected result
id | sex |
---|---|
1 | inconclusive |
2 | M |
3 | F |
4 | F |
What I did:
WITH cte1 AS (
    SELECT
        id,
        sex,
        RANK() OVER (PARTITION BY id ORDER BY count(*)) rn
    FROM dataset
    WHERE sex ~* '(F|M)' AND sex IS NOT NULL
    GROUP BY id, sex
),
cte2 AS (
    SELECT id,
           max(rn) AS max
    FROM cte1
    GROUP BY id
),
cte3 AS (
    SELECT cte2.id,
           sex
    FROM cte2
    LEFT JOIN cte1 ON cte2.id = cte1.id AND max = rn
    WHERE cte1.id IS NOT NULL
),
cte4 AS (
    SELECT id,
           count(*) AS cnt
    FROM cte3
    GROUP BY id
)
SELECT DISTINCT cte4.id,
       CASE
           WHEN cnt > 1 THEN 'inconclusive'
           WHEN cnt = 1 AND sex IN ('F', 'M') THEN sex
       END AS sex
FROM cte4
LEFT JOIN cte3 ON cte4.id = cte3.id
To me the code is fine in the sense that it gives the proper result, but it looks a bit heavy and I am looking for improvements. Are there any?
Note: I tried DISTINCT ON (), but it cannot return 'inconclusive' for id 1 (it returns F or M depending on the ordering).
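You seem to be overcomplicating this.
I would first filter out values other than M or F, then aggregate by id, counting how many times each value occurs: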
select id,
       count(*) filter (where sex = 'M') as cnt_m,
       count(*) filter (where sex = 'F') as cnt_f
from dataset
where sex in ('M', 'F')
group by id
I don't think a regex match is needed in the where clause, since you seem to want to keep only the 'F' and 'M' values.
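One caveat, as a hedged aside: the regex in the question uses `~*`, which is case-insensitive, and the sample data contains a lowercase `f` for id 4. If lowercase values are meant to count, a minimal sketch is to normalise with `upper()` before comparing. The inline `dataset` CTE below is a hypothetical stand-in for the real table, holding only a few of the sample rows so the snippet runs on its own.

```sql
-- Hypothetical inline sample standing in for the real "dataset" table
with dataset (id, sex) as (
    values (3, 'F'), (3, '0'), (4, 'f'), (4, null)
)
select id,
       -- upper() makes 'f' and 'F' count as the same value
       count(*) filter (where upper(sex) = 'M') as cnt_m,
       count(*) filter (where upper(sex) = 'F') as cnt_f
from dataset
where upper(sex) in ('M', 'F')   -- NULL and other values are dropped here
group by id
order by id;
```

With this sample, ids 3 and 4 each end up with `cnt_m = 0` and `cnt_f = 1`. If the data is known to contain only uppercase values, the plain `sex in ('M', 'F')` filter is simpler and can use an index on `sex`.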
From there, all we have to do is compare the counts. We can do this in an outer query, so we don't need to repeat the conditional expressions:
select id,
       case when cnt_m > cnt_f then 'M'
            when cnt_m < cnt_f then 'F'
            else 'inconclusive'
       end as res
from (
    select id,
           count(*) filter (where sex = 'M') as cnt_m,
           count(*) filter (where sex = 'F') as cnt_f
    from dataset
    where sex in ('M', 'F')
    group by id
) t
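To try the query without touching a real table, here is a self-contained sketch that feeds it the sample rows from the question through an inline CTE. The CTE named `dataset` and the final `order by` are additions for illustration only; the rest is the query above, unchanged.

```sql
-- Hypothetical inline copy of the question's sample rows
with dataset (id, sex) as (
    values (1, 'M'), (1, 'F'),
           (2, 'M'), (2, 'M'), (2, 'x'), (2, 'F'),
           (3, 'F'), (3, '0'),
           (4, 'f'), (4, null),
           (5, 'd')
)
select id,
       case when cnt_m > cnt_f then 'M'
            when cnt_m < cnt_f then 'F'
            else 'inconclusive'   -- equal counts of M and F
       end as res
from (
    select id,
           count(*) filter (where sex = 'M') as cnt_m,
           count(*) filter (where sex = 'F') as cnt_f
    from dataset
    where sex in ('M', 'F')       -- exact, case-sensitive match
    group by id
) t
order by id;
```

On these rows it returns `inconclusive` for id 1, `M` for id 2 and `F` for id 3. Id 5 has no M or F rows, so it is filtered out, which matches the expected result. Id 4 is also filtered out as written, because its only non-null value is a lowercase `f`; it would only appear if the comparison were made case-insensitive.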