Select the most frequent value for each id based on conditions


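Is there "better code" than what I wrote with CTEs?

For each id, I want to select the most frequent value between M and F, using the following rules:

  • sex values that do not match ilike 'F' or 'M' are ignored in the frequency calculation
  • if the most-frequent calculation does not produce a single winner, the result is 'inconclusive'
  • if the most-frequent calculation succeeds, the result is that sex value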

Here is an example:

Dataset

id sex
1
1 F
2
2
2 x
2 F
3 F
3 0
4 f
4
5 d

Expected result

id sex
1 inconclusive
2
3 F
4 F

What I did:

WITH cte1 AS (
        SELECT id,
               sex,
               RANK() OVER (PARTITION BY id ORDER BY count(*)) rn
        FROM dataset
        WHERE sex ~* '(F|M)' AND sex IS NOT NULL
        GROUP BY id, sex
        ),
    cte2 AS (
        SELECT id,
               max(rn) AS max
        FROM cte1
        GROUP BY id
        ),
    cte3 AS (
        SELECT cte2.id,
               sex
        FROM cte2
        LEFT JOIN cte1 ON cte2.id=cte1.id AND max=rn
        WHERE cte1.id IS NOT NULL
        ),
    cte4 AS (
        SELECT id,
               count(*) as cnt
        FROM cte3
        GROUP BY id
        )
SELECT DISTINCT cte4.id,
       CASE
           WHEN cnt>1 THEN 'inconclusive'
           WHEN cnt=1 AND SEX IN ('F', 'M') THEN sex
       END AS sex
FROM cte4
LEFT JOIN cte3 ON cte4.id=cte3.id

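To me, the code is efficient in the sense that it returns the proper results, but it looks a bit bulky and I am looking for improvements. Are there any?

Note: I tried DISTINCT ON (), but it cannot return id 1 = inconclusive (it returns F or M, depending on the ordering).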
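
For illustration, a DISTINCT ON attempt might look like the sketch below. This is not the asker's exact query (it was not shown); it assumes the same dataset(id, sex) table and normalizes case with upper(), which is an added assumption. Because DISTINCT ON keeps exactly one row per id, a tie between F and M is silently resolved by the ORDER BY instead of being reported.

```sql
-- Sketch only: one row per id survives, so ties are broken by the sort order.
SELECT DISTINCT ON (id)
       id,
       upper(sex) AS sex,
       count(*)   AS cnt
FROM dataset
WHERE upper(sex) IN ('F', 'M')
GROUP BY id, upper(sex)
-- The most frequent value comes first; on equal counts the last sort key decides.
ORDER BY id, count(*) DESC, upper(sex);
```

With equal counts of F and M for an id, this particular ordering would return F, and reversing the last sort key would return M.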

sql postgresql subquery aggregate-functions common-table-expression
1 Answer

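You seem to be overcomplicating this.

I would first filter out values other than M or F, then aggregate by id, counting how many times each value occurs: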

select id, 
    count(*) filter(where sex = 'M') cnt_m,
    count(*) filter(where sex = 'F') cnt_f
from dataset
where sex in ('M', 'F')
group by id

I don't think you need regular-expression matching in the where clause, since you seem to want to keep only the 'F' and 'M' values.
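
If lowercase values such as 'f' should also count (the question's rules mention ilike, and its sample data contains 'f'), one option is to normalize the value with upper() before comparing. A minimal sketch, assuming the same dataset(id, sex) table:

```sql
-- Variant sketch: treat 'f'/'m' the same as 'F'/'M' by upper-casing first.
select id,
    count(*) filter(where upper(sex) = 'M') cnt_m,
    count(*) filter(where upper(sex) = 'F') cnt_f
from dataset
where upper(sex) in ('M', 'F')
group by id
```

Unlike the pattern sex ~* '(F|M)', which matches any string that merely contains an F or an M, this keeps only values that are exactly F or M after case folding.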

From there, all we have to do is compare the counts. We can do this in an outer query, so we don't need to repeat the conditional expressions:

select id, 
    case when cnt_m > cnt_f then 'M'
         when cnt_m < cnt_f then 'F'
         else 'inconclusive'
    end as res
from (
    select id, 
        count(*) filter(where sex = 'M') cnt_m,
        count(*) filter(where sex = 'F') cnt_f
    from dataset
    where sex in ('M', 'F')
    group by id
) t
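
To try the query without creating a table, the data can be supplied inline. The rows below are hypothetical, made up for this sketch rather than taken from the question's dataset; the CTE is named dataset only so that the query body stays unchanged.

```sql
-- Hypothetical sample rows (an assumption for illustration only).
with dataset (id, sex) as (
    values (1, 'M'), (1, 'F'),   -- one of each: a tie
           (2, 'F'), (2, 'x'),   -- 'x' is ignored by the where clause
           (3, 'F'), (3, '0'),
           (5, 'd')              -- no M or F value at all
)
select id,
    case when cnt_m > cnt_f then 'M'
         when cnt_m < cnt_f then 'F'
         else 'inconclusive'
    end as res
from (
    select id,
        count(*) filter(where sex = 'M') cnt_m,
        count(*) filter(where sex = 'F') cnt_f
    from dataset
    where sex in ('M', 'F')
    group by id
) t
order by id;
-- id 1 -> inconclusive, id 2 -> F, id 3 -> F; id 5 does not appear
```

Note that an id with no 'M' or 'F' rows (id 5 here) is removed by the where clause, so it is absent from the result instead of being reported as 'inconclusive'.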