Select the N most recent rows per school, but skip duplicate rows for the same student


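In a Postgres database, I have a view named school_id_cards. The view captures a list of ID cards for every school and student in a school system. ID cards are periodically reissued to students, so each student may have any number of ID cards. Each card_id is unique. A student may belong to more than one school. A sample of records looks like this: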

card_id school_id student_id
1 1 1
2 1 1
3 1 2
4 1 2
5 1 3
6 1 3
7 1 4
8 1 4
9 2 5
10 2 5
11 2 6
12 3 7
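
To try the queries in this thread against the sample rows, something like the following setup could be used. This is only a sketch: the question describes school_id_cards as a view, so the plain table and the integer column types here are assumptions made for illustration.

```sql
-- Assumption: a plain table stands in for the school_id_cards view.
CREATE TABLE school_id_cards (
  card_id    int PRIMARY KEY,  -- each card_id is unique
  school_id  int NOT NULL,
  student_id int NOT NULL
);

-- The twelve sample rows from the question.
INSERT INTO school_id_cards (card_id, school_id, student_id) VALUES
  (1, 1, 1), (2, 1, 1),
  (3, 1, 2), (4, 1, 2),
  (5, 1, 3), (6, 1, 3),
  (7, 1, 4), (8, 1, 4),
  (9, 2, 5), (10, 2, 5),
  (11, 2, 6),
  (12, 3, 7);
```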

Given a list of school_ids, I want to retrieve the most recently created ID card for each school + student, limited to a chosen number of students per school_id.

I have the following query, which gets what I need without any limit:

SELECT card_id FROM school_id_cards
WHERE card_id IN (
  SELECT MAX(card_id) FROM school_id_cards
  WHERE
    school_id in (1, 2, 3)
  GROUP BY
    school_id,
    student_id
);

...which, for the sample above, returns

(2,4,6,8,10,11,12)

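However, in my subquery, I want to limit the number of school + student records returned for each school_id listed in the where clause. For example, with a limit of 2, I want at most the 2 most recently added students for school 1, at most the 2 most recently added students for school 2, and at most the 2 most recently added students for school 3. In this case, the final result is
(6,8,10,11,12)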

Is there a single query that can accomplish this?

sql postgresql greatest-n-per-group distinct-on
2 Answers

0 votes

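You can do this by building up the query from subqueries and dense_rank.

First, get the latest card for each student/school. (Note: IDs are not a good substitute for time ordering. Add a datetime column.)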

select 
  *,
  dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
from school_id_cards

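Next, we use that as a subquery to rank, within each school, the latest cards issued to its students. The latest card for each student has student_card_order = 1.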

with ordered_student_cards as (
  select 
    *,
    dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
  from school_id_cards
)
select
  *,
  dense_rank() over(partition by school_id order by card_id desc) as school_card_order
from ordered_student_cards
where student_card_order = 1

Finally, we keep only the first two for each school by filtering on school_card_order <= 2:

with ordered_student_cards as (
  select 
    *,
    dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
  from school_id_cards
), ordered_school_cards as (
  select
    *,
    dense_rank() over(partition by school_id order by card_id desc) as school_card_order
  from ordered_student_cards
  where student_card_order = 1
)
select card_id
from ordered_school_cards
where school_card_order <= 2;

Demonstration.

There may be more compact or more efficient ways to do this, but window functions and subqueries are one way to break down a complex query.
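
As one possible more compact variant (a sketch, not part of the original answer), the question's own GROUP BY can be combined with a single window function. It assumes, like the rest of the thread, that a higher card_id means a more recent card; the names latest and rn are made up for this example, and row_number() is swapped in for dense_rank(), which is equivalent here because card_id is unique.

```sql
SELECT card_id
FROM (
  SELECT max(card_id) AS card_id,
         -- Window functions run after GROUP BY, so they can rank the
         -- per-student maxima within each school.
         row_number() OVER (PARTITION BY school_id
                            ORDER BY max(card_id) DESC) AS rn
  FROM   school_id_cards
  WHERE  school_id IN (1, 2, 3)
  GROUP  BY school_id, student_id
) latest
WHERE  rn <= 2   -- limit per school
ORDER  BY card_id;
```

For the sample rows in the question this should return 6, 8, 10, 11 and 12, matching the expected result.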


0 votes

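If your table is big, you want to avoid expensive sequential scans over the whole table. Use a smart query that picks the qualifying rows with index(-only) scans from a matching index. That is much faster.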

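Typically, there should be some kind of "school" table in your database, with exactly one row per relevant school. That makes the query simpler and faster: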

WITH RECURSIVE latest_card AS (
   SELECT c.*
   FROM   school s
   CROSS  JOIN LATERAL (
      SELECT c.school_id, c.card_id, ARRAY[c.student_id] AS leading_ids
      FROM   school_id_cards c
      WHERE  c.school_id = s.school_id
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) c

   UNION ALL
   SELECT c.*
   FROM   latest_card l
   JOIN   LATERAL (
      SELECT l.school_id, c.card_id, l.leading_ids || student_id
      FROM   school_id_cards c
      WHERE  c.school_id = l.school_id
      AND    c.card_id < l.card_id
      AND    c.student_id <> ALL (l.leading_ids)
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) C ON cardinality(l.leading_ids) < 2  -- your limit per school here!
   )
SELECT card_id
FROM   latest_card
ORDER  BY card_id;

fiddle

This scales well for a small limit per school, like the one you demonstrated. For a larger limit, I would switch to a different query.

About the use of a recursive CTE (rCTE):

Be sure to have a matching index, such as:

CREATE INDEX ON school_id_cards (school_id DESC, card_id DESC);

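A simpler index with the (default) ascending sort order is barely any worse. Postgres can scan B-tree indexes backwards. Only opposing sort orders would be less ideal.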
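
To make the index remark concrete, here is a rough sketch of the alternatives. It assumes school_id_cards is a table: Postgres cannot index a plain view, so if it really is a view, as the question says, the index would have to go on the underlying table instead.

```sql
-- Simpler alternative: default ascending order on both columns.
-- A backward scan of this index yields rows in
-- (school_id DESC, card_id DESC) order, which is what the queries need.
CREATE INDEX ON school_id_cards (school_id, card_id);

-- Less ideal: the two columns sort in opposing directions, so neither a
-- forward nor a backward scan matches
-- ORDER BY school_id DESC, card_id DESC.
-- CREATE INDEX ON school_id_cards (school_id, card_id DESC);
```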

If there is no school table:

WITH RECURSIVE latest_card AS (
   (
   SELECT DISTINCT ON (school_id)
          school_id, card_id, ARRAY[student_id] AS leading_ids
   FROM   school_id_cards c
   ORDER  BY school_id DESC, card_id DESC
   )

   UNION ALL
   SELECT c.*
   FROM   latest_card l
   JOIN   LATERAL (
      SELECT l.school_id, c.card_id, l.leading_ids || student_id
      FROM   school_id_cards c
      WHERE  c.school_id = l.school_id
      AND    c.card_id < l.card_id
      AND    c.student_id <> ALL (l.leading_ids)
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) C ON cardinality(l.leading_ids) < 2  -- your limit per school here!
   )
SELECT card_id
FROM   latest_card
ORDER  BY card_id;

About DISTINCT ON:
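
As a minimal illustration of DISTINCT ON (a sketch that is not part of the original answer), the question's unlimited query could be written like this: for each (school_id, student_id) combination, Postgres keeps only the first row in ORDER BY order, which is the row with the highest card_id.

```sql
SELECT DISTINCT ON (school_id, student_id)
       card_id
FROM   school_id_cards
WHERE  school_id IN (1, 2, 3)
-- The leading ORDER BY expressions must match the DISTINCT ON list;
-- card_id DESC then decides which row is kept per group.
ORDER  BY school_id, student_id, card_id DESC;
```

For the sample rows this should return 2, 4, 6, 8, 10, 11 and 12, the same set as the question's GROUP BY query.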

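You could replace the non-recursive term with another, nested rCTE to generate the list of schools (possibly together with the latest card, to kick things off) ...
But there really should be a school table. If you don't have one, create it.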
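
If no such table exists yet, a minimal one could be derived from the existing data, roughly as follows. The single-column layout is an assumption made for illustration; a real school table would normally carry more attributes and be maintained independently of the cards.

```sql
-- One row per school that currently appears in school_id_cards.
CREATE TABLE school AS
SELECT DISTINCT school_id
FROM   school_id_cards;

-- Enforce exactly one row per school.
ALTER TABLE school ADD PRIMARY KEY (school_id);
```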
