I'm trying to find the first 3 timestamps for each customer.
Sample data for table customer_orders:
customer_id | timestamp |
---|---|
6778 | '2022-01-01' |
6778 | '2022-02-05' |
5544 | '2022-04-01' |
6778 | '2022-02-04' |
5544 | '2022-04-03' |
5544 | '2022-04-02' |
5544 | '2022-01-01' |
6778 | '2021-01-01' |
Desired result:
customer_id | timestamp |
---|---|
5544 | '2022-01-01' |
5544 | '2022-04-01' |
5544 | '2022-04-02' |
6778 | '2021-01-01' |
6778 | '2022-01-01' |
6778 | '2022-02-04' |
My query so far:
SELECT
customer_id,
timestamp
FROM customer_orders
GROUP BY customer_id, timestamp
ORDER BY timestamp ASC
LIMIT 3
LIMIT 3 limits the result to 3 rows in total, but I want 3 rows per customer.
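Use the window function ROW_NUMBER() in a CTE to number the rows within each PARTITION (customer_id in your case), then filter on that generated column in the outer query to keep only the first n rows: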
WITH j AS (
  SELECT customer_id, timestamp,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY timestamp
                            RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS n
  FROM customer_orders
)
SELECT customer_id, timestamp
FROM j
WHERE n <= 3
ORDER BY customer_id, timestamp;
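To try the query against the sample data from the question, a minimal setup might look like the sketch below. The column types (int and date) are assumptions; the question does not state them.

```sql
-- Assumed column types; the question only shows the values.
CREATE TABLE customer_orders (
   customer_id int  NOT NULL
 , timestamp   date NOT NULL
);

-- Sample rows copied from the question.
INSERT INTO customer_orders (customer_id, timestamp) VALUES
   (6778, '2022-01-01')
 , (6778, '2022-02-05')
 , (5544, '2022-04-01')
 , (6778, '2022-02-04')
 , (5544, '2022-04-03')
 , (5544, '2022-04-02')
 , (5544, '2022-01-01')
 , (6778, '2021-01-01');
```

With this data, each customer has four rows, so the n <= 3 filter drops exactly the latest timestamp per customer.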
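@Jim provided a working solution, but there are some (not so) subtle performance details.
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default window frame, so spelling it out is just noise. The manual:
"The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."
That alone would not be worth another answer. But since row_number() operates on rows by definition, it is more efficient to use ROWS mode: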
SELECT customer_id, timestamp
FROM  (
   SELECT row_number() OVER (PARTITION BY customer_id ORDER BY timestamp
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rn
        , customer_id, timestamp
   FROM   customer_orders
   ) sub
WHERE  rn <= 3
ORDER  BY 1, 2;
Either query will be much faster with an index(-only) scan on this index:
CREATE UNIQUE INDEX ON customer_orders (customer_id, timestamp);
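Whether the planner actually picks an index-only scan depends on table statistics and the visibility map, so it may be worth checking the plan. A rough sketch, assuming the table and index from above; the exact plan you get will vary with data size and settings:

```sql
-- Update the visibility map and planner statistics first;
-- index-only scans need pages to be marked all-visible.
VACUUM (ANALYZE) customer_orders;

-- Look for an "Index Only Scan" node, ideally with "Heap Fetches: 0".
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, timestamp
FROM  (
   SELECT customer_id, timestamp
        , row_number() OVER (PARTITION BY customer_id ORDER BY timestamp
                             ROWS UNBOUNDED PRECEDING) AS rn
   FROM   customer_orders
   ) sub
WHERE  rn <= 3;
```

On a tiny table such as the sample data, the planner will most likely prefer a sequential scan regardless of the index.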
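I ran extensive tests on Postgres 13 and 14 comparing ROWS and RANGE mode. ROWS was consistently about 20% faster with index-only scans, and about 5 - 10% faster without an index. (Higher fixed costs lower the percentage.) That is quite a revelation for one of the most commonly used window functions! :)
I reported the issue, and the upcoming Postgres 16 includes a fix!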
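That said, if your table is big and there are many rows per customer, a different query style will be much faster. I'm talking orders of magnitude. We need the same index as above.
Ideally, you have a table customers with exactly one row per relevant customer_id. If you don't have it, create it. Then: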
SELECT c.customer_id, o.timestamp
FROM   customers c
CROSS  JOIN LATERAL (
   SELECT timestamp
   FROM   customer_orders o
   WHERE  o.customer_id = c.customer_id
   ORDER  BY o.timestamp
   LIMIT  3
   ) o
ORDER  BY 1, 2;