I'm working with a Postgres table named orders that looks like this:
user_id  product  order_date
1        pants    7/1/2022
2        shirt    6/1/2022
1        socks    3/17/2023
3        pants    2/17/2023
4        shirt    3/13/2023
2        pants    8/15/2022
1        hat      4/15/2022
5        hat      3/14/2023
2        socks    12/3/2022
3        shirt    4/15/2023
4        socks    1/15/2023
4        pants    4/19/2023
5        shirt    5/2/2023
5        belt     5/15/2023
Here is a DB Fiddle with the data: https://www.db-fiddle.com/f/uNGjP7gpKwdPGrJ7XmT7k3/2
I output a table that shows the sequence of each customer's orders:
user_id  first_order  second_order  third_order
1        hat          pants         socks
2        shirt        pants         socks
3        pants        shirt         <null>
4        socks        shirt         pants
5        hat          shirt         belt
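So, for example, customer 1 bought a hat first, then pants, and finally socks.
I'd like to set some kind of indicator at the row level that tells me whether a given customer bought one product before another. For example, I want to flag whether the customer bought a shirt before buying pants.
The desired output looks like this: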
user_id  first_order  second_order  third_order  shirt_before_pants
1        hat          pants         socks        false
2        shirt        pants         socks        true
3        pants        shirt         <null>       false
4        socks        shirt         pants        true
5        hat          shirt         belt         false
Is there a way to get the relative position of given values at the row level?
Thanks for your help, -Rachel
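We can use row_number() to enumerate each customer's orders, then use conditional aggregation to generate the new columns. To check whether one product was bought before another, we can compare the earliest order dates of the two products: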
select user_id,
       max(product) filter(where rn = 1) product_1,
       max(product) filter(where rn = 2) product_2,
       max(product) filter(where rn = 3) product_3,
       -- null (not false) if the user never bought one of the two products
       (
           min(order_date) filter(where product = 'shirt')
           < min(order_date) filter(where product = 'pants')
       ) shirt_before_pants
from (
    -- number each user's orders chronologically
    select o.*, row_number() over(partition by user_id order by order_date) rn
    from orders o
) o
group by user_id
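One caveat with the comparison in the query above: when a customer never bought one of the two products, the corresponding min() is NULL and so is the comparison, whereas the desired output shows false. Wrapping the comparison in coalesce() is one way to close that gap. Here is a minimal sketch of the difference, assuming Python's built-in sqlite3 module as a stand-in for Postgres (only so it runs without a server), dates rewritten as ISO strings, and CASE in place of FILTER for portability. SQLite reports booleans as 1/0 and NULL as None.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings
# so that plain string comparison sorts them chronologically.
ORDERS = [
    (1, "pants", "2022-07-01"), (2, "shirt", "2022-06-01"),
    (1, "socks", "2023-03-17"), (3, "pants", "2023-02-17"),
    (4, "shirt", "2023-03-13"), (2, "pants", "2022-08-15"),
    (1, "hat", "2022-04-15"), (5, "hat", "2023-03-14"),
    (2, "socks", "2022-12-03"), (3, "shirt", "2023-04-15"),
    (4, "socks", "2023-01-15"), (4, "pants", "2023-04-19"),
    (5, "shirt", "2023-05-02"), (5, "belt", "2023-05-15"),
]

QUERY = """
select user_id,
       -- raw comparison: NULL when either product is missing
       min(case when product = 'shirt' then order_date end)
         < min(case when product = 'pants' then order_date end) as raw_flag,
       -- same comparison with NULL mapped to false (0 in SQLite)
       coalesce(
         min(case when product = 'shirt' then order_date end)
           < min(case when product = 'pants' then order_date end),
         0) as flag_with_default
from orders
group by user_id
order by user_id
"""


def shirt_before_pants_flags():
    """Load the sample data into an in-memory database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table orders (user_id integer, product text, order_date text)"
    )
    conn.executemany("insert into orders values (?, ?, ?)", ORDERS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


flags = shirt_before_pants_flags()
for user_id, raw_flag, flag_with_default in flags:
    print(user_id, raw_flag, flag_with_default)
```

In Postgres the equivalent would be coalesce(<comparison>, false) around the parenthesized expression.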
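This approach uses the window function ROW_NUMBER (DENSE_RANK would work too) to assign a row number to every row within each user_id partition. To determine whether the shirt was bought before the pants, we can compare the row numbers generated for those products: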
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS rn
    FROM orders
)
SELECT user_id,
       MAX(CASE WHEN rn = 1 THEN product END) AS first_order,
       MAX(CASE WHEN rn = 2 THEN product END) AS second_order,
       MAX(CASE WHEN rn = 3 THEN product END) AS third_order,
       MAX(CASE WHEN product = 'shirt' THEN rn END)
         < MAX(CASE WHEN product = 'pants' THEN rn END) AS shirt_before_pants
FROM cte
GROUP BY user_id;
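Note that comparing MAX(rn) looks at the last purchase of each product, while comparing MIN(rn) looks at the first. In the sample data every customer buys each product at most once, so both give the same answer, but they can diverge with repeat purchases. A small sketch of that difference, assuming hypothetical data (user 9 is not in the question) and Python's built-in sqlite3 module as a stand-in for Postgres; it needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Hypothetical data (not from the question): one user buys pants twice,
# once before and once after the shirt.
REPEAT_ORDERS = [
    (9, "pants", "2023-01-01"),
    (9, "shirt", "2023-02-01"),
    (9, "pants", "2023-03-01"),
]

QUERY = """
with cte as (
    select *, row_number() over (partition by user_id order by order_date) as rn
    from orders
)
select user_id,
       -- last shirt vs. last pants purchase
       max(case when product = 'shirt' then rn end)
         < max(case when product = 'pants' then rn end) as by_last_purchase,
       -- first shirt vs. first pants purchase
       min(case when product = 'shirt' then rn end)
         < min(case when product = 'pants' then rn end) as by_first_purchase
from cte
group by user_id
"""

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (user_id integer, product text, order_date text)")
conn.executemany("insert into orders values (?, ?, ?)", REPEAT_ORDERS)
result = conn.execute(QUERY).fetchone()
conn.close()

# Columns: user_id, by_last_purchase, by_first_purchase (1 = true, 0 = false)
print(result)
```

Which of the two is right depends on what "before" should mean for repeat buyers; the question does not say.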
If ...
SELECT o.*
FROM   users u
CROSS  JOIN LATERAL (
   SELECT o.user_id
        , array_agg(o.product) AS products
        , bool_or(o.combo) AS shirt_before_pants
   FROM  (
      SELECT o.user_id, o.product::text
           , o.product = 'pants' AND lag(o.product) OVER (ORDER BY o.order_date) = 'shirt' AS combo
      FROM   orders o
      WHERE  o.user_id = u.user_id
      ORDER  BY o.order_date
      LIMIT  3  -- cutoff
      ) o
   GROUP  BY 1
   ) o
ORDER  BY u.user_id;
The beauty of it: to cover a different number of orders, just change the LIMIT. And 'pants' and 'shirt' only need to be changed in one place.
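Unlike the date comparisons in the other answers, the lag() test only fires when 'pants' comes directly after 'shirt' within the cutoff. To make that rule concrete, here is a rough plain-Python mirror of the logic, run on the question's sample data with dates rewritten as ISO strings. The function name adjacent_combo and its parameters are made up for illustration, and the sketch returns False in edge cases where bool_or would return NULL (for example a user whose only order is pants).

```python
from collections import defaultdict

# Sample rows from the question, dates rewritten as ISO strings.
ORDERS = [
    (1, "pants", "2022-07-01"), (2, "shirt", "2022-06-01"),
    (1, "socks", "2023-03-17"), (3, "pants", "2023-02-17"),
    (4, "shirt", "2023-03-13"), (2, "pants", "2022-08-15"),
    (1, "hat", "2022-04-15"), (5, "hat", "2023-03-14"),
    (2, "socks", "2022-12-03"), (3, "shirt", "2023-04-15"),
    (4, "socks", "2023-01-15"), (4, "pants", "2023-04-19"),
    (5, "shirt", "2023-05-02"), (5, "belt", "2023-05-15"),
]


def adjacent_combo(orders, first="shirt", then="pants", limit=3):
    """For each user, report whether `then` directly follows `first`
    among that user's earliest `limit` orders."""
    by_user = defaultdict(list)
    for user_id, product, order_date in orders:
        by_user[user_id].append((order_date, product))

    result = {}
    for user_id, rows in sorted(by_user.items()):
        # Sort chronologically, then keep only the first `limit` products.
        products = [product for _, product in sorted(rows)][:limit]
        # Pair each product with its predecessor, like lag() does.
        combo = any(
            prev == first and cur == then
            for prev, cur in zip(products, products[1:])
        )
        result[user_id] = (products, combo)
    return result


combos = adjacent_combo(ORDERS)
for user_id, (products, combo) in combos.items():
    print(user_id, products, combo)
```

With the sample data this agrees with the desired shirt_before_pants column, because the only customers who bought both products in that order bought them back to back.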
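Because the subquery is sorted, the products in the output array are sorted as well. See:
If you have an index on orders(user_id, order_date), or better, orders(user_id, order_date) INCLUDE (product), the query performs well on big tables with many orders per user.
If you don't have a users table (you should have one), create it like this: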
CREATE TABLE users AS
SELECT DISTINCT user_id
FROM orders
ORDER BY user_id; -- optional
Or read here for a faster way:
The array_position function might be helpful here:
WITH
first_orders AS (
    SELECT "user_id", "product", MIN("order_date") AS "order_date"
    FROM "orders"
    GROUP BY "user_id", "product"),
product_arrays AS (
    SELECT "user_id",
           array_agg(product ORDER BY order_date) AS "products"
    FROM first_orders
    GROUP BY "user_id")
SELECT *
FROM product_arrays
WHERE array_position(products, 'shirt')
    < array_position(products, 'pants')
Or the following works just as well:
WITH
product_arrays AS (
    SELECT "user_id",
           array_agg(product ORDER BY order_date) AS "products"
    FROM orders
    GROUP BY "user_id")
SELECT *
FROM product_arrays
WHERE array_position(products, 'shirt')
    < array_position(products, 'pants')
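The shorter query can skip the first_orders step because array_position returns the position of the first match, so repeat purchases later in the array do not change which product appears first. When a product is absent it returns NULL, the comparison is NULL, and the WHERE clause drops the row. A small plain-Python sketch of that behaviour follows; the helper names position and shirt_before_pants and the repeat-purchase history are hypothetical.

```python
def position(products, item):
    """1-based index of the first match, or None if absent
    (mirrors how array_position behaves)."""
    try:
        return products.index(item) + 1
    except ValueError:
        return None


def shirt_before_pants(products):
    """Compare first positions; None stands in for SQL NULL."""
    shirt = position(products, "shirt")
    pants = position(products, "pants")
    if shirt is None or pants is None:
        return None  # comparing with NULL yields NULL in SQL
    return shirt < pants


# Hypothetical history with repeat purchases, in chronological order.
all_orders = ["pants", "shirt", "pants", "shirt"]
# The same history reduced to the first purchase of each product.
first_orders = ["pants", "shirt"]

print(shirt_before_pants(all_orders), shirt_before_pants(first_orders))
print(shirt_before_pants(["hat", "shirt", "belt"]))  # customer 5 in the question
```

Both variants return only the customers for whom the condition holds. To get a flag column for every customer instead, the comparison could be moved from the WHERE clause into the SELECT list.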