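I am building a search feature for my website and am struggling to get reasonable performance out of PostgreSQL queries that look simple on the surface.
Suppose I have two tables:
orders (id, order_name, buyer, status), with 10M records
products_purchased (id, order_id, product_reference), with 20M rows
Say I now want the first 20 orders that contain a specific product: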
SELECT
orders.id
FROM
orders
INNER JOIN products_purchased ON products_purchased.order_id = orders.id
WHERE products_purchased.product_reference = 7
ORDER BY orders.id ASC
LIMIT 20
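This query takes between 5 and 120 seconds, even though I have all the appropriate indexes in place.
The slowdown seems to be caused by the ORDER BY clause.
The problem gets worse as I add more products. For example, suppose I add a new product_reference a year from now. Fetching the first 20 orders for it could take even longer, because almost the whole table has to be scanned before the first matching orders are found.
What is the best practice for running this kind of search on a large data set?
Thank you very much for your help.
Additional data --------
I have indexes on the columns that need them:
orders.id
products_purchased.order_id
products_purchased.product_reference
Actual table sizes:
orders: 16M
products_purchased: 20M
For example, selecting all orders with product_reference = 2000 takes 120 seconds, even though only 46,000 rows in products_purchased have product_reference = 2000.
The execution plan is as follows: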
[
{
"Plan": {
"Node Type": "Limit",
"Parallel Aware": false,
"Startup Cost": 0.87,
"Total Cost": 9846.45,
"Plan Rows": 20,
"Plan Width": 4,
"Actual Startup Time": 59750.428,
"Actual Total Time": 77196.124,
"Actual Rows": 20,
"Actual Loops": 1,
"Plans": [
{
"Node Type": "Nested Loop",
"Parent Relationship": "Outer",
"Parallel Aware": false,
"Join Type": "Inner",
"Startup Cost": 0.87,
"Total Cost": 18802091.94,
"Plan Rows": 38194,
"Plan Width": 4,
"Actual Startup Time": 59750.426,
"Actual Total Time": 77196.101,
"Actual Rows": 20,
"Actual Loops": 1,
"Inner Unique": true,
"Plans": [
{
"Node Type": "Index Scan",
"Parent Relationship": "Outer",
"Parallel Aware": false,
"Scan Direction": "Forward",
"Index Name": "products_purchased_order_id_idx",
"Relation Name": "products_purchased",
"Alias": "products",
"Startup Cost": 0.44,
"Total Cost": 18746328.16,
"Plan Rows": 38194,
"Plan Width": 4,
"Actual Startup Time": 59746.776,
"Actual Total Time": 77171.904,
"Actual Rows": 20,
"Actual Loops": 1,
"Filter": "(product_reference = 2000)",
"Rows Removed by Filter": 514614
},
{
"Node Type": "Index Only Scan",
"Parent Relationship": "Inner",
"Parallel Aware": false,
"Scan Direction": "Forward",
"Index Name": "orders_pkey",
"Relation Name": "orders",
"Alias": "orders",
"Startup Cost": 0.43,
"Total Cost": 1.46,
"Plan Rows": 1,
"Plan Width": 4,
"Actual Startup Time": 1.197,
"Actual Total Time": 1.197,
"Actual Rows": 1,
"Actual Loops": 20,
"Index Cond": "(id = products_purchased.order_id)",
"Rows Removed by Index Recheck": 0,
"Heap Fetches": 10
}
]
}
]
},
"Planning Time": 7.893,
"Triggers": [
],
"Execution Time": 77196.878
}
]
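I tried to reproduce your problem, but I could not get anywhere near the long execution times you report. I assume my data distribution differs from yours, but the gap is still surprisingly large.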
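The distribution guess can be checked directly on the real data. The following is only a sketch, assuming the table and column names from the question and the example value 2000. It shows where the matching rows sit in order_id order, which is what decides how many rows a scan of the order_id index has to discard before it finds the first 20 matches (514614 in the plan above).

```sql
-- Where do the rows for one product sit when the table is read in order_id order?
SELECT min(order_id) AS first_order,
       max(order_id) AS last_order,
       count(*)      AS matching_rows
FROM products_purchased
WHERE product_reference = 2000;

-- How many rows does an order_id-ordered scan pass before the first match?
-- A large number here means the ORDER BY ... LIMIT plan starts slowly.
SELECT count(*) AS rows_before_first_match
FROM products_purchased
WHERE order_id < (SELECT min(order_id)
                  FROM products_purchased
                  WHERE product_reference = 2000);
```

If the second number is large for recently added products and small for old ones, that would match the behaviour described in the question. In the randomly generated test data below, matches are spread evenly, so the scan finds them early.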
My setup:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
create table orders
(
id serial not null,
order_name text not null,
constraint orders_pkey primary key (id)
);
create table products_purchased
(
id serial not null,
order_id integer not null,
product_reference integer not null,
constraint products_purchased_fkey_order foreign key (order_id) references orders (id)
);
alter sequence orders_id_seq cache 100000;
alter sequence products_purchased_id_seq cache 100000;
insert into orders(order_name)
select uuid_generate_v4()
from generate_series(1, 16000000);
insert into products_purchased(order_id, product_reference)
select random() * 15999999 + 1, random() * 10000 + 1
from generate_series(1, 20000000);
alter sequence orders_id_seq cache 1;
alter sequence products_purchased_id_seq cache 1;
create index products_purchased_order_id on products_purchased using btree (order_id);
create index products_purchased_product_ref on products_purchased using btree (product_reference);
vacuum analyse;
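If you want the first N orders that contain a specific product, you need to SELECT DISTINCT the order id; otherwise an order that contains the product more than once will appear more than once.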
SELECT DISTINCT o.id
FROM orders o
INNER JOIN products_purchased p ON p.order_id = o.id
WHERE p.product_reference = 2000
ORDER BY o.id ASC
LIMIT 20;
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1000.90..9677.24 rows=20 width=4) (actual time=184.548..424.877 rows=20 loops=1)
-> Unique (cost=1000.90..866032.55 rows=1994 width=4) (actual time=184.547..424.868 rows=20 loops=1)
-> Nested Loop (cost=1000.90..866027.56 rows=1994 width=4) (actual time=184.546..424.837 rows=20 loops=1)
-> Gather Merge (cost=1000.46..857325.28 rows=1994 width=4) (actual time=184.491..463.432 rows=20 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Index Scan using products_purchased_order_id on products_purchased p (cost=0.44..856095.10 rows=831 width=4) (actual time=70.818..334.005 rows=8 loops=3)
Filter: (product_reference = 2000)
Rows Removed by Filter: 65639
-> Index Only Scan using orders_pkey on orders o (cost=0.43..4.36 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=20)
Index Cond: (id = p.order_id)
Heap Fetches: 0
Planning Time: 0.408 ms
Execution Time: 463.962 ms
(14 rows)
Next, you can use a semi-join instead. On my database, at least, that makes the query about 10 times faster.
SELECT o.id
FROM orders o
WHERE exists(SELECT 1 FROM products_purchased p WHERE p.product_reference = 2000 AND p.order_id = o.id)
ORDER BY o.id ASC
LIMIT 20;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8288.94..11689.62 rows=20 width=4) (actual time=23.580..39.330 rows=20 loops=1)
-> Gather Merge (cost=8288.94..347337.07 rows=1994 width=4) (actual time=23.579..43.328 rows=20 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Merge Semi Join (cost=7288.91..346106.89 rows=831 width=4) (actual time=13.660..28.522 rows=8 loops=3)
Merge Cond: (o.id = p.order_id)
-> Parallel Index Only Scan using orders_pkey on orders o (cost=0.43..322155.39 rows=6666680 width=4) (actual time=0.071..10.471 rows=52366 loops=3)
Heap Fetches: 0
-> Sort (cost=7276.79..7281.77 rows=1994 width=4) (actual time=12.096..12.103 rows=23 loops=3)
Sort Key: p.order_id
Sort Method: quicksort Memory: 194kB
Worker 0: Sort Method: quicksort Memory: 194kB
Worker 1: Sort Method: quicksort Memory: 194kB
-> Bitmap Heap Scan on products_purchased p (cost=40.02..7167.50 rows=1994 width=4) (actual time=1.618..10.885 rows=2074 loops=3)
Recheck Cond: (product_reference = 2000)
Heap Blocks: exact=2053
-> Bitmap Index Scan on products_purchased_product_ref (cost=0.00..39.52 rows=1994 width=0) (actual time=1.100..1.100 rows=2074 loops=3)
Index Cond: (product_reference = 2000)
Planning Time: 0.759 ms
Execution Time: 43.426 ms
(20 rows)
Time: 44.853 ms
create index products_purchased_idx on products_purchased using btree(product_reference, order_id);
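With a composite index on (product_reference, order_id) in place, your original query (with DISTINCT) runs even faster than the semi-join: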
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.87..88.79 rows=20 width=4) (actual time=0.109..0.291 rows=20 loops=1)
-> Unique (cost=0.87..8766.60 rows=1994 width=4) (actual time=0.107..0.284 rows=20 loops=1)
-> Nested Loop (cost=0.87..8761.62 rows=1994 width=4) (actual time=0.105..0.266 rows=20 loops=1)
-> Index Only Scan using products_purchased_idx on products_purchased p (cost=0.44..59.33 rows=1994 width=4) (actual time=0.089..0.106 rows=20 loops=1)
Index Cond: (product_reference = 2000)
Heap Fetches: 0
-> Index Only Scan using orders_pkey on orders o (cost=0.43..4.36 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=20)
Index Cond: (id = p.order_id)
Heap Fetches: 0
Planning Time: 0.536 ms
Execution Time: 0.365 ms
(11 rows)
Time: 1.368 ms
An execution time of ~0.4 ms, compared with ~464 ms using the old indexes, is a speedup of roughly 1100x :)
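For a live database, the same change could be rolled out roughly as follows. This is a sketch under assumptions, not a tested migration: it reuses the index name products_purchased_idx from above, and it assumes PostgreSQL 9.5 or later for IF NOT EXISTS. The last query is an optional variant that only returns the same result when every order_id has a matching row in orders, as the foreign key in my setup guarantees.

```sql
-- Build the composite index without blocking writes.
-- CONCURRENTLY cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS products_purchased_idx
    ON products_purchased USING btree (product_reference, order_id);

-- Refresh planner statistics and the visibility map so that
-- index-only scans can avoid heap fetches.
VACUUM ANALYZE products_purchased;

-- Check that the planner now picks the composite index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT o.id
FROM orders o
INNER JOIN products_purchased p ON p.order_id = o.id
WHERE p.product_reference = 2000
ORDER BY o.id ASC
LIMIT 20;

-- Optional variant (assumes a foreign key on order_id): skip the join,
-- since the composite index alone holds everything the query needs.
SELECT DISTINCT order_id
FROM products_purchased
WHERE product_reference = 2000
ORDER BY order_id ASC
LIMIT 20;
```

The column order matters here: with product_reference first, all rows for one product are stored next to each other in the index and are already sorted by order_id, so the scan can stop after the first 20 entries.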