Best way to execute a SELECT with a WHERE condition in PostgreSQL

Question

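I am currently building a search for my website, and I am struggling to get reasonable performance out of PostgreSQL queries that look simple on the surface.

Suppose I have two tables:

orders (id, order_name, buyer, status), with 10M rows

products_purchased (id, order_id, product_reference), with 20M rows

Say I am now looking for the first 20 orders that contain a specific product: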

SELECT
  orders.id
FROM
  orders
INNER JOIN products_purchased ON products_purchased.order_id = orders.id
WHERE products_purchased.product_reference = 7
ORDER BY orders.id ASC
LIMIT 20

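This query takes between 5 and 120 seconds, even though I have all the appropriate indexes in place.

This seems to be caused by the ORDER BY clause.

The problem gets worse as I add more products. For example, suppose I add a new product_reference a year from now. If I then search for the first 20 orders, it may take even longer, because the whole table has to be scanned to find the first 2 orders.

What is the best practice for running this kind of search on large data sets?

Thank you very much for your help.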

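Additional data --------

I have indexes where they are needed:

orders.id
products_purchased.order_id
products_purchased.product_reference

Actual database size:

orders: 16M rows
products_purchased: 20M rows

For example, selecting all orders with product_reference = 2000 takes 120 seconds, even though product_reference = 2000 occurs only 46,000 times in the products_purchased table.

The execution plan is as follows: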

[
  {
    "Plan": {
      "Node Type": "Limit",
      "Parallel Aware": false,
      "Startup Cost": 0.87,
      "Total Cost": 9846.45,
      "Plan Rows": 20,
      "Plan Width": 4,
      "Actual Startup Time": 59750.428,
      "Actual Total Time": 77196.124,
      "Actual Rows": 20,
      "Actual Loops": 1,
      "Plans": [
        {
          "Node Type": "Nested Loop",
          "Parent Relationship": "Outer",
          "Parallel Aware": false,
          "Join Type": "Inner",
          "Startup Cost": 0.87,
          "Total Cost": 18802091.94,
          "Plan Rows": 38194,
          "Plan Width": 4,
          "Actual Startup Time": 59750.426,
          "Actual Total Time": 77196.101,
          "Actual Rows": 20,
          "Actual Loops": 1,
          "Inner Unique": true,
          "Plans": [
            {
              "Node Type": "Index Scan",
              "Parent Relationship": "Outer",
              "Parallel Aware": false,
              "Scan Direction": "Forward",
              "Index Name": "products_purchased_order_id_idx",
              "Relation Name": "products_purchased",
              "Alias": "products",
              "Startup Cost": 0.44,
              "Total Cost": 18746328.16,
              "Plan Rows": 38194,
              "Plan Width": 4,
              "Actual Startup Time": 59746.776,
              "Actual Total Time": 77171.904,
              "Actual Rows": 20,
              "Actual Loops": 1,
              "Filter": "(product_reference = 2000)",
              "Rows Removed by Filter": 514614
            },
            {
              "Node Type": "Index Only Scan",
              "Parent Relationship": "Inner",
              "Parallel Aware": false,
              "Scan Direction": "Forward",
              "Index Name": "orders_pkey",
              "Relation Name": "orders",
              "Alias": "orders",
              "Startup Cost": 0.43,
              "Total Cost": 1.46,
              "Plan Rows": 1,
              "Plan Width": 4,
              "Actual Startup Time": 1.197,
              "Actual Total Time": 1.197,
              "Actual Rows": 1,
              "Actual Loops": 20,
              "Index Cond": "(id = products_purchased.order_id)",
              "Rows Removed by Index Recheck": 0,
              "Heap Fetches": 10
            }
          ]
        }
      ]
    },
    "Planning Time": 7.893,
    "Triggers": [
    ],
    "Execution Time": 77196.878
  }
]

Answer 1

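I tried to reproduce your problem, but I could not get timings anywhere near as high as the ones you report. I suspect my data distribution differs from yours, but the gap is still very large.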
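One way to compare the two distributions is to look at how many rows match the product and where their order ids fall. This is only a sketch, using the table and column names from the question and the value 2000 from your plan. If the smallest matching order_id is large, a scan in order_id order has to step over many non-matching rows before it finds its first hit.

```sql
-- How many rows match, and where do they sit in the order_id range?
SELECT count(*)      AS matching_rows,
       min(order_id) AS first_matching_order_id,
       max(order_id) AS last_matching_order_id
FROM products_purchased
WHERE product_reference = 2000;
```

Comparing first_matching_order_id with the smallest orders.id gives a rough idea of how much of the order_id index must be read before the first match.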

My setup:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";


create table orders
(
    id         serial not null,
    order_name text   not null,

    constraint orders_pkey primary key (id)
);

create table products_purchased
(
    id                serial  not null,
    order_id          integer not null,
    product_reference integer not null,


    constraint products_purchased_fkey_order foreign key (order_id) references orders (id)
);

alter sequence orders_id_seq cache 100000;
alter sequence products_purchased_id_seq cache 100000;

insert into orders(order_name)
select uuid_generate_v4()
from generate_series(1, 16000000);

insert into products_purchased(order_id, product_reference)
select random() * 15999999 + 1, random() * 10000 + 1
from generate_series(1, 20000000);

alter sequence orders_id_seq cache 1;
alter sequence products_purchased_id_seq cache 1;

create index products_purchased_order_id on products_purchased using btree (order_id);
create index products_purchased_product_ref on products_purchased using btree (product_reference);

vacuum analyse;

If you want the first N orders that contain a specific product, you need to select DISTINCT order ids; otherwise you may get the same order more than once.
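A toy illustration of why the duplicates appear, assuming an order can hold the same product on more than one row. The demo_purchases name and its values are made up for this example.

```sql
-- Order 1 contains product 7 on two rows, order 2 on one row.
WITH demo_purchases(order_id, product_reference) AS (
    VALUES (1, 7), (1, 7), (2, 7), (3, 9)
)
SELECT order_id            -- order 1 is returned twice
FROM demo_purchases
WHERE product_reference = 7
ORDER BY order_id;

WITH demo_purchases(order_id, product_reference) AS (
    VALUES (1, 7), (1, 7), (2, 7), (3, 9)
)
SELECT DISTINCT order_id   -- each order is returned once
FROM demo_purchases
WHERE product_reference = 7
ORDER BY order_id;
```

Without DISTINCT, a LIMIT 20 could therefore return fewer than 20 different orders.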

Baseline (your query):

SELECT DISTINCT o.id
FROM orders o
         INNER JOIN products_purchased p ON p.order_id = o.id
WHERE p.product_reference = 2000
ORDER BY o.id ASC
LIMIT 20;

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.90..9677.24 rows=20 width=4) (actual time=184.548..424.877 rows=20 loops=1)
   ->  Unique  (cost=1000.90..866032.55 rows=1994 width=4) (actual time=184.547..424.868 rows=20 loops=1)
         ->  Nested Loop  (cost=1000.90..866027.56 rows=1994 width=4) (actual time=184.546..424.837 rows=20 loops=1)
               ->  Gather Merge  (cost=1000.46..857325.28 rows=1994 width=4) (actual time=184.491..463.432 rows=20 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Parallel Index Scan using products_purchased_order_id on products_purchased p  (cost=0.44..856095.10 rows=831 width=4) (actual time=70.818..334.005 rows=8 loops=3)
                           Filter: (product_reference = 2000)
                           Rows Removed by Filter: 65639
               ->  Index Only Scan using orders_pkey on orders o  (cost=0.43..4.36 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=20)
                     Index Cond: (id = p.order_id)
                     Heap Fetches: 0
 Planning Time: 0.408 ms
 Execution Time: 463.962 ms
(14 rows)
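Plans like the one above come from EXPLAIN. A minimal sketch for reproducing it on your side follows; the BUFFERS option is my addition and does not appear in the output shown in this answer. Note that ANALYZE actually executes the query.

```sql
-- Runs the query and reports the actual timings and row counts per plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT o.id
FROM orders o
         INNER JOIN products_purchased p ON p.order_id = o.id
WHERE p.product_reference = 2000
ORDER BY o.id ASC
LIMIT 20;
```

The line to watch is "Rows Removed by Filter" under the scan of products_purchased: it counts the rows read in order_id order and thrown away because they belong to other products.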

Assuming you cannot modify the indexes

In that case you can use a semi-join; on my DB, at least, it makes the query about 10 times faster:

SELECT o.id
FROM orders o
WHERE exists(SELECT 1 FROM products_purchased p WHERE p.product_reference = 2000 AND p.order_id = o.id)
ORDER BY o.id ASC
LIMIT 20;

                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8288.94..11689.62 rows=20 width=4) (actual time=23.580..39.330 rows=20 loops=1)
   ->  Gather Merge  (cost=8288.94..347337.07 rows=1994 width=4) (actual time=23.579..43.328 rows=20 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Merge Semi Join  (cost=7288.91..346106.89 rows=831 width=4) (actual time=13.660..28.522 rows=8 loops=3)
               Merge Cond: (o.id = p.order_id)
               ->  Parallel Index Only Scan using orders_pkey on orders o  (cost=0.43..322155.39 rows=6666680 width=4) (actual time=0.071..10.471 rows=52366 loops=3)
                     Heap Fetches: 0
               ->  Sort  (cost=7276.79..7281.77 rows=1994 width=4) (actual time=12.096..12.103 rows=23 loops=3)
                     Sort Key: p.order_id
                     Sort Method: quicksort  Memory: 194kB
                     Worker 0:  Sort Method: quicksort  Memory: 194kB
                     Worker 1:  Sort Method: quicksort  Memory: 194kB
                     ->  Bitmap Heap Scan on products_purchased p  (cost=40.02..7167.50 rows=1994 width=4) (actual time=1.618..10.885 rows=2074 loops=3)
                           Recheck Cond: (product_reference = 2000)
                           Heap Blocks: exact=2053
                           ->  Bitmap Index Scan on products_purchased_product_ref  (cost=0.00..39.52 rows=1994 width=0) (actual time=1.100..1.100 rows=2074 loops=3)
                                 Index Cond: (product_reference = 2000)
 Planning Time: 0.759 ms
 Execution Time: 43.426 ms
(20 rows)

Time: 44.853 ms
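The same semi-join can also be written with IN, which PostgreSQL typically plans the same way as EXISTS. This is just an alternative spelling; I have not timed it separately, so check the plan on your own data.

```sql
-- Each order appears at most once, so no DISTINCT is needed.
SELECT o.id
FROM orders o
WHERE o.id IN (SELECT p.order_id
               FROM products_purchased p
               WHERE p.product_reference = 2000)
ORDER BY o.id ASC
LIMIT 20;
```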

But if you can modify the indexes, you can create a more suitable one:

create index products_purchased_idx on products_purchased using btree(product_reference, order_id);

With that index, the baseline query runs much faster than the semi-join:

                                                                               QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.87..88.79 rows=20 width=4) (actual time=0.109..0.291 rows=20 loops=1)
   ->  Unique  (cost=0.87..8766.60 rows=1994 width=4) (actual time=0.107..0.284 rows=20 loops=1)
         ->  Nested Loop  (cost=0.87..8761.62 rows=1994 width=4) (actual time=0.105..0.266 rows=20 loops=1)
               ->  Index Only Scan using products_purchased_idx on products_purchased p  (cost=0.44..59.33 rows=1994 width=4) (actual time=0.089..0.106 rows=20 loops=1)
                     Index Cond: (product_reference = 2000)
                     Heap Fetches: 0
               ->  Index Only Scan using orders_pkey on orders o  (cost=0.43..4.36 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=20)
                     Index Cond: (id = p.order_id)
                     Heap Fetches: 0
 Planning Time: 0.536 ms
 Execution Time: 0.365 ms
(11 rows)

Time: 1.368 ms

An execution time of ~0.4 ms, compared with ~464 ms using the old indexes, is a speedup of roughly 1100x :)
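With the (product_reference, order_id) index in place, the join against orders may not be needed at all. A sketch, assuming that only the order id is required and that every products_purchased.order_id refers to an existing order (the foreign key in my setup guarantees this; your schema may differ):

```sql
-- Read the order ids straight from products_purchased, with no join.
SELECT DISTINCT p.order_id
FROM products_purchased p
WHERE p.product_reference = 2000
ORDER BY p.order_id ASC
LIMIT 20;
```

I have not benchmarked this variant, so compare its plan with the one above before relying on it.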
