Postgres:如何优化在多个表上搜索列的相似性查询? (pg_trgm)

问题描述 投票:0回答:1

我正在尝试优化相似性 (

word_similarity
) 查询,该查询在 2 个表上搜索多个列。

表格及其索引如下所示:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Table definitions
CREATE TABLE usr (
    id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
    loc_id bigint,
    first_name text,
    last_name text
);
CREATE TABLE loc (
    id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
    country text,
    city text
);

-- Primary keys
ALTER TABLE ONLY usr
    ADD CONSTRAINT usr_pkey PRIMARY KEY (id);
ALTER TABLE ONLY loc
    ADD CONSTRAINT loc_pkey PRIMARY KEY (id);

-- Foreign key
ALTER TABLE ONLY usr.loc_id
    ADD CONSTRAINT usr_loc_id_fkey FOREIGN KEY (loc_id) REFERENCES loc(id);

-- Indices
CREATE INDEX usr_loc_id_idx ON usr
    USING btree (loc_id);
    
CREATE INDEX usr_first_name_last_name_idx ON usr
    USING gist ((first_name || ' ' || last_name) gist_trgm_ops);
CREATE INDEX loc_country_city_idx ON loc
    USING gist ((country || ' ' || city) gist_trgm_ops);

-- Insert dummy data
INSERT INTO loc (country, city)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25)
FROM generate_series(1, 100000) AS t;

INSERT INTO usr (first_name, last_name, loc_id)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25), trunc(random() * 99999 + 1)
FROM generate_series(1, 1000000) AS t;

当仅搜索一张表的列时,一切似乎都得到了很好的优化:

SELECT
    usr.id AS usr_id, usr.first_name, usr.last_name,
    loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
ORDER  BY 'baker' <<-> (usr.first_name || ' ' || usr.last_name)
LIMIT 20;

这会产生以下查询计划:

Limit  (cost=0.70..236.19 rows=20 width=118) (actual time=397.086..596.842 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=0.70..1178.16 rows=100 width=118) (actual time=397.084..596.839 rows=1 loops=1)
        ->  Index Scan using usr_first_name_last_name_idx on usr  (cost=0.41..422.41 rows=100 width=66) (actual time=397.038..596.792 rows=1 loops=1)
              Index Cond: (((first_name || ' '::text) || last_name) %> 'baker'::text)
              Order By: (((first_name || ' '::text) || last_name) <->> 'baker'::text)
        ->  Index Scan using loc_pkey on loc  (cost=0.29..7.55 rows=1 width=56) (actual time=0.037..0.037 rows=1 loops=1)
              Index Cond: (id = usr.loc_id)
Planning Time: 1.252 ms
Execution Time: 596.931 ms

我尝试编写一个搜索

usr
loc
表上的列的查询是:

SELECT
    usr.id AS usr_id, usr.first_name, usr.last_name,
    loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE
    'baker' <% (usr.first_name || ' ' || usr.last_name)
    OR
    'baker' <% (loc.country || ' ' || loc.city)
ORDER BY
    ('baker' <<-> (usr.first_name || ' ' || usr.last_name))
    +
    ('baker' <<-> (loc.country || ' ' || loc.city));

这个查询计划是:

Gather Merge  (cost=21071.24..21090.61 rows=166 width=118) (actual time=6297.322..6301.774 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Sort  (cost=20071.22..20071.42 rows=83 width=118) (actual time=6286.205..6286.208 rows=0 loops=3)
        Sort Key: ((('baker'::text <<-> ((usr.first_name || ' '::text) || usr.last_name)) + ('baker'::text <<-> ((loc.country || ' '::text) || loc.city))))
        Sort Method: quicksort  Memory: 25kB
        Worker 0:  Sort Method: quicksort  Memory: 25kB
        Worker 1:  Sort Method: quicksort  Memory: 25kB
        ->  Parallel Hash Left Join  (cost=2460.57..20068.57 rows=83 width=118) (actual time=4197.337..6284.058 rows=0 loops=3)
              Hash Cond: (usr.loc_id = loc.id)
              Filter: (('baker'::text <% ((usr.first_name || ' '::text) || usr.last_name)) OR ('baker'::text <% ((loc.country || ' '::text) || loc.city)))
              Rows Removed by Filter: 333335
              ->  Parallel Seq Scan on usr  (cost=0.00..16512.69 rows=416669 width=66) (actual time=0.078..214.231 rows=333335 loops=3)
              ->  Parallel Hash  (cost=1725.25..1725.25 rows=58825 width=56) (actual time=23.072..23.073 rows=33334 loops=3)
                    Buckets: 131072  Batches: 1  Memory Usage: 10432kB
                    ->  Parallel Seq Scan on loc  (cost=0.00..1725.25 rows=58825 width=56) (actual time=0.007..5.594 rows=33334 loops=3)
Planning Time: 3.470 ms
Execution Time: 6301.859 ms

有什么方法可以改进这个查询吗?或者这样的“多表相似性”查询在Postgres中不可能吗?

谢谢!

postgresql performance indexing query-optimization pg-trgm
1个回答
0
投票

看起来

OR
阻止了您的索引启动。您可以分别运行每个条件的选择,
UNION
,然后在
demo
之后应用您想要的 ORDER

EXPLAIN ANALYSE VERBOSE
SELECT * FROM
(   SELECT
        usr.id AS usr_id, usr.first_name, usr.last_name,
        loc.id AS loc_id, loc.country, loc.city
    FROM usr
    LEFT JOIN loc ON usr.loc_id = loc.id
    WHERE 'baker' <% (loc.country || ' ' || loc.city)
    UNION
    SELECT
        usr.id AS usr_id, usr.first_name, usr.last_name,
        loc.id AS loc_id, loc.country, loc.city
    FROM usr
    LEFT JOIN loc ON usr.loc_id = loc.id
    WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
) ORDER BY 
    ('baker' <<-> (first_name || ' ' || last_name))
    +
    ('baker' <<-> (country || ' ' || city));
| QUERY PLAN |
|:-----------|
| Sort  (cost=5112.22..5115.00 rows=1111 width=148) (actual time=34.165..34.171 rows=0 loops=1) |
|   Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
|   Sort Key: ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
|   Sort Method: quicksort  Memory: 25kB |
|   ->  Subquery Scan on \_  (cost=5025.47..5056.02 rows=1111 width=148) (actual time=34.159..34.164 rows=0 loops=1) |
|         Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, (('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city))) |
|         ->  HashAggregate  (cost=5025.47..5036.58 rows=1111 width=144) (actual time=34.158..34.163 rows=0 loops=1) |
|               Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
|               Group Key: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
|               Batches: 1  Memory Usage: 73kB |
|               ->  Append  (cost=798.13..5008.80 rows=1111 width=144) (actual time=34.144..34.148 rows=0 loops=1) |
|                     ->  Hash Join  (cost=798.13..2240.77 rows=555 width=144) (actual time=16.292..16.294 rows=0 loops=1) |
|                           Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
|                           Inner Unique: true |
|                           Hash Cond: (usr.loc\_id = loc.id) |
|                           ->  Seq Scan on usr.usr  (cost=0.00..1296.75 rows=55575 width=80) (actual time=0.007..0.007 rows=1 loops=1) |
|                                 Output: usr.id, usr.loc\_id, usr.first\_name, usr.last\_name |
|                           ->  Hash  (cost=791.23..791.23 rows=552 width=72) (actual time=16.280..16.281 rows=0 loops=1) |
|                                 Output: loc.id, loc.country, loc.city |
|                                 Buckets: 1024  Batches: 1  Memory Usage: 8kB |
|                                 ->  Bitmap Heap Scan on usr.loc  (cost=104.56..791.23 rows=552 width=72) (actual time=16.279..16.280 rows=0 loops=1) |
|                                       Output: loc.id, loc.country, loc.city |
|                                       Filter: ('baker'::text \<% ((loc.country \|\| ' '::text) \|\| loc.city)) |
|                                       ->  Bitmap Index Scan on loc\_country\_city\_idx  (cost=0.00..104.42 rows=552 width=0) (actual time=16.275..16.275 rows=0 loops=1) |
|                                             Index Cond: (((loc.country \|\| ' '::text) \|\| loc.city) %> 'baker'::text) |
|                     ->  Hash Left Join  (cost=2029.53..2762.48 rows=556 width=144) (actual time=17.844..17.845 rows=0 loops=1) |
|                           Output: usr\_1.id, usr\_1.first\_name, usr\_1.last\_name, loc\_1.id, loc\_1.country, loc\_1.city |
|                           Inner Unique: true |
|                           Hash Cond: (usr\_1.loc\_id = loc\_1.id) |
|                           ->  Bitmap Heap Scan on usr.usr usr\_1  (cost=104.59..836.07 rows=556 width=80) (actual time=17.843..17.843 rows=0 loops=1) |
|                                 Output: usr\_1.id, usr\_1.loc\_id, usr\_1.first\_name, usr\_1.last\_name |
|                                 Filter: ('baker'::text \<% ((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name)) |
|                                 ->  Bitmap Index Scan on usr\_first\_name\_last\_name\_idx  (cost=0.00..104.45 rows=556 width=0) (actual time=17.838..17.839 rows=0 loops=1) |
|                                       Index Cond: (((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name) %> 'baker'::text) |
|                           ->  Hash  (cost=1234.42..1234.42 rows=55242 width=72) (never executed) |
|                                 Output: loc\_1.id, loc\_1.country, loc\_1.city |
|                                 ->  Seq Scan on usr.loc loc\_1  (cost=0.00..1234.42 rows=55242 width=72) (never executed) |
|                                       Output: loc\_1.id, loc\_1.country, loc\_1.city |
| Planning Time: 0.314 ms |
| Execution Time: 34.307 ms |
© www.soinside.com 2019 - 2024. All rights reserved.