我正在尝试优化相似性 (
word_similarity
) 查询,该查询在 2 个表上搜索多个列。
表格及其索引如下所示:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Table definitions
CREATE TABLE usr (
id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
loc_id bigint,
first_name text,
last_name text
);
CREATE TABLE loc (
id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
country text,
city text
);
-- Primary keys
ALTER TABLE ONLY usr
ADD CONSTRAINT usr_pkey PRIMARY KEY (id);
ALTER TABLE ONLY loc
ADD CONSTRAINT loc_pkey PRIMARY KEY (id);
-- Foreign key
ALTER TABLE ONLY usr.loc_id
ADD CONSTRAINT usr_loc_id_fkey FOREIGN KEY (loc_id) REFERENCES loc(id);
-- Indices
CREATE INDEX usr_loc_id_idx ON usr
USING btree (loc_id);
CREATE INDEX usr_first_name_last_name_idx ON usr
USING gist ((first_name || ' ' || last_name) gist_trgm_ops);
CREATE INDEX loc_country_city_idx ON loc
USING gist ((country || ' ' || city) gist_trgm_ops);
-- Insert dummy data
INSERT INTO loc (country, city)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25)
FROM generate_series(1, 100000) AS t;
INSERT INTO usr (first_name, last_name, loc_id)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25), trunc(random() * 99999 + 1)
FROM generate_series(1, 1000000) AS t;
当仅搜索一张表的列时,一切似乎都得到了很好的优化:
SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
ORDER BY 'baker' <<-> (usr.first_name || ' ' || usr.last_name)
LIMIT 20;
这会产生以下查询计划:
Limit (cost=0.70..236.19 rows=20 width=118) (actual time=397.086..596.842 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.70..1178.16 rows=100 width=118) (actual time=397.084..596.839 rows=1 loops=1)
-> Index Scan using usr_first_name_last_name_idx on usr (cost=0.41..422.41 rows=100 width=66) (actual time=397.038..596.792 rows=1 loops=1)
Index Cond: (((first_name || ' '::text) || last_name) %> 'baker'::text)
Order By: (((first_name || ' '::text) || last_name) <->> 'baker'::text)
-> Index Scan using loc_pkey on loc (cost=0.29..7.55 rows=1 width=56) (actual time=0.037..0.037 rows=1 loops=1)
Index Cond: (id = usr.loc_id)
Planning Time: 1.252 ms
Execution Time: 596.931 ms
我尝试编写一个搜索
usr
和 loc
表上的列的查询是:
SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE
'baker' <% (usr.first_name || ' ' || usr.last_name)
OR
'baker' <% (loc.country || ' ' || loc.city)
ORDER BY
('baker' <<-> (usr.first_name || ' ' || usr.last_name))
+
('baker' <<-> (loc.country || ' ' || loc.city));
这个查询计划是:
Gather Merge (cost=21071.24..21090.61 rows=166 width=118) (actual time=6297.322..6301.774 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=20071.22..20071.42 rows=83 width=118) (actual time=6286.205..6286.208 rows=0 loops=3)
Sort Key: ((('baker'::text <<-> ((usr.first_name || ' '::text) || usr.last_name)) + ('baker'::text <<-> ((loc.country || ' '::text) || loc.city))))
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Parallel Hash Left Join (cost=2460.57..20068.57 rows=83 width=118) (actual time=4197.337..6284.058 rows=0 loops=3)
Hash Cond: (usr.loc_id = loc.id)
Filter: (('baker'::text <% ((usr.first_name || ' '::text) || usr.last_name)) OR ('baker'::text <% ((loc.country || ' '::text) || loc.city)))
Rows Removed by Filter: 333335
-> Parallel Seq Scan on usr (cost=0.00..16512.69 rows=416669 width=66) (actual time=0.078..214.231 rows=333335 loops=3)
-> Parallel Hash (cost=1725.25..1725.25 rows=58825 width=56) (actual time=23.072..23.073 rows=33334 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 10432kB
-> Parallel Seq Scan on loc (cost=0.00..1725.25 rows=58825 width=56) (actual time=0.007..5.594 rows=33334 loops=3)
Planning Time: 3.470 ms
Execution Time: 6301.859 ms
有什么方法可以改进这个查询吗?或者这样的“多表相似性”查询在Postgres中不可能吗?
谢谢!
看起来
OR
阻止了您的索引启动。您可以分别运行每个条件的选择,UNION
,然后在 demo之后应用您想要的
ORDER
:
EXPLAIN ANALYSE VERBOSE
SELECT * FROM
( SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (loc.country || ' ' || loc.city)
UNION
SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
) ORDER BY
('baker' <<-> (first_name || ' ' || last_name))
+
('baker' <<-> (country || ' ' || city));
| QUERY PLAN |
|:-----------|
| Sort (cost=5112.22..5115.00 rows=1111 width=148) (actual time=34.165..34.171 rows=0 loops=1) |
| Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
| Sort Key: ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
| Sort Method: quicksort Memory: 25kB |
| -> Subquery Scan on \_ (cost=5025.47..5056.02 rows=1111 width=148) (actual time=34.159..34.164 rows=0 loops=1) |
| Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, (('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city))) |
| -> HashAggregate (cost=5025.47..5036.58 rows=1111 width=144) (actual time=34.158..34.163 rows=0 loops=1) |
| Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
| Group Key: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
| Batches: 1 Memory Usage: 73kB |
| -> Append (cost=798.13..5008.80 rows=1111 width=144) (actual time=34.144..34.148 rows=0 loops=1) |
| -> Hash Join (cost=798.13..2240.77 rows=555 width=144) (actual time=16.292..16.294 rows=0 loops=1) |
| Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
| Inner Unique: true |
| Hash Cond: (usr.loc\_id = loc.id) |
| -> Seq Scan on usr.usr (cost=0.00..1296.75 rows=55575 width=80) (actual time=0.007..0.007 rows=1 loops=1) |
| Output: usr.id, usr.loc\_id, usr.first\_name, usr.last\_name |
| -> Hash (cost=791.23..791.23 rows=552 width=72) (actual time=16.280..16.281 rows=0 loops=1) |
| Output: loc.id, loc.country, loc.city |
| Buckets: 1024 Batches: 1 Memory Usage: 8kB |
| -> Bitmap Heap Scan on usr.loc (cost=104.56..791.23 rows=552 width=72) (actual time=16.279..16.280 rows=0 loops=1) |
| Output: loc.id, loc.country, loc.city |
| Filter: ('baker'::text \<% ((loc.country \|\| ' '::text) \|\| loc.city)) |
| -> Bitmap Index Scan on loc\_country\_city\_idx (cost=0.00..104.42 rows=552 width=0) (actual time=16.275..16.275 rows=0 loops=1) |
| Index Cond: (((loc.country \|\| ' '::text) \|\| loc.city) %> 'baker'::text) |
| -> Hash Left Join (cost=2029.53..2762.48 rows=556 width=144) (actual time=17.844..17.845 rows=0 loops=1) |
| Output: usr\_1.id, usr\_1.first\_name, usr\_1.last\_name, loc\_1.id, loc\_1.country, loc\_1.city |
| Inner Unique: true |
| Hash Cond: (usr\_1.loc\_id = loc\_1.id) |
| -> Bitmap Heap Scan on usr.usr usr\_1 (cost=104.59..836.07 rows=556 width=80) (actual time=17.843..17.843 rows=0 loops=1) |
| Output: usr\_1.id, usr\_1.loc\_id, usr\_1.first\_name, usr\_1.last\_name |
| Filter: ('baker'::text \<% ((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name)) |
| -> Bitmap Index Scan on usr\_first\_name\_last\_name\_idx (cost=0.00..104.45 rows=556 width=0) (actual time=17.838..17.839 rows=0 loops=1) |
| Index Cond: (((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name) %> 'baker'::text) |
| -> Hash (cost=1234.42..1234.42 rows=55242 width=72) (never executed) |
| Output: loc\_1.id, loc\_1.country, loc\_1.city |
| -> Seq Scan on usr.loc loc\_1 (cost=0.00..1234.42 rows=55242 width=72) (never executed) |
| Output: loc\_1.id, loc\_1.country, loc\_1.city |
| Planning Time: 0.314 ms |
| Execution Time: 34.307 ms |