SQLite3 query optimization: join vs. subselect

Question · Votes: 0 · Answers: 6

I'm trying to figure out the best way (it probably doesn't matter much in this case) to find the rows of one table based on the presence of a flag and a relation id in the rows of another table.

Here is the schema:

CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);

CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);

I'm using SQLite3.

The files table will be quite large, typically 10K-5M rows. resume_points will be small, under 10K rows, with only 1-2 distinct scan_file_id values.

So my first thought was:

select distinct files.* from resume_points inner join files
on resume_points.scan_file_id=files.id where files.dirty = 1;

A coworker suggested reversing the join:

select distinct files.* from files inner join resume_points
on files.id=resume_points.scan_file_id where files.dirty = 1;

Then I thought: since we know the number of distinct scan_file_id values will be so small, maybe a subselect would be optimal (in this rare case):

select * from files where id in (select distinct scan_file_id from resume_points);

The explain output for each has 42, 42, and 48 rows, respectively.
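Raw explain opcode counts are hard to compare directly; `EXPLAIN QUERY PLAN` gives a more readable summary of the strategy SQLite picks. A minimal sketch using Python's built-in sqlite3 module, with the schema from above (empty tables are enough to see the chosen plans):

```python
import sqlite3

# Recreate the schema from the question and compare the plans SQLite
# chooses for each of the three candidate queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
""")

queries = [
    "select distinct files.* from resume_points inner join files "
    "on resume_points.scan_file_id=files.id where files.dirty = 1",
    "select distinct files.* from files inner join resume_points "
    "on files.id=resume_points.scan_file_id where files.dirty = 1",
    "select * from files where id in "
    "(select distinct scan_file_id from resume_points)",
]
for q in queries:
    print(q)
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    for (_, _, _, detail) in conn.execute("EXPLAIN QUERY PLAN " + q):
        print("   ", detail)
```

The exact detail strings vary by SQLite version, so read them side by side rather than expecting fixed text.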

sql database sqlite query-optimization
6 Answers

12 votes

TL;DR: the best query and index are:

create index uniqueFiles on resume_points (scan_file_id);
select * from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1;

Since I usually work with SQL Server, at first I thought the query optimizer would surely find the optimal execution plan for such a simple query, no matter which way you write these equivalent SQL statements. So I downloaded SQLite and started playing around. To my surprise, there was a huge difference in performance.

Here is the setup code:

CREATE TABLE files (
    id INTEGER PRIMARY KEY autoincrement,
    dirty INTEGER NOT NULL);

CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);

insert into files (dirty) values (0);
-- each insert below doubles the table; 23 doublings give 2^23 ≈ 8.4M rows
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;

-- the uncorrelated scalar subquery is evaluated once per statement, so each
-- insert adds 5000 copies of a single random id (hence only 1-2 distinct values)
insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

I considered these indexes:

create index dirtyFiles on files (dirty, id);
create index uniqueFiles on resume_points (scan_file_id);
create index fileLookup on files (id);

Here are the queries I tried and their execution times on my i5 laptop. The database file is only about 200 MB because it contains no other data.

select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
4.3 - 4.5 ms with and without index

select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
4.4 - 4.7 ms with and without index

select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
2.0 - 2.5 ms with uniqueFiles
2.6 - 2.9 ms without uniqueFiles

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
2.1 - 2.5 ms with uniqueFiles
2.6 - 3 ms without uniqueFiles

SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1 GROUP BY f.id
4500 - 6190 ms with uniqueFiles
8.8 - 9.5 ms without uniqueFiles
14000 ms with uniqueFiles and fileLookup

select * from files where exists (
select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
8400 ms with uniqueFiles
7400 ms without uniqueFiles

It looks like SQLite's query optimizer is not very advanced at all. The best queries first reduce resume_points to a small number of rows (two in the test case; the OP says it will be 1-2) and then look up the file to see whether it is dirty.

The dirtyFiles index did not make much of a difference for any of the queries. That may be due to how the data happens to be arranged in the test tables; it might make a difference on the production tables. Either way the difference shouldn't be large, since there won't be many lookups.

The uniqueFiles index is what makes the difference, because it can reduce the 10000 resume_points rows to 2 without scanning most of the table.

The fileLookup index made some queries slightly faster, but not enough to change the results significantly. Notably, it made the GROUP BY query very slow. In summary, reducing the result set early is what makes the biggest difference.
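The comparison above can be loosely re-run with a small harness. This is a sketch using Python's built-in sqlite3 module; as an assumption to keep it fast, it builds 2^17 files instead of 2^23, but keeps the same trick of inserting resume_points in two statements so there are at most two distinct scan_file_id values:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
                            scan_file_id INTEGER NOT NULL);
""")
conn.execute("insert into files (dirty) values (0)")
for _ in range(17):  # 17 doublings -> 2^17 = 131072 rows
    conn.execute("insert into files (dirty) "
                 "select (case when random() < 0 then 1 else 0 end) from files")
for _ in range(2):   # two statements -> at most 2 distinct scan_file_id values
    conn.execute("insert into resume_points (scan_file_id) "
                 "select (select abs(random() % 100000)) from files limit 5000")
conn.execute("create index uniqueFiles on resume_points (scan_file_id)")
conn.commit()

def bench(sql):
    """Run the query once and return (rows, elapsed milliseconds)."""
    t0 = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, (time.perf_counter() - t0) * 1000

join_rows, join_ms = bench(
    "select distinct files.* from resume_points inner join files "
    "on resume_points.scan_file_id = files.id where files.dirty = 1")
derived_rows, derived_ms = bench(
    "select * from (select distinct scan_file_id from resume_points) d "
    "join files on d.scan_file_id = files.id and files.dirty = 1")
print(f"join+distinct: {join_ms:.1f} ms, derived table: {derived_ms:.1f} ms")
```

Absolute numbers will differ from the table above, but the derived-table query should stay at or below the join-then-distinct query.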


1 vote

Since files.id is the primary key, try GROUPing BY that field instead of selecting DISTINCT files.*:

SELECT f.*
FROM resume_points rp
INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1
GROUP BY f.id

Another option to consider for performance is adding an index on resume_points.scan_file_id:

CREATE INDEX index_resume_points_scan_file_id ON resume_points (scan_file_id)

1 vote

You can try exists, which won't produce any duplicate files rows:
select * from files
where exists (
    select * from resume_points 
    where files.id = resume_points.scan_file_id
)
and dirty = 1;

Of course, it may help to have the right indexes:

files.dirty
resume_points.scan_file_id

Whether an index is useful depends on your data.
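The no-duplicates claim is easy to check on a toy data set. A sketch with Python's built-in sqlite3 module, where two resume_points rows reference the same file:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
                            scan_file_id INTEGER NOT NULL);
""")
conn.execute("insert into files (id, dirty) values (1, 1)")
# two resume_points rows pointing at the same file
conn.executemany("insert into resume_points (scan_file_id) values (?)",
                 [(1,), (1,)])

plain_join = conn.execute(
    "select files.* from files join resume_points "
    "on files.id = resume_points.scan_file_id where files.dirty = 1").fetchall()
with_exists = conn.execute(
    "select * from files where exists (select * from resume_points "
    "where files.id = resume_points.scan_file_id) and dirty = 1").fetchall()
print(plain_join)   # [(1, 1), (1, 1)] -- one row per matching resume_point
print(with_exists)  # [(1, 1)]         -- no duplicates, no distinct needed
```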


1 vote

I think jtseng has given the solution.

select * from (select distinct scan_file_id from resume_points) d
join files on d.scan_file_id = files.id and files.dirty = 1

It's basically the same as the last option you posted:

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;

This is because you have to avoid the full table scan/join.

So first you get the 1-2 distinct ids:

select distinct scan_file_id from resume_points

After that, only 1-2 rows are joined against the other table instead of all 10K, which optimizes performance.

If you need this statement multiple times, I would put it into a view. The view doesn't change performance, but it looks cleaner and is easier to read.
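For instance, wrapping the statement in a view might look like this (a sketch via Python's built-in sqlite3 module; the view name dirty_resume_files and the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
                            scan_file_id INTEGER NOT NULL);
-- hypothetical view name wrapping the recommended query
CREATE VIEW dirty_resume_files AS
  SELECT files.* FROM (SELECT DISTINCT scan_file_id FROM resume_points) d
  JOIN files ON d.scan_file_id = files.id AND files.dirty = 1;
""")
conn.execute("insert into files (id, dirty) values (1, 1), (2, 0)")
conn.execute("insert into resume_points (scan_file_id) values (1), (1), (2)")

# file 1 is dirty and referenced; file 2 is referenced but clean
print(conn.execute("select * from dirty_resume_files").fetchall())  # [(1, 1)]
```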

Also have a look at the query optimization documentation: http://www.sqlite.org/optoverview.html


0 votes

If the "resume_points" table has only one or two distinct file id numbers, it seems it only needs one or two rows, with scan_file_id as its primary key. That table would have only two columns, and its id number serves no purpose.

If so, you don't need either id number.

pragma foreign_keys = on;
CREATE TABLE resume_points (
  scan_file_id integer primary key
);

CREATE TABLE files (
  scan_file_id integer not null references resume_points (scan_file_id),
  dirty INTEGER NOT NULL,
  primary key (scan_file_id, dirty)
);

Now you don't need a join either. Just query the "files" table.
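A sketch of the redesigned schema in use, via Python's built-in sqlite3 module (the sample values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
pragma foreign_keys = on;
CREATE TABLE resume_points (scan_file_id integer primary key);
CREATE TABLE files (
  scan_file_id integer not null references resume_points (scan_file_id),
  dirty INTEGER NOT NULL,
  primary key (scan_file_id, dirty)
);
""")
# the parent row must exist first because the foreign key is enforced
conn.execute("insert into resume_points values (10)")
conn.execute("insert into files values (10, 1)")

# no join needed: files alone answers the question
print(conn.execute("select * from files where dirty = 1").fetchall())  # [(10, 1)]
```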


0 votes

Tested with 1 million rows in sqlite-studio.

Subquery:

SELECT l.id AS id, l.col1, l.col2, lr.col1 FROM links l 
JOIN links_root_folders lr ON l.lr_id = lr.id 
WHERE l.id NOT IN (SELECT c_id FROM processed WHERE c_type = 4) 
ORDER BY l.col4 ASC LIMIT 1;

[22:06:05] Query finished in 0.108 second(s).

Join query:

SELECT l.id AS id, l.col1, l.col2, lr.col1 FROM links l 
JOIN links_root_folders lr ON l.lr_id = lr.id 
LEFT JOIN processed p ON p.c_id = l.id AND p.c_type = 4 
WHERE p.c_id IS NULL 
ORDER BY l.col4 ASC LIMIT 1;

[22:04:23] Query finished in 0.000 second(s).

I think the subquery loads all of its results inside the main query's statement, which is why it is slow.
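On a small data set the two patterns return the same row. A sketch with Python's built-in sqlite3 module; the schema here is a stripped-down stand-in for the answer's tables (links_root_folders and the col1/col2 columns are omitted as an assumption). One caveat worth knowing: if the NOT IN subquery can return a NULL, the query yields no rows at all, while the LEFT JOIN anti-join still works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE links (id INTEGER PRIMARY KEY, col4 INTEGER);
CREATE TABLE processed (c_id INTEGER, c_type INTEGER);
""")
conn.executemany("insert into links values (?, ?)", [(1, 3), (2, 1), (3, 2)])
conn.executemany("insert into processed values (?, ?)", [(2, 4), (3, 5)])

# NOT IN form: exclude links already processed with c_type = 4
not_in = conn.execute(
    "SELECT l.id FROM links l "
    "WHERE l.id NOT IN (SELECT c_id FROM processed WHERE c_type = 4) "
    "ORDER BY l.col4 ASC LIMIT 1").fetchall()

# anti-join form: LEFT JOIN and keep only unmatched rows
anti_join = conn.execute(
    "SELECT l.id FROM links l "
    "LEFT JOIN processed p ON p.c_id = l.id AND p.c_type = 4 "
    "WHERE p.c_id IS NULL ORDER BY l.col4 ASC LIMIT 1").fetchall()

print(not_in, anti_join)  # both exclude id 2, the only row processed with c_type 4
```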
