I have the following tables:
SELECT * FROM labels LIMIT 5;
user_id | session_id | start_timestamp | end_timestamp | travelmode
---------+------------+------------------------+------------------------+------------
11 | 0 | 2007-06-26 12:32:29+01 | 2007-06-26 12:40:29+01 | bus
11 | 0 | 2008-03-31 17:00:08+01 | 2008-03-31 17:09:01+01 | taxi
11 | 0 | 2008-04-01 01:48:32+01 | 2008-04-01 01:59:23+01 | taxi
11 | 0 | 2008-04-01 02:00:22+01 | 2008-04-01 02:08:13+01 | walk
11 | 0 | 2008-04-01 12:22:47+01 | 2008-04-01 12:28:39+01 | taxi
(5 rows)
SELECT * FROM trajectories LIMIT 5;
user_id | session_id | timestamp | lat | lon | alt
---------+-------------------+------------------------+-----------+------------+------
11 | 10020080330004134 | 2008-03-30 00:41:34+00 | 36.032647 | 103.850612 | -777
11 | 10020080330004134 | 2008-03-30 00:42:05+00 | 36.031563 | 103.851273 | -777
11 | 10020080330004134 | 2008-03-30 00:43:04+00 | 36.028623 | 103.853238 | -777
11 | 10020080330004134 | 2008-03-30 00:44:03+00 | 36.027323 | 103.854475 | -777
11 | 10020080330004134 | 2008-03-30 00:45:02+00 | 36.025775 | 103.854993 | -777
(5 rows)
So I want to update the session_id column of the labels table (initially all zeros):
UPDATE labels
SET session_id=trajectories.session_id
FROM trajectories
WHERE
trajectories.user_id = labels.user_id
AND trajectories.timestamp >= labels.start_timestamp
AND trajectories.timestamp <= labels.end_timestamp;
UPDATE 4500
However, after the query had run for about 5 minutes, not all rows of the labels table were updated (only about 30%):
SELECT COUNT(*) FROM labels WHERE session_id=0;
count
-------
10217
(1 row)
In case it helps, here is more information about the tables:
\d labels
Table "akil.labels"
Column | Type | Collation | Nullable | Default
-----------------+--------------------------+-----------+----------+---------
user_id | integer | | not null |
session_id | bigint | | |
start_timestamp | timestamp with time zone | | not null |
end_timestamp | timestamp with time zone | | not null |
travelmode | text | | |
Indexes:
"mode_pkey" PRIMARY KEY, btree (user_id, start_timestamp, end_timestamp)
\d trajectories
Table "akil.trajectories"
Column | Type | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
user_id | integer | | |
session_id | bigint | | not null |
timestamp | timestamp with time zone | | not null |
lat | double precision | | not null |
lon | double precision | | not null |
alt | double precision | | |
Indexes:
"trajectory_pkey" PRIMARY KEY, btree (session_id, "timestamp", lat, lon)
Edit
I added an index to the trajectories table:
CREATE INDEX idx_trecj ON trajectories (user_id, timestamp);
CREATE INDEX
UPDATE labels
SET session_id=trajectories.session_id
FROM trajectories
WHERE
trajectories.user_id = labels.user_id
AND trajectories.timestamp >= labels.start_timestamp
AND trajectories.timestamp <= labels.end_timestamp;
UPDATE 4500
SELECT COUNT(*) FROM labels WHERE session_id=0;
count
-------
10217
(1 row)
However, still not all session_id values in labels are updated (same result as the initial attempt).
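For trajectories, you want an index on (user_id, timestamp). That should help your performance.
You also have the problem that there can be multiple matches. In that case, your current update has to do more work than necessary. But the index should be a good first step.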
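To see how often a label window matches more than one trajectory row, a diagnostic along these lines may help. This is only a sketch: it uses the unqualified table names from the question, and the output column names matching_points and matching_sessions are made up for illustration.

```sql
-- For each label (identified by its primary key), count the trajectory
-- points that fall inside its time window and how many distinct sessions
-- those points belong to. Only labels with more than one match are listed.
SELECT l.user_id, l.start_timestamp, l.end_timestamp
     , count(*)                     AS matching_points    -- illustrative name
     , count(DISTINCT t.session_id) AS matching_sessions  -- illustrative name
FROM   labels l
JOIN   trajectories t ON t.user_id    = l.user_id
                     AND t.timestamp >= l.start_timestamp
                     AND t.timestamp <= l.end_timestamp
GROUP  BY l.user_id, l.start_timestamp, l.end_timestamp
HAVING count(*) > 1
ORDER  BY matching_points DESC
LIMIT  20;
```

If matching_sessions is greater than 1 for some labels, a plain UPDATE ... FROM has no defined way to choose between the candidate session_id values.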
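search_path
Your table definition reveals:
Table "akil.labels"
So the table lives in the schema akil. Is the search_path set properly at all times, so that you can afford to omit schema-qualification in your queries? If not, you may be updating the wrong table by accident, or updating from the wrong table, which could explain your result.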
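One way to check this, as a rough sketch: inspect the current search_path, look for same-named tables in other schemas, and then either set the path or schema-qualify. The schema name public in the SET statement is an assumption; only akil is confirmed by the \d output.

```sql
-- What does the current session resolve unqualified names against?
SHOW search_path;
SELECT current_schema();

-- Do tables named labels or trajectories exist in more than one schema?
SELECT schemaname, tablename
FROM   pg_catalog.pg_tables
WHERE  tablename IN ('labels', 'trajectories');

-- Either set the path for this session (public is an assumption) ...
SET search_path = akil, public;

-- ... or schema-qualify explicitly and rule out any ambiguity.
SELECT count(*) FROM akil.labels WHERE session_id = 0;
```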
But there are other possible explanations.
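No match in trajectories
What makes you think there is an applicable row in trajectories for every row in labels? Many have none:
SELECT count(*) AS no_match_in_trajectories
FROM   akil.labels l
WHERE  NOT EXISTS (
   SELECT FROM akil.trajectories t
   WHERE  t.user_id = l.user_id
   AND    t.timestamp >= l.start_timestamp
   AND    t.timestamp <= l.end_timestamp
   );
Multiple matches in trajectories
Similarly: what makes you think there is exactly one applicable row? If there are multiple applicable rows in akil.trajectories, the UPDATE produces arbitrary results (in an expensive fashion). Given your table definitions, this would be a proper way to do it: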
UPDATE akil.labels l
SET    session_id = t.session_id
FROM   akil.labels l1
CROSS  JOIN LATERAL (
   SELECT t1.session_id
   FROM   akil.trajectories t1
   WHERE  t1.user_id = l1.user_id
   AND    t1.timestamp >= l1.start_timestamp
   AND    t1.timestamp <= l1.end_timestamp
   ORDER  BY t1.timestamp  -- pick the earliest matching entry
   LIMIT  1
   ) t
WHERE  (l.user_id, l.start_timestamp, l.end_timestamp)  -- PK
     = (l1.user_id, l1.start_timestamp, l1.end_timestamp)
AND    l.session_id IS DISTINCT FROM t.session_id;  -- avoids empty updates
About avoidable empty updates, see:
How do I (or can I) SELECT DISTINCT on multiple columns?
The perfect index would be:
CREATE INDEX foo ON akil.trajectories (user_id, timestamp) INCLUDE (session_id);
INCLUDE requires Postgres 11 or later.
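To check whether the planner actually uses such an index for the per-label lookup, one could run EXPLAIN on the inner query for a single label. This is a sketch only: the literal values are copied from the third sample row of labels in the question, and it assumes the covering index has been created. An index-only scan also depends on the table having been vacuumed recently, so the plan may show a regular index scan instead.

```sql
-- Inspect the plan for the lookup that the LATERAL subquery performs
-- once per label row. The literals come from one sample row in labels.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.session_id
FROM   akil.trajectories t
WHERE  t.user_id = 11
AND    t.timestamp >= '2008-04-01 01:48:32+01'
AND    t.timestamp <= '2008-04-01 01:59:23+01'
ORDER  BY t.timestamp
LIMIT  1;
```

A plan that shows a sequential scan on trajectories here would suggest the index is missing, lives in a different schema, or is not considered usable by the planner.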
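Your whole setup looks suspiciously messy.
mode_pkey is the name of the PK of labels. Quite misleading; it seems the table was renamed? Or is there something else going on?
PK of trajectories:
"trajectory_pkey" PRIMARY KEY, btree (session_id, "timestamp", lat, lon)
Again, a non-default name. To begin with, a PK spanning 4 columns is suspicious. And lat and lon are floating point numbers, which are subject to rounding errors under repeated operations, leading to all kinds of confusion.
And like I commented: if session_id is [...], the column should probably be [...].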