Updating the values of a column

Question · Votes: 0 · Answers: 2

I have the following tables:

SELECT * FROM labels LIMIT 5;
 user_id | session_id |    start_timestamp     |     end_timestamp      | travelmode 
---------+------------+------------------------+------------------------+------------
      11 |          0 | 2007-06-26 12:32:29+01 | 2007-06-26 12:40:29+01 | bus
      11 |          0 | 2008-03-31 17:00:08+01 | 2008-03-31 17:09:01+01 | taxi
      11 |          0 | 2008-04-01 01:48:32+01 | 2008-04-01 01:59:23+01 | taxi
      11 |          0 | 2008-04-01 02:00:22+01 | 2008-04-01 02:08:13+01 | walk
      11 |          0 | 2008-04-01 12:22:47+01 | 2008-04-01 12:28:39+01 | taxi
(5 rows)

SELECT * FROM trajectories LIMIT 5;
 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt  
---------+-------------------+------------------------+-----------+------------+------
      11 | 10020080330004134 | 2008-03-30 00:41:34+00 | 36.032647 | 103.850612 | -777
      11 | 10020080330004134 | 2008-03-30 00:42:05+00 | 36.031563 | 103.851273 | -777
      11 | 10020080330004134 | 2008-03-30 00:43:04+00 | 36.028623 | 103.853238 | -777
      11 | 10020080330004134 | 2008-03-30 00:44:03+00 | 36.027323 | 103.854475 | -777
      11 | 10020080330004134 | 2008-03-30 00:45:02+00 | 36.025775 | 103.854993 | -777
(5 rows)

I want to update the session_id column of the labels table (initially all zeros):

UPDATE labels
SET session_id=trajectories.session_id
FROM trajectories
WHERE 
  trajectories.user_id = labels.user_id
  AND trajectories.timestamp >= labels.start_timestamp 
  AND trajectories.timestamp <= labels.end_timestamp;

UPDATE 4500

However, after the query had run for about 5 minutes, not all rows of the labels table were updated (only about 30%):

 SELECT COUNT(*) FROM labels WHERE session_id=0;
 count 
-------
 10217
(1 row)

In case it helps, here are more details about the tables:

\d labels
                             Table "akil.labels"
     Column      |           Type           | Collation | Nullable | Default 
-----------------+--------------------------+-----------+----------+---------
 user_id         | integer                  |           | not null | 
 session_id      | bigint                   |           |          | 
 start_timestamp | timestamp with time zone |           | not null | 
 end_timestamp   | timestamp with time zone |           | not null | 
 travelmode      | text                     |           |          | 
Indexes:
    "mode_pkey" PRIMARY KEY, btree (user_id, start_timestamp, end_timestamp)

 \d trajectories 
                       Table "akil.trajectories"
   Column   |           Type           | Collation | Nullable | Default 
------------+--------------------------+-----------+----------+---------
 user_id    | integer                  |           |          | 
 session_id | bigint                   |           | not null | 
 timestamp  | timestamp with time zone |           | not null | 
 lat        | double precision         |           | not null | 
 lon        | double precision         |           | not null | 
 alt        | double precision         |           |          | 
Indexes:
    "trajectory_pkey" PRIMARY KEY, btree (session_id, "timestamp", lat, lon)

Edit

I added an index on the trajectories table:

 CREATE INDEX idx_trecj ON trajectories (user_id, timestamp);
CREATE INDEX

UPDATE labels
SET session_id=trajectories.session_id
FROM trajectories
WHERE 
  trajectories.user_id = labels.user_id
  AND trajectories.timestamp >= labels.start_timestamp 
  AND trajectories.timestamp <= labels.end_timestamp;

UPDATE 4500

 SELECT COUNT(*) FROM labels WHERE session_id=0;
 count 
-------
 10217
(1 row)

However, still not every session_id in labels gets updated (same result as the initial run).

sql postgresql
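2 Answers

0 votes

For trajectories, you want an index on (user_id, timestamp).

That should help performance.

You also have the issue that there may be multiple matches. In that case, your current update does more work than necessary. But the index should be a good first step.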
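
To see whether multiple matches are actually an issue, a query along these lines counts how many trajectories rows fall inside each labels interval. It is only a sketch against the tables as posted in the question; the aliases, the `matches` column name, and the `LIMIT` are arbitrary choices for illustration.

```sql
-- Sketch: list the label rows that match more than one trajectory row.
-- Grouping by the primary key of labels (user_id, start_timestamp, end_timestamp).
SELECT l.user_id,
       l.start_timestamp,
       l.end_timestamp,
       count(*)                     AS matches,           -- matching trajectory points
       count(DISTINCT t.session_id) AS distinct_sessions  -- candidate session_id values
FROM   labels l
JOIN   trajectories t
       ON  t.user_id = l.user_id
       AND t.timestamp >= l.start_timestamp
       AND t.timestamp <= l.end_timestamp
GROUP  BY l.user_id, l.start_timestamp, l.end_timestamp
HAVING count(*) > 1
ORDER  BY matches DESC
LIMIT  10;
```

If `distinct_sessions` is greater than 1 for some rows, the plain `UPDATE ... FROM` has more than one candidate value per label and which one ends up stored is not predictable.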


0 votes

search_path

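Your table definition reveals:

Table "akil.labels"

So the tables live in the schema akil. Is the search_path always set properly, so that you can afford to omit schema-qualification in your queries? Otherwise you may have updated the wrong table by accident, or from the wrong table, which would explain your results.

But there are other possible explanations.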
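
To rule out the search_path explanation first, something like the following could be run in the same session that issues the `UPDATE`. This is a sketch; the `public` schema in the last statement is an assumption about the setup, not something shown in the question.

```sql
-- Which schemas are searched for unqualified names, and which one comes first?
SHOW search_path;
SELECT current_schema();

-- Is there more than one table with these names across schemas?
SELECT schemaname, tablename
FROM   pg_catalog.pg_tables
WHERE  tablename IN ('labels', 'trajectories')
ORDER  BY tablename, schemaname;

-- Optionally pin the search path for this session (assumes a schema "public" exists).
SET search_path = akil, public;
```

If the second query returns more than one row per table name, schema-qualifying the tables in the `UPDATE` (akil.labels, akil.trajectories) removes the ambiguity.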

No match in akil.trajectories

What makes you think every row in labels has an applicable row in trajectories?

Check how many have none:

SELECT count(*) AS no_match_in_trajectories
FROM   akil.labels l
WHERE  NOT EXISTS (
   SELECT FROM akil.trajectories t
   WHERE  t.user_id = l.user_id
   AND    t.timestamp >= l.start_timestamp
   AND    t.timestamp <= l.end_timestamp
   );

Multiple matches in trajectories

Related: what makes you think there would be exactly one applicable row? If there are multiple applicable rows in akil.trajectories, the UPDATE produces arbitrary results (in an expensive fashion).

Given your table definitions, this would be a proper way to do it:

UPDATE akil.labels l
SET    session_id = t.session_id
FROM   akil.labels l1
CROSS  JOIN LATERAL (
   SELECT t1.session_id
   FROM   akil.trajectories t1
   WHERE  t1.user_id = l1.user_id
   AND    t1.timestamp >= l1.start_timestamp
   AND    t1.timestamp <= l1.end_timestamp
   ORDER  BY t1.timestamp  -- pick the earliest matching entry
   LIMIT  1
   ) t
WHERE (l.user_id, l.start_timestamp, l.end_timestamp)      -- PK
    = (l1.user_id, l1.start_timestamp, l1.end_timestamp)
AND    l.session_id IS DISTINCT FROM t.session_id;  -- avoids empty updates

About avoiding empty updates:

  • How do I (or can I) SELECT DISTINCT on multiple columns?

The perfect index would be:

CREATE INDEX foo ON akil.trajectories (user_id, timestamp) INCLUDE (session_id);

INCLUDE requires Postgres 11 or later.
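
To check whether an index on (user_id, timestamp) is actually used for the per-label lookup, one option is to run the inner query on its own under `EXPLAIN`. The sketch below plugs in the values of one sample row from the labels output in the question; any other row would work just as well.

```sql
-- Sketch: inspect the plan for a single label interval.
-- The literals are taken from the third sample row of labels shown in the question.
EXPLAIN (ANALYZE)
SELECT t1.session_id
FROM   akil.trajectories t1
WHERE  t1.user_id = 11
AND    t1.timestamp >= '2008-04-01 01:48:32+01'
AND    t1.timestamp <= '2008-04-01 01:59:23+01'
ORDER  BY t1.timestamp
LIMIT  1;
```

A plan that shows an index scan (or an index-only scan, if the covering index exists and the table has been vacuumed recently) suggests the lookup itself is cheap. A sequential scan on trajectories would point to the index not being usable for this query.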

Asides

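Your whole setup looks suspiciously messy.

  • The name of the PK of labels: mode_pkey. Rather misleading; the table seems to have been renamed? Or is there more to it?

  • The PK of trajectories:

    "trajectory_pkey" PRIMARY KEY, btree (session_id, "timestamp", lat, lon)

    Again, a non-default name. More importantly, a PK spanning 4 columns is suspicious. And lat and lon are floating point numbers. That is subject to rounding errors across repeated operations, which leads to all kinds of confusion.

  • Like I commented: if [...], then the session_id column should probably be [...].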
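
The floating-point concern about lat and lon in the primary key can be illustrated with a generic example. It is unrelated to the actual data in the question and only shows why equality on double precision values is fragile compared to numeric.

```sql
-- Sketch: equality comparison after arithmetic, double precision vs. numeric.
SELECT 0.1::float8  + 0.2::float8  = 0.3::float8  AS float_equal,   -- false
       0.1::numeric + 0.2::numeric = 0.3::numeric AS numeric_equal; -- true
```

A value that has passed through any computation or conversion may therefore no longer compare equal to the stored key, which is what makes floating point columns an awkward choice for a primary key.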
