How do I get a date_part query to hit an index?


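I have not been able to get this query to use an index instead of performing a full scan. I have another query that uses date_part('day', datelocal) against an almost identical table (same structure, just a little less data), and that one does hit the index I created on the datelocal column (a timestamp without time zone). The query (this one performs a parallel seq scan on the table and an in-memory quicksort):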

SELECT
    date_part('hour', datelocal) AS hour,
    SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)

Here is the other one, which hits my datelocal index:

SELECT
    date_part('day', datelocal) AS day,
    SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpressionday
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_trunc('day', datelocal), date_part('day', datelocal)
ORDER BY date_trunc('day', datelocal)

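I'm banging my head against this! Any ideas on how to speed up the first query, or at least get it to hit an index? I have tried creating an index on the datelocal field, a compound index on datelocal, gender, and views, and an expression index on date_part('hour', datelocal), but none of these made a difference.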

Schemas:

-- Table Definition ----------------------------------------------

CREATE TABLE reportimpression (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpression_datelocal_index ON reportimpression(datelocal timestamp_ops);
CREATE INDEX reportimpression_viewership_index ON reportimpression(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);
CREATE INDEX reportimpression_test_index ON reportimpression(datelocal timestamp_ops,(date_part('hour'::text, datelocal)) float8_ops);

-- Table Definition ----------------------------------------------

CREATE TABLE reportimpressionday (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpressionday_datelocal_index ON reportimpressionday(datelocal timestamp_ops);
CREATE INDEX reportimpressionday_detail_index ON reportimpressionday(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);

EXPLAIN (ANALYZE, BUFFERS) output:

Finalize GroupAggregate  (cost=999842.42..999859.67 rows=3137 width=24) (actual time=43754.700..43754.714 rows=24 loops=1)
  Group Key: (date_part('hour'::text, datelocal))
  Buffers: shared hit=123912 read=823290
  I/O Timings: read=81228.280
  ->  Sort  (cost=999842.42..999843.99 rows=3137 width=24) (actual time=43754.695..43754.698 rows=48 loops=1)
        Sort Key: (date_part('hour'::text, datelocal))
        Sort Method: quicksort  Memory: 28kB
        Buffers: shared hit=123912 read=823290
        I/O Timings: read=81228.280
        ->  Gather  (cost=999481.30..999805.98 rows=3137 width=24) (actual time=43754.520..43777.558 rows=48 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              Buffers: shared hit=123912 read=823290
              I/O Timings: read=81228.280
              ->  Partial HashAggregate  (cost=998481.30..998492.28 rows=3137 width=24) (actual time=43751.649..43751.672 rows=24 loops=2)
                    Group Key: date_part('hour'::text, datelocal)
                    Buffers: shared hit=123912 read=823290
                    I/O Timings: read=81228.280
                    ->  Parallel Seq Scan on reportimpression  (cost=0.00..991555.98 rows=2770129 width=17) (actual time=13.097..42974.126 rows=2338145 loops=2)
                          Filter: ((datelocal >= '2019-02-01 00:00:00'::timestamp without time zone) AND (datelocal < '2019-02-28 00:00:00'::timestamp without time zone))
                          Rows Removed by Filter: 6792750
                          Buffers: shared hit=123912 read=823290
                          I/O Timings: read=81228.280
Planning time: 0.185 ms
Execution time: 43777.701 ms

1 Answer

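Well, your two queries run against different tables (reportimpression vs. reportimpressionday), so comparing them is not really a like-for-like comparison. Did you ANALYZE both tables? Column statistics can also play a role. Index or table bloat may differ. Does a larger share of all rows qualify for February 2019 in one of them? And so on.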
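
A minimal sketch for refreshing and inspecting statistics on both tables. It uses only the table names from the question; pg_stat_user_tables is a standard Postgres statistics view, and the dead-row counts it reports are estimates, so treat them as a hint rather than a bloat measurement.

```sql
-- Refresh planner statistics for both tables.
ANALYZE reportimpression;
ANALYZE reportimpressionday;

-- Check when each table was last analyzed and roughly how many
-- live and dead rows the statistics collector has counted.
SELECT relname
     , last_analyze
     , last_autoanalyze
     , n_live_tup
     , n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname IN ('reportimpression', 'reportimpressionday');
```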

As a shot in the dark, compare the percentage of qualifying rows in both tables:

SELECT tbl, round(share * 100 / total, 2) As percentage
FROM  (
   SELECT text 'reportimpression' AS tbl
        , count(*)::numeric AS total
        , count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')::numeric AS share
   FROM  reportimpression

   UNION ALL
   SELECT 'reportimpressionday'
        , count(*)
        , count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')
   FROM  reportimpressionday
  ) sub;

Is the percentage for reportimpression bigger? Then it may simply exceed the share of rows for which the planner expects an index to help.

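Generally, your index reportimpression_datelocal_index on (datelocal) looks good for this, and reportimpression_viewership_index even allows index-only scans, provided autovacuum keeps up with the write load on the table. (Though impressions and agegroup are just dead freight for this query, and it would work even better without them.)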
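
As an illustration of an index without the dead freight, here is a sketch under stated assumptions: the index name reportimpression_hourly_idx is hypothetical, and the column list is simply what the hourly query reads. The question reports that a similar compound index had no effect, so this alone may not change the plan while a large share of rows qualifies.

```sql
-- Hypothetical index name; covers only the columns the hourly query reads.
CREATE INDEX reportimpression_hourly_idx
ON reportimpression (datelocal, gender, views);

-- Index-only scans depend on an up-to-date visibility map,
-- which VACUUM maintains; ANALYZE refreshes statistics in the same pass.
VACUUM (ANALYZE) reportimpression;
```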

Answer

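With my query you get 26.6 percent for reportimpression and 26.4 percent for reportimpressionday. For such a large percentage, indexes are typically not useful at all, and a sequential scan is typically the fastest way. Only index-only scans may still make sense, if the underlying table is much bigger than the index. (Or you have severe table bloat and less bloated indexes, which makes indexes more attractive again.)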
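
One rough way to compare the table with its indexes is to look at their on-disk sizes. This is only a sketch: the sizes hint at whether an index-only scan would read far fewer pages than a sequential scan, but they are not a direct bloat measurement. The object names are taken from the schema in the question.

```sql
-- Compare the heap size with the size of each candidate index.
SELECT pg_size_pretty(pg_relation_size('reportimpression'))                  AS table_size
     , pg_size_pretty(pg_relation_size('reportimpression_datelocal_index'))  AS datelocal_index_size
     , pg_size_pretty(pg_relation_size('reportimpression_viewership_index')) AS viewership_index_size;
```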

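Your first query may simply be just past the tipping point. Try narrowing the time frame until you see index-only scans. You will not see (bitmap) index scans once more than roughly 5 % of all rows qualify (the exact threshold depends on many factors).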
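
A sketch of how that experiment could look: the same hourly query restricted to a single day, wrapped in EXPLAIN. The one-day range is an arbitrary choice for illustration; widen it step by step to find where the plan flips. The second part uses the enable_seqscan planner setting purely as a diagnostic, inside a transaction so the setting does not leak into the session.

```sql
-- Narrow range: one day instead of the whole month.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_part('hour', datelocal)                AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-02-02'
GROUP  BY 1
ORDER  BY 1;

-- Diagnostic only: discourage sequential scans for one transaction
-- to see what the index plan would cost for the full month.
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   reportimpression
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-03-01';
ROLLBACK;
```

If the forced index plan turns out slower than the sequential scan, the planner's original choice was reasonable.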

Queries

Be that as it may, consider these modified queries:

SELECT date_part('hour', datelocal)                AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-03-01' -- '2019-02-28'  -- ?
GROUP  BY 1
ORDER  BY 1;

SELECT date_trunc('day', datelocal)                AS day
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpressionday
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-03-01'
GROUP  BY 1
ORDER  BY 1;

Major points

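  • When using a localized date format like '2-1-2019', go through to_timestamp() with explicit format specifiers. Otherwise the result depends on locale settings and may break (silently) when called from a session with different settings. Better yet, use ISO date / time formats as demonstrated, which do not depend on locale settings.
  • It looks like you want to include the whole month of February, but your query misses the upper bound. For one, February may have 29 days. And datelocal < '2-28-2019' also excludes all of February 28. Use datelocal < '2019-03-01' instead.
  • It is typically cheaper to group and order by the same expression you have in the SELECT list, if you can. So use date_trunc() there, too. Do not use different expressions without need. If you need the date part in the result, apply it to the grouped expression, like:
        SELECT date_part('day', date_trunc('day', datelocal)) AS day
        ...
        GROUP  BY date_trunc('day', datelocal)
        ORDER  BY date_trunc('day', datelocal);
    The code is a bit noisier, but faster (and possibly easier for the query planner to optimize, too).
  • Use the aggregate FILTER clause in Postgres 9.4 or later. It is cleaner and a bit faster. See: "How can I simplify this game statistics query?" and "For absolute performance, is SUM faster or COUNT?"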
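
The first two points can be sketched as follows. The format string 'MM-DD-YYYY' assumes the input really is month-day-year, which is how the question's literals were interpreted in the EXPLAIN output. Adding interval '1 month' to the lower bound is one way to get an exclusive upper bound that covers February in leap years as well.

```sql
-- Parse a month-day-year string explicitly instead of relying on
-- session settings (assumption: the input is month-day-year).
SELECT to_timestamp('2-1-2019', 'MM-DD-YYYY')::timestamp AS lower_bound;

-- Derive the exclusive upper bound from the lower bound,
-- so the whole month is covered regardless of its length.
SELECT date_part('hour', datelocal)                AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= timestamp '2019-02-01'
AND    datelocal <  timestamp '2019-02-01' + interval '1 month'
GROUP  BY 1
ORDER  BY 1;
```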