Pivot on Multiple Columns using Tablefunc


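Has anyone used tablefunc to pivot on multiple variables, as opposed to only using the row name?

I need to do this for billions of rows, and I am testing whether to store the data in long or wide format, to see whether tablefunc can convert from long to wide format more efficiently than regular aggregate functions. I will have about 100 measurements per minute for around 300 entities. We often need to compare the different measurements taken for a given entity in a given second, so we will need the wide format frequently. In addition, the set of measurements taken on a particular entity varies considerably.

As example data, I adapted the data used in the answer to this question: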

 CREATE TEMP TABLE t4 (
  timeof   timestamp
 ,entity    character
 ,status    integer
 ,ct        integer);

 INSERT INTO t4 VALUES 
  ('2012-01-01', 'a', 1, 1)
 ,('2012-01-01', 'a', 0, 2)
 ,('2012-01-02', 'b', 1, 3)
 ,('2012-01-02', 'c', 0, 4);

 SELECT * FROM crosstab(
     'SELECT timeof, entity, status, ct
      FROM   t4
      ORDER  BY 1,2,3'
     ,$$VALUES (1::text), (0::text)$$)
 AS ct ("Section" timestamp, "Attribute" character, "1" int, "0" int);

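Returns:

 Section             | Attribute | 1 | 0
---------------------+-----------+---+---
 2012-01-01 00:00:00 | a         | 1 | 2
 2012-01-02 00:00:00 | b         | 3 | 4

So, as the documentation states, the extra column (aka "Attribute") is assumed to be the same for each row name (aka "Section"). As a result, the second row reports b, even though "entity" also has the value 'c' for that 'timeof' value.

Desired output:

 Section             | Attribute | 1 | 0
---------------------+-----------+---+---
 2012-01-01 00:00:00 | a         | 1 | 2
 2012-01-02 00:00:00 | b         | 3 |
 2012-01-02 00:00:00 | c         |   | 4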


How can I do this?

sql postgresql pivot pivot-table
3 Answers
19
votes

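The problem with your query is that b and c share the same timestamp 2012-01-02 00:00:00, and the timestamp column timeof comes first in your query. So, even though you added bold emphasis, b and c are just extra columns that fall into the same group 2012-01-02 00:00:00. Only the first (b) is returned since (quoting the manual):

    The row_name column must be first. The category and value columns must be the last two columns, in that order. Any columns between row_name and category are treated as "extra". The "extra" columns are expected to be the same for all rows with the same row_name value.

Bold emphasis mine.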

Just swap the order of the first two columns to make entity the row name, and it works as desired:

 SELECT * FROM crosstab(
     'SELECT entity, timeof, status, ct
      FROM   t4
      ORDER  BY 1'
     ,'VALUES (1), (0)')
 AS ct (
     "Attribute" character
    ,"Section" timestamp
    ,"status_1" int
    ,"status_0" int);

entity must be unique, of course.

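Reiterate

  • row_name first
  • (optional) extra columns next
  • category (matched by the second parameter) and value last

Extra columns are filled from the first row of each row_name partition. Values from other rows are ignored; each extra column holds a single value per row_name. Typically those values are the same for every row of one row_name, but that is up to you.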
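To make the rules above concrete, here is a minimal sketch that produces the desired output from the question by using a surrogate row name. It assumes the tablefunc extension can be installed and that the question's table t4 (with entity declared as character) exists; the output column names are only illustrative.

```sql
-- Sketch only: assumes t4 from the question and permission to create the extension.
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT timeof, entity, status_1, status_0
FROM   crosstab(
   $$SELECT dense_rank() OVER (ORDER BY timeof, entity)::int AS row_name  -- surrogate row name
          , timeof, entity   -- "extra" columns, constant within each row_name
          , status, ct       -- category and value must come last
     FROM   t4
     ORDER  BY 1$$
 , $$VALUES (1), (0)$$        -- category list: one output column per status
   ) AS ct (row_name int, timeof timestamp, entity character
          , status_1 int, status_0 int)
ORDER  BY row_name;
```

Because every (timeof, entity) pair gets its own row_name, b and c should no longer collapse into a single row, and the extra columns are trivially the same within each group.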

Basics:

  • PostgreSQL Crosstab Query

For the different setup in your answer:

 SELECT localt, entity
      , msrmnt01, msrmnt02, msrmnt03, msrmnt04, msrmnt05  -- , more?
 FROM   crosstab(
      'SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
            , localt, entity -- additional columns
            , msrmnt, val
       FROM   test
       -- WHERE  ???   -- instead of LIMIT at the end
       ORDER  BY localt, entity, msrmnt
       -- LIMIT ???'   -- instead of LIMIT at the end
      , 'SELECT generate_series(1,5)'  -- more?
      ) AS ct (row_name int, localt timestamp, entity int
             , msrmnt01 float8, msrmnt02 float8, msrmnt03 float8, msrmnt04 float8, msrmnt05 float8 -- , more?
              )
 LIMIT 1000  -- ?!
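No wonder the queries in your test perform terribly. Your test setup has 14M rows, and you process all of them before throwing most of the result away with LIMIT 1000. To get a reduced result set, add WHERE conditions or a LIMIT to the source query instead.

On top of that, the array you work with is needlessly expensive. I generate a surrogate row name with dense_rank() instead.

db<>fiddle here (with a simpler test setup and fewer rows).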
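As an illustration of filtering in the source query, the following sketch restricts the input to a short time window before crosstab ever sees it. The five-minute window and the choice of three measurements are hypothetical; the table is assumed to be public.test as created in the test setup of the other answer.

```sql
-- Sketch only: the time window and the three categories are arbitrary choices.
SELECT localt, entity, msrmnt01, msrmnt02, msrmnt03
FROM   crosstab(
   $$SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
          , localt, entity
          , msrmnt, val
     FROM   public.test
     WHERE  localt >= '2012-01-01 00:00'
     AND    localt <  '2012-01-01 00:05'  -- filter here, not with LIMIT outside
     AND    msrmnt <= 3                   -- skip categories that are not output anyway
     ORDER  BY 1, msrmnt$$
 , $$SELECT generate_series(1,3)$$
   ) AS ct (row_name int, localt timestamp, entity int
          , msrmnt01 float8, msrmnt02 float8, msrmnt03 float8);
```

With the filter inside the dollar-quoted source query, only the matching rows are sorted and pivoted, and an index on localt has a chance of being used.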

13
votes

In my original question I should have used this for my sample data:

 CREATE TEMP TABLE t4 (
   timeof   date
  ,entity   integer
  ,status   integer
  ,ct       integer);

 INSERT INTO t4 VALUES
   ('2012-01-01', 1, 1, 1)
  ,('2012-01-01', 1, 0, 2)
  ,('2012-01-01', 3, 0, 3)
  ,('2012-01-02', 2, 1, 4)
  ,('2012-01-02', 3, 1, 5)
  ,('2012-01-02', 3, 0, 6);

With this data, I have to pivot on both timeof and entity. Since tablefunc uses only one column as the row name, you need to find a way to pack both dimensions into that column (http://www.postgresonline.com/journal/categories/24-tablefunc). I went with an array, just like the example in that link.

 SELECT (timestamp 'epoch' + row_name[1] * INTERVAL '1 second')::date as localt,
        row_name[2] As entity, status1, status0
 FROM crosstab(
      'SELECT ARRAY[extract(epoch from timeof), entity] as row_name,
              status, ct
       FROM t4
       ORDER BY timeof, entity, status'
     ,$$VALUES (1::text), (0::text)$$)
 as ct (row_name integer[], status1 int, status0 int)

FWIW, I also tried a character array, and so far that looks faster for my setup (PostgreSQL 9.2.3).
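The character-array query itself is not shown, so the following is only a guess at what such a variant might look like: both key parts are cast to text, packed into a text[] row name, and cast back in the outer SELECT. It assumes the revised t4 above (timeof date, entity integer).

```sql
-- Sketch only: one possible text[] variant, not necessarily the query that was timed.
SELECT row_name[1]::date AS localt   -- unpack the first key part
     , row_name[2]::int  AS entity   -- unpack the second key part
     , status1, status0
FROM   crosstab(
   $$SELECT ARRAY[timeof::text, entity::text] AS row_name
          , status, ct
     FROM   t4
     ORDER  BY timeof, entity, status$$
 , $$VALUES (1::text), (0::text)$$
   ) AS ct (row_name text[], status1 int, status0 int);
```

This avoids the epoch arithmetic on the way in and out, at the cost of text casts on both sides.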

This is the result, which is also the desired output:

   localt   | entity | status1 | status0
------------+--------+---------+---------
 2012-01-01 |      1 |       1 |       2
 2012-01-01 |      3 |         |       3
 2012-01-02 |      2 |       4 |
 2012-01-02 |      3 |       5 |       6

I'm curious how this performs on a much larger data set and will report back later.
    

2
votes

OK, so I ran this on a table closer to my use case. Either I am doing it wrong, or crosstab is not suitable for my use.

First I made some similar data:

 CREATE TABLE public.test (
     id serial primary key,
     msrmnt integer,
     entity integer,
     localt timestamp,
     val double precision
 );

 CREATE INDEX ix_test_msrmnt ON public.test (msrmnt);
 CREATE INDEX ix_public_test_201201_entity ON public.test (entity);
 CREATE INDEX ix_public_test_201201_localt ON public.test (localt);

 insert into public.test (msrmnt, entity, localt, val)
 select * from (
     SELECT msrmnt, entity, localt, random() as val
     FROM generate_series('2012-01-01'::timestamp, '2012-01-01 23:59:00'::timestamp, interval '1 minutes') as localt
     join (select * FROM generate_series(1, 50, 1) as msrmnt) as msrmnt on 1=1
     join (select * FROM generate_series(1, 200, 1) as entity) as entity on 1=1
 ) as data;

Then I ran the crosstab code a couple of times:

 explain analyze
 SELECT (timestamp 'epoch' + row_name[1] * INTERVAL '1 second')::date As localt,
        row_name[2] as entity
   ,msrmnt01,msrmnt02,msrmnt03,msrmnt04,msrmnt05,msrmnt06,msrmnt07,msrmnt08,msrmnt09,msrmnt10
   ,msrmnt11,msrmnt12,msrmnt13,msrmnt14,msrmnt15,msrmnt16,msrmnt17,msrmnt18,msrmnt19,msrmnt20
   ,msrmnt21,msrmnt22,msrmnt23,msrmnt24,msrmnt25,msrmnt26,msrmnt27,msrmnt28,msrmnt29,msrmnt30
   ,msrmnt31,msrmnt32,msrmnt33,msrmnt34,msrmnt35,msrmnt36,msrmnt37,msrmnt38,msrmnt39,msrmnt40
   ,msrmnt41,msrmnt42,msrmnt43,msrmnt44,msrmnt45,msrmnt46,msrmnt47,msrmnt48,msrmnt49,msrmnt50
 FROM crosstab(
      'SELECT ARRAY[extract(epoch from localt), entity] as row_name, msrmnt, val
       FROM public.test
       ORDER BY localt, entity, msrmnt'
     ,$$VALUES ( 1::text),( 2::text),( 3::text),( 4::text),( 5::text),( 6::text),( 7::text),( 8::text),( 9::text),(10::text)
              ,(11::text),(12::text),(13::text),(14::text),(15::text),(16::text),(17::text),(18::text),(19::text),(20::text)
              ,(21::text),(22::text),(23::text),(24::text),(25::text),(26::text),(27::text),(28::text),(29::text),(30::text)
              ,(31::text),(32::text),(33::text),(34::text),(35::text),(36::text),(37::text),(38::text),(39::text),(40::text)
              ,(41::text),(42::text),(43::text),(44::text),(45::text),(46::text),(47::text),(48::text),(49::text),(50::text)$$)
 as ct (row_name integer[]
   ,msrmnt01 double precision, msrmnt02 double precision, msrmnt03 double precision, msrmnt04 double precision, msrmnt05 double precision
   ,msrmnt06 double precision, msrmnt07 double precision, msrmnt08 double precision, msrmnt09 double precision, msrmnt10 double precision
   ,msrmnt11 double precision, msrmnt12 double precision, msrmnt13 double precision, msrmnt14 double precision, msrmnt15 double precision
   ,msrmnt16 double precision, msrmnt17 double precision, msrmnt18 double precision, msrmnt19 double precision, msrmnt20 double precision
   ,msrmnt21 double precision, msrmnt22 double precision, msrmnt23 double precision, msrmnt24 double precision, msrmnt25 double precision
   ,msrmnt26 double precision, msrmnt27 double precision, msrmnt28 double precision, msrmnt29 double precision, msrmnt30 double precision
   ,msrmnt31 double precision, msrmnt32 double precision, msrmnt33 double precision, msrmnt34 double precision, msrmnt35 double precision
   ,msrmnt36 double precision, msrmnt37 double precision, msrmnt38 double precision, msrmnt39 double precision, msrmnt40 double precision
   ,msrmnt41 double precision, msrmnt42 double precision, msrmnt43 double precision, msrmnt44 double precision, msrmnt45 double precision
   ,msrmnt46 double precision, msrmnt47 double precision, msrmnt48 double precision, msrmnt49 double precision, msrmnt50 double precision)
 limit 1000

On the third try I got this result:

 QUERY PLAN
 Limit  (cost=0.00..20.00 rows=1000 width=432) (actual time=110236.673..110237.667 rows=1000 loops=1)
   ->  Function Scan on crosstab ct  (cost=0.00..20.00 rows=1000 width=432) (actual time=110236.672..110237.598 rows=1000 loops=1)
 Total runtime: 110699.598 ms

Then I ran the standard solution a couple of times:

 explain analyze
 select localt, entity,
   max(case when msrmnt =  1 then val else null end) as msrmnt01
  ,max(case when msrmnt =  2 then val else null end) as msrmnt02
  ,max(case when msrmnt =  3 then val else null end) as msrmnt03
  ,max(case when msrmnt =  4 then val else null end) as msrmnt04
  ,max(case when msrmnt =  5 then val else null end) as msrmnt05
  ,max(case when msrmnt =  6 then val else null end) as msrmnt06
  ,max(case when msrmnt =  7 then val else null end) as msrmnt07
  ,max(case when msrmnt =  8 then val else null end) as msrmnt08
  ,max(case when msrmnt =  9 then val else null end) as msrmnt09
  ,max(case when msrmnt = 10 then val else null end) as msrmnt10
  ,max(case when msrmnt = 11 then val else null end) as msrmnt11
  ,max(case when msrmnt = 12 then val else null end) as msrmnt12
  ,max(case when msrmnt = 13 then val else null end) as msrmnt13
  ,max(case when msrmnt = 14 then val else null end) as msrmnt14
  ,max(case when msrmnt = 15 then val else null end) as msrmnt15
  ,max(case when msrmnt = 16 then val else null end) as msrmnt16
  ,max(case when msrmnt = 17 then val else null end) as msrmnt17
  ,max(case when msrmnt = 18 then val else null end) as msrmnt18
  ,max(case when msrmnt = 19 then val else null end) as msrmnt19
  ,max(case when msrmnt = 20 then val else null end) as msrmnt20
  ,max(case when msrmnt = 21 then val else null end) as msrmnt21
  ,max(case when msrmnt = 22 then val else null end) as msrmnt22
  ,max(case when msrmnt = 23 then val else null end) as msrmnt23
  ,max(case when msrmnt = 24 then val else null end) as msrmnt24
  ,max(case when msrmnt = 25 then val else null end) as msrmnt25
  ,max(case when msrmnt = 26 then val else null end) as msrmnt26
  ,max(case when msrmnt = 27 then val else null end) as msrmnt27
  ,max(case when msrmnt = 28 then val else null end) as msrmnt28
  ,max(case when msrmnt = 29 then val else null end) as msrmnt29
  ,max(case when msrmnt = 30 then val else null end) as msrmnt30
  ,max(case when msrmnt = 31 then val else null end) as msrmnt31
  ,max(case when msrmnt = 32 then val else null end) as msrmnt32
  ,max(case when msrmnt = 33 then val else null end) as msrmnt33
  ,max(case when msrmnt = 34 then val else null end) as msrmnt34
  ,max(case when msrmnt = 35 then val else null end) as msrmnt35
  ,max(case when msrmnt = 36 then val else null end) as msrmnt36
  ,max(case when msrmnt = 37 then val else null end) as msrmnt37
  ,max(case when msrmnt = 38 then val else null end) as msrmnt38
  ,max(case when msrmnt = 39 then val else null end) as msrmnt39
  ,max(case when msrmnt = 40 then val else null end) as msrmnt40
  ,max(case when msrmnt = 41 then val else null end) as msrmnt41
  ,max(case when msrmnt = 42 then val else null end) as msrmnt42
  ,max(case when msrmnt = 43 then val else null end) as msrmnt43
  ,max(case when msrmnt = 44 then val else null end) as msrmnt44
  ,max(case when msrmnt = 45 then val else null end) as msrmnt45
  ,max(case when msrmnt = 46 then val else null end) as msrmnt46
  ,max(case when msrmnt = 47 then val else null end) as msrmnt47
  ,max(case when msrmnt = 48 then val else null end) as msrmnt48
  ,max(case when msrmnt = 49 then val else null end) as msrmnt49
  ,max(case when msrmnt = 50 then val else null end) as msrmnt50
 from sample
 group by localt, entity
 limit 1000

On the third try I got this result:

 QUERY PLAN
 Limit  (cost=2257339.69..2270224.77 rows=1000 width=24) (actual time=19795.984..20090.626 rows=1000 loops=1)
   ->  GroupAggregate  (cost=2257339.69..5968242.35 rows=288000 width=24) (actual time=19795.983..20090.496 rows=1000 loops=1)
         ->  Sort  (cost=2257339.69..2293339.91 rows=14400088 width=24) (actual time=19795.626..19808.820 rows=50001 loops=1)
               Sort Key: localt
               Sort Method: external merge  Disk: 478568kB
               ->  Seq Scan on sample  (cost=0.00..249883.88 rows=14400088 width=24) (actual time=0.013..2245.247 rows=14400000 loops=1)
 Total runtime: 20197.565 ms
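For readers on newer servers, the same conditional aggregation can be written more compactly with the aggregate FILTER clause. This is only a sketch: FILTER requires PostgreSQL 9.4 or later, so it would not run on the 9.2.3 instance used here. The table name public.test and the five-minute window are assumptions made for illustration, and only three of the fifty measurements are shown.

```sql
-- Sketch only: needs PostgreSQL 9.4+; the time window is a hypothetical filter.
SELECT localt, entity
     , max(val) FILTER (WHERE msrmnt = 1) AS msrmnt01
     , max(val) FILTER (WHERE msrmnt = 2) AS msrmnt02
     , max(val) FILTER (WHERE msrmnt = 3) AS msrmnt03
FROM   public.test
WHERE  localt >= '2012-01-01 00:00'
AND    localt <  '2012-01-01 00:05'   -- restrict rows before grouping
GROUP  BY localt, entity
ORDER  BY localt, entity;
```

Restricting rows with WHERE before grouping should also avoid sorting all 14.4M rows, which is where the plan above spills to disk.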

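So, in my case, crosstab is not a solution so far. And this is just one day of data, when I will have multiple years. In fact, I will probably have to go with wide-format (non-normalized) tables, even though the measurements taken on entities vary and new measurements get introduced, but I won't go into that here.

Here are some of my settings, using Postgres 9.2.3: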

 name                  setting
 max_connections       100
 shared_buffers        2097152
 effective_cache_size  6291456
 maintenance_work_mem  1048576
 work_mem              262144
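The raw numbers are hard to read without their units. Assuming they were taken from the pg_settings view, a query like the following also returns the unit of each setting:

```sql
-- Sketch only: lists the same settings together with their units.
SELECT name, setting, unit
FROM   pg_settings
WHERE  name IN ('max_connections', 'shared_buffers', 'effective_cache_size',
                'maintenance_work_mem', 'work_mem')
ORDER  BY name;
```

On a default build, shared_buffers and effective_cache_size are counted in 8kB pages and the two work_mem settings in kB. If that holds here, the values would correspond to roughly 16GB of shared_buffers, 48GB of effective_cache_size, 1GB of maintenance_work_mem and 256MB of work_mem.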

