ClickHouse SummingMergeTree + a large number of ORDER BY fields


Friends, our project has a table like this:

CREATE TABLE events_1h 
(
`round_time` DateTime,

`dt` UInt8,

`aa_id` UInt64,
`bb_id` UInt64,
`cc_id` UInt64,

`cpu_architecture` String,

`browser_name` String,
`browser_version` String,
`browser_major` String,

`os_name` String,
`os_version` String,

`device_type` String,
`device_vendor` String,
`device_model` String,

`country` FixedString(2),
`city` UInt32,

`aso` UInt32,
`asn` UInt32,

`referer` String,

`request` UInt32,
`answer` UInt32,

`Impression` UInt32,

`Error` UInt32,

`start` UInt32,

-- Many other events here

)
ENGINE = SummingMergeTree
PRIMARY KEY (round_time, dt, aa_id, bb_id, cc_id)
ORDER BY (round_time, dt, aa_id, bb_id, cc_id, cpu_architecture, browser_name, browser_version, browser_major, os_name, os_version, device_type, device_vendor, device_model, country, city, aso, asn, referer);

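The number of ORDER BY keys for the metrics is large. This is necessary so that the data can be sliced by any combination of the metrics. If we created separate tables for the required combinations, the event fields would be duplicated, and there are several dozen of them.

The question is whether it makes sense to replace the large ORDER BY with a custom hash field and sum the data by that hash, keeping the current ORDER BY fields only as indexes to speed up queries. Or is ORDER BY itself already converted into a hash under the hood, so that this approach is pointless? In general, I would be interested to learn the usual practice for optimizing such cases. Thanks!

I expect both the number of metrics and the data volume to grow in the future, which may lead to performance problems.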

database bigdata clickhouse
1 Answer

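A custom hash field does not make sense.

SummingMergeTree is only worth using when the number of rows after GROUP BY <all_fields_from_ORDER_BY> is at least 3-5 times smaller than the number of rows in the original data.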
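One way to check that ratio is to compare the raw row count with the number of distinct combinations of the ORDER BY fields. The query below is only a sketch: `events_raw` is a hypothetical name for the table holding the data before aggregation (the question does not name it), and it is assumed to have the same columns as events_1h.

```sql
-- Sketch: estimate how much SummingMergeTree could collapse the data.
-- events_raw is a hypothetical source table, not named in the question.
SELECT
    count() AS raw_rows,
    -- number of distinct key combinations = rows left after full merging
    uniqExact(
        round_time, dt, aa_id, bb_id, cc_id,
        cpu_architecture, browser_name, browser_version, browser_major,
        os_name, os_version, device_type, device_vendor, device_model,
        country, city, aso, asn, referer
    ) AS grouped_rows,
    raw_rows / grouped_rows AS reduction_ratio
FROM events_raw;
```

If reduction_ratio comes out well below 3, the long sorting key is mostly adding storage and merge cost without collapsing many rows.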

It is better to try engine=MergeTree with multiple projections that use GROUP BY for the different, most frequently used field combinations.

See the docs: https://clickhouse.com/docs/en/sql-reference/statements/alter/projection#example-pre-aggregation-query
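A pre-aggregating projection for one such combination could look roughly like this. It is a sketch under assumptions: the table is assumed to have been recreated with ENGINE = MergeTree as suggested above, the projection name `by_country` and the chosen combination (dt, aa_id, country) are made up for illustration, and only three of the event columns are summed.

```sql
-- Sketch: one projection per frequently queried combination of fields.
-- Assumes events_1h was recreated with ENGINE = MergeTree.
-- by_country is an illustrative name, not from the question.
ALTER TABLE events_1h
    ADD PROJECTION by_country
    (
        SELECT
            dt,
            aa_id,
            country,
            sum(request),
            sum(answer),
            sum(Impression)
        GROUP BY dt, aa_id, country
    );

-- Build the projection for parts that already exist.
ALTER TABLE events_1h MATERIALIZE PROJECTION by_country;

-- A query with the same grouping and aggregates can then be served
-- from the projection instead of the full table.
SELECT dt, aa_id, country, sum(request), sum(answer), sum(Impression)
FROM events_1h
GROUP BY dt, aa_id, country;
```

Each projection stores its own pre-aggregated copy of the data inside the parts, so it is worth adding them only for the combinations that are actually queried often.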

It is also better to move round_time to the end of the primary key:

PRIMARY KEY (dt, aa_id, bb_id, cc_id, round_time)

See the docs: https://kb.altinity.com/engines/mergetree-table-engine-family/pick-keys/
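Note that in ClickHouse the PRIMARY KEY has to be a prefix of the ORDER BY tuple, so the sorting key needs to be reordered the same way. A sketch of the two clauses together, using only the columns from the question (the order of the remaining columns is kept as in the original table and is not a recommendation):

```sql
-- Sketch: round_time moved after the low-cardinality / id columns
-- in both keys, so that PRIMARY KEY stays a prefix of ORDER BY.
PRIMARY KEY (dt, aa_id, bb_id, cc_id, round_time)
ORDER BY (dt, aa_id, bb_id, cc_id, round_time,
          cpu_architecture, browser_name, browser_version, browser_major,
          os_name, os_version, device_type, device_vendor, device_model,
          country, city, aso, asn, referer);
```

As far as I know, the leading key columns of an existing table cannot be reordered with ALTER, so this would normally mean creating a new table and copying the data with INSERT ... SELECT.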
