My data looks like this:
idx,year,month,day,metadata,not_impt,metricx
123,2022,12,02,"blah blah","lah lah",-123.94
123,2022,11,05,"blah blah asd","lah lah",62.4
123,2022,12,03,"blah blah asd","lah lah",39.512
123,2022,12,09,"blah blah","lah lah",12.412
123,2022,11,19,"blah blah","lah lah",24.43
123,2022,11,26,"blah blahac ","lah lah",94.94
987,2022,12,12,"blah blah","lah lah",-23.94
987,2022,11,15,"blah blahvs","lah lah",42.4
987,2022,11,03,"blah blah","lah lah",32.512
987,2022,12,04,"blah blah kams","lahada lah",19.412
987,2022,12,19,"blah blah","lah lah",21.43
987,2022,11,26,"blah blah","lah lah",74.94
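It has been read into an Athena view, tablex, which has a real-valued column, metricx. I want to compute some statistics on it in a new view; the values of metricx lie in the range [-500, +500].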
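The goal is to create a new view, grouped by idx, year and month, with these columns:
avg_metric: the average of metricx within each group
norm_metric: the average of metricx, min-max normalized within the group
errorbar_top_metric and errorbar_bottom_metric: the top and bottom error-bar values for the 95% confidence interval
I have tried the following and it works, but it does not reuse the aggregated values when computing the error-bar values: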
CREATE VIEW tablex_stats AS  -- hypothetical view name; the original post omitted it
SELECT idx,
    concat(cast(year as varchar), '-', cast(month as varchar)) as date,
    count(*) as num_rows,
    AVG(metricx) as avg_metric,
    ((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) as norm_metric,
    (
        ((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) -
        ((STDDEV_POP(metricx) / SQRT(count(*))) * 1.96)
    ) as errorbar_bottom_metric,
    (
        ((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) +
        ((STDDEV_POP(metricx) / SQRT(count(*))) * 1.96)
    ) as errorbar_top_metric
FROM tablex
GROUP BY idx, year, month
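Although the SQL works, there is duplication because I am not reusing the values computed in the earlier columns.
Given that the data is not that large (< 100,000 rows), is there a cleaner way to write the SQL without copying and pasting the computations from the previous columns?
Perhaps a nested SELECT query?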
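If you have your readers in mind, the thing to talk about is readability.
I think a nested query will be easier for most readers to understand: going through the calculation step by step shows the purpose and the result of each computation.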
-- Outer level: final column names and the error bars.
-- cast(... as char) is the MySQL spelling; the question's Athena query uses varchar.
SELECT idx,
    concat(cast(year as char), '-', cast(month as char)) as date,
    cnt as num_rows,
    avg_metricx as avg_metric,
    ((avg_min_metricx) / (max_min_metricx)) as norm_metric,
    (((avg_min_metricx) / (max_min_metricx)) - std_err) as errorbar_bottom_metric,
    (((avg_min_metricx) / (max_min_metricx)) + std_err) as errorbar_top_metric
from (
    -- Middle level: values derived from the aggregates.
    select idx, year, month
        ,cnt, avg_metricx, min_metricx, max_metricx, stddev_metricx
        ,(stddev_metricx / SQRT(cnt)) * 1.96 as std_err
        ,avg_metricx - min_metricx as avg_min_metricx
        ,(max_metricx - min_metricx) as max_min_metricx
    from (
        -- Inner level: each aggregate is computed once per group.
        select idx, year, month
            ,count(*) as cnt
            ,AVG(metricx) as avg_metricx
            ,MIN(metricx) as min_metricx
            ,MAX(metricx) as max_metricx
            ,STDDEV_POP(metricx) as stddev_metricx
        from tablex
        GROUP BY idx, year, month
    ) x
) y;
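The same three steps can also be written as common table expressions, which some readers find easier to follow than nested subqueries because the steps read top to bottom. The following is only a sketch in Athena (Trino/Presto) syntax, not something taken from the question: the view name tablex_stats and the CTE names agg and derived are hypothetical, and the NULLIF guard against a group whose MAX equals its MIN is an added assumption that you can drop if it does not match the behaviour you want.

```sql
-- Sketch for Athena; tablex_stats, agg and derived are hypothetical names.
CREATE OR REPLACE VIEW tablex_stats AS
WITH agg AS (
    -- Step 1: compute each aggregate once per (idx, year, month) group.
    SELECT idx, year, month,
           COUNT(*)            AS cnt,
           AVG(metricx)        AS avg_metricx,
           MIN(metricx)        AS min_metricx,
           MAX(metricx)        AS max_metricx,
           STDDEV_POP(metricx) AS stddev_metricx
    FROM tablex
    GROUP BY idx, year, month
),
derived AS (
    -- Step 2: derive the normalized mean and the 1.96 * standard-error margin.
    -- NULLIF is an added assumption: it yields NULL when max = min in a group.
    SELECT idx, year, month, cnt, avg_metricx,
           (avg_metricx - min_metricx)
               / NULLIF(max_metricx - min_metricx, 0) AS norm_metric,
           (stddev_metricx / SQRT(cnt)) * 1.96        AS std_err
    FROM agg
)
-- Step 3: final column names, reusing norm_metric and std_err by name.
SELECT idx,
       CONCAT(CAST(year AS varchar), '-', CAST(month AS varchar)) AS date,
       cnt         AS num_rows,
       avg_metricx AS avg_metric,
       norm_metric,
       norm_metric - std_err AS errorbar_bottom_metric,
       norm_metric + std_err AS errorbar_top_metric
FROM derived;
```

The output columns match the nested query; the only intended differences are the layout and the optional NULLIF guard.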
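Execution plan on MySQL 8.0
create index ix_id_dt on tablex (idx,year,month);
Without knowing the structure and size of your data, there is not much more to say here.
Plan for the nested query above: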
EXPLAIN
-> Table scan on x (cost=0.10..2.88 rows=30) (actual time=0.319..0.321 rows=10 loops=1)
    -> Materialize (cost=9.35..12.12 rows=30) (actual time=0.318..0.318 rows=10 loops=1)
        -> Group aggregate: count(0), avg(tablex.metricx), min(tablex.metricx), max(tablex.metricx), std(tablex.metricx) (cost=6.25 rows=30) (actual time=0.073..0.148 rows=10 loops=1)
            -> Index scan on tablex using ix_id_dt (cost=3.25 rows=30) (actual time=0.034..0.105 rows=30 loops=1)
Plan for your current query:
EXPLAIN
-> Group aggregate: count(0), std(tablex.metricx), min(tablex.metricx), max(tablex.metricx), min(tablex.metricx), avg(tablex.metricx), count(0), std(tablex.metricx), min(tablex.metricx), max(tablex.metricx), min(tablex.metricx), avg(tablex.metricx), min(tablex.metricx), max(tablex.metricx), min(tablex.metricx), avg(tablex.metricx), count(0), avg(tablex.metricx) (cost=6.25 rows=30) (actual time=0.027..0.142 rows=10 loops=1)
    -> Index scan on tablex using ix_id_dt (cost=3.25 rows=30) (actual time=0.013..0.105 rows=30 loops=1)