How can I reuse values from earlier column definitions in an Athena SQL query?


My data looks like this:

idx,year,month,day,metadata,not_impt,metricx
123,2022,12,02,"blah blah","lah lah",-123.94
123,2022,11,05,"blah blah asd","lah lah",62.4
123,2022,12,03,"blah blah asd","lah lah",39.512
123,2022,12,09,"blah blah","lah lah",12.412
123,2022,11,19,"blah blah","lah lah",24.43
123,2022,11,26,"blah blahac ","lah lah",94.94
987,2022,12,12,"blah blah","lah lah",-23.94
987,2022,11,15,"blah blahvs","lah lah",42.4
987,2022,11,03,"blah blah","lah lah",32.512
987,2022,12,04,"blah blah kams","lahada lah",19.412
987,2022,12,19,"blah blah","lah lah",21.43
987,2022,11,26,"blah blah","lah lah",74.94

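They have been read into an Athena view, tablex, which has a real-valued column metricx with values in the range [-500, +500]. I want to compute some statistics on it in a new view.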

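The goal is to create a new view that:

  • groups by idx, year, month
  • avg_metric: computes the average of metricx for each group
  • norm_metric: computes the min-max normalized value of the averaged metricx
  • errorbar_top_metric, errorbar_bottom_metric: compute the top and bottom error-bar values for a 95% confidence interval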
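Written out, and assuming the columns are defined the way the query in this question computes them, the normalized value and the error bars are as follows. The symbols are shorthand introduced here, not names from the data: $\bar{x}$, $x_{\min}$, $x_{\max}$, $\sigma$ and $n$ stand for the per-group mean, minimum, maximum, population standard deviation and row count of metricx.

```latex
\[
\text{norm\_metric} = \frac{\bar{x} - x_{\min}}{x_{\max} - x_{\min}},
\qquad
\text{errorbar\_top/bottom\_metric} = \text{norm\_metric} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}
\]
```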

I have tried the following and it works, but I am not reusing the aggregated values when computing the error-bar values:

CREATE VIEW metric_stats AS  -- the post omitted the view name; metric_stats is a placeholder
SELECT idx,
  concat(cast(year as varchar), '-', cast(month as varchar)) as date,
  count(*) as num_rows,
  AVG(metricx) as avg_metric,
  ((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) as norm_metric,

  (
    ((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) -
    ((STDDEV_POP(metricx) / SQRT(count(*))) * 1.96)
  ) as errorbar_bottom_metric,

  (
    ((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) +
    ((STDDEV_POP(metricx) / SQRT(count(*))) * 1.96)
  ) as errorbar_top_metric

FROM tablex
GROUP BY idx, year, month

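Although the SQL works, there is some duplication because I am not reusing the values computed in earlier columns.

Given that the data is not that large (< 100,000 rows), is there a cleaner way to write the SQL without copying and pasting the computation of the previous columns?

Maybe a nested SELECT query?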

sql amazon-web-services amazon-athena normalization minmax
1 Answer

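If you have the reader in mind, this is mainly a question of readability.
I think a nested query will be easier for most readers to understand. It lets them follow the calculation step by step and see the purpose and result of each stage.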

SELECT idx,
  -- varchar as in the question, for Athena; MySQL's CAST would need char here
  concat(cast(year as varchar), '-', cast(month as varchar)) as date,
  cnt as num_rows,
  avg_metricx as avg_metric,
  (avg_min_metricx / max_min_metricx) as norm_metric,
  ((avg_min_metricx / max_min_metricx) - std_err) as errorbar_bottom_metric,
  ((avg_min_metricx / max_min_metricx) + std_err) as errorbar_top_metric
FROM (
  -- middle level: derive the reusable terms from the aggregates
  SELECT idx, year, month,
    cnt, avg_metricx, min_metricx, max_metricx, stddev_metricx,
    (stddev_metricx / SQRT(cnt)) * 1.96 as std_err,
    (avg_metricx - min_metricx) as avg_min_metricx,
    (max_metricx - min_metricx) as max_min_metricx
  FROM (
    -- innermost level: aggregate once per (idx, year, month)
    SELECT idx, year, month,
      count(*) as cnt,
      AVG(metricx) as avg_metricx,
      MIN(metricx) as min_metricx,
      MAX(metricx) as max_metricx,
      STDDEV_POP(metricx) as stddev_metricx
    FROM tablex
    GROUP BY idx, year, month
  ) x
) y;
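The same layering can also be written with common table expressions, which some readers find easier to follow than nested subqueries. The following is only a sketch for Athena: the view name metric_stats and the CTE names agg and derived are hypothetical, ci_half_width is a name introduced here for the 1.96 * standard error term, and the nullif guard against groups where max equals min is an addition that the original queries do not have.

```sql
-- Sketch only: metric_stats, agg, derived and ci_half_width are
-- hypothetical names, not taken from the original post.
CREATE OR REPLACE VIEW metric_stats AS
WITH agg AS (
  -- Aggregate once per (idx, year, month) group.
  SELECT idx, year, month,
         count(*)            AS cnt,
         avg(metricx)        AS avg_metricx,
         min(metricx)        AS min_metricx,
         max(metricx)        AS max_metricx,
         stddev_pop(metricx) AS stddev_metricx
  FROM tablex
  GROUP BY idx, year, month
),
derived AS (
  -- Reuse the aggregates by name instead of repeating the expressions.
  -- nullif (an added guard) yields NULL when max = min in a group.
  SELECT idx, year, month, cnt, avg_metricx,
         (avg_metricx - min_metricx)
           / nullif(max_metricx - min_metricx, 0) AS norm_metric,
         (stddev_metricx / sqrt(cnt)) * 1.96      AS ci_half_width
  FROM agg
)
SELECT idx,
       concat(cast(year AS varchar), '-', cast(month AS varchar)) AS date,
       cnt                         AS num_rows,
       avg_metricx                 AS avg_metric,
       norm_metric,
       norm_metric - ci_half_width AS errorbar_bottom_metric,
       norm_metric + ci_half_width AS errorbar_top_metric
FROM derived
```

Each expression is defined once and then referenced by its alias, so a change to the normalization or the confidence multiplier only has to be made in one place.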

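Below are the execution plans on MySQL 8.0 (not Athena), after creating this index:

create index ix_id_dt on tablex (idx,year,month);

Without knowing the structure and size of your data, there is not much more to say here.

EXPLAIN for the query above: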

EXPLAIN
-> Table scan on x  (cost=0.10..2.88 rows=30) (actual time=0.319..0.321 rows=10 loops=1)
    -> Materialize  (cost=9.35..12.12 rows=30) (actual time=0.318..0.318 rows=10 loops=1)
        -> Group aggregate: count(0), avg(tablex.metricx), min(tablex.metricx), max(tablex.metricx), std(tablex.metricx)  (cost=6.25 rows=30) (actual time=0.073..0.148 rows=10 loops=1)
            -> Index scan on tablex using ix_id_dt  (cost=3.25 rows=30) (actual time=0.034..0.105 rows=30 loops=1)

And for your current query:

EXPLAIN
-> Group aggregate: count(0), std(tablex.metricx), min(tablex.metricx), max(tablex.metricx), min(tablex.metricx), avg(tablex.metricx), count(0), std(tablex.metricx), min(tablex.metricx), max(tablex.metricx), min(tablex.metricx), avg(tablex.metricx), min(tablex.metricx), max(tablex.metricx), min(tablex.metricx), avg(tablex.metricx), count(0), avg(tablex.metricx)  (cost=6.25 rows=30) (actual time=0.027..0.142 rows=10 loops=1)
    -> Index scan on tablex using ix_id_dt  (cost=3.25 rows=30) (actual time=0.013..0.105 rows=30 loops=1)
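
Both plans read tablex once through the index and aggregate once. The nested form only adds a materialize step for the derived table. These plans come from MySQL, though, so they only hint at what Athena will do. To check on Athena itself, a sketch like the following may help, assuming your Athena engine version supports the EXPLAIN statement:

```sql
-- Sketch: ask Athena for the plan of the aggregation step without running it.
EXPLAIN
SELECT idx, year, month,
       count(*)            AS cnt,
       avg(metricx)        AS avg_metricx,
       min(metricx)        AS min_metricx,
       max(metricx)        AS max_metricx,
       stddev_pop(metricx) AS stddev_metricx
FROM tablex
GROUP BY idx, year, month
```

Where it is available, EXPLAIN ANALYZE runs the query as well and reports actual statistics, which is closer to the "actual time" figures in the MySQL output above.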