How do I get the first N values for multiple columns?


Suppose I have the following table:

WITH tbl AS (
    SELECT 1 AS id, "Phone" AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, "Camera" AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, "Cup" AS product, 103 AS cost
) SELECT * FROM tbl

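How can I get N values for each column? For example, to show sample values without having to run a query per column? In other words, I want to grab them all in one go. So far I have something like this: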

WITH tbl AS (
    SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, 'Cup' AS product, 103 AS cost
) 
SELECT 
    ARRAY_AGG(DISTINCT id LIMIT 2),
    ARRAY_AGG(DISTINCT product LIMIT 2),
    ARRAY_AGG(DISTINCT cost LIMIT 2)
FROM tbl

This works, but it seems very inefficient (I believe it is the same as running one query per column). What is a better way to do this?

Or, to generalize, here is what I think is a poor approach, but one that works outside of BQ:

WITH tbl AS (
    SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, 'Cup' AS product, 103 AS cost
)  
select 'id' as field, array(select distinct cast(id as string) from tbl limit 2) as values union all
select 'product', array(select distinct cast(product as string) from tbl limit 2) union all
select 'cost', array(select distinct cast(cost as string) from tbl limit 2);
sql postgresql google-bigquery
1 Answer

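Your question leaves a lot of room for interpretation. You mention "first", but don't define it. Your queries use DISTINCT, but nothing was said about that before. Your sample shows neither null values nor duplicates, and it is unclear how those should be handled.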

This runs a single, very cheap sequential scan and stops at the small LIMIT:

SELECT array_agg(id      ORDER BY id     ) AS ids
     , array_agg(product ORDER BY product) AS products
     , array_agg(cost    ORDER BY cost   ) AS costs
FROM  (
   SELECT id, product, cost 
   FROM   tbl
   -- no ORDER BY, take arbitrary rows cheaply
   LIMIT  2
   ) sub;

You can get a more representative sample with the "ordered-set aggregate function" percentile_disc(). Like:

SELECT percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY id     ) AS pctl_id
     , percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY product) AS pctl_product   
     , percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY cost   ) AS pctl_cost
FROM   tbl;
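To make the result concrete, here is a minimal self-contained sketch of the same query run against the three sample rows from the question. It assumes PostgreSQL; the VALUES-based CTE is only a stand-in for the real table. The fractions 0, .5 and 1 pick the minimum, the (discrete) median and the maximum of each column.

```sql
-- PostgreSQL sketch: sample rows from the question, rebuilt as a CTE for illustration
WITH tbl (id, product, cost) AS (
   VALUES (1, 'Phone' , 105)
        , (2, 'Camera',  82)
        , (3, 'Cup'   , 103)
   )
SELECT percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY id     ) AS pctl_id
     , percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY product) AS pctl_product
     , percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY cost   ) AS pctl_cost
FROM   tbl;
-- pctl_id: {1,2,3} | pctl_product: {Camera,Cup,Phone} | pctl_cost: {82,103,105}
```

With only three rows, every value shows up. On a larger table the same three fractions would still return exactly three values per column.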

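This way, you choose the positions in the sort order and how many values to pick from each column, while still scanning the table only once. Or, for big tables, base it on a small semi-random sample to make it cheaper (while being less representative):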

SELECT percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY id     ) AS pctl_id
     , ...
FROM   tbl  TABLESAMPLE SYSTEM (10);

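Null or duplicate values get no special treatment in any of these. Any combination of these features (and more) can be used to optimize for performance, randomness, validity, ...
You just need to define exactly what you need.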
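As one illustration of pinning the requirements down, the sketch below assumes the goal is "the two smallest distinct, non-null values per column" and that the database is PostgreSQL. That definition is an assumption, not something stated in the question, and the extra duplicate and null row in the CTE is made up for the demonstration. The query still reads the table once, but it aggregates every distinct value before slicing, so it is likely to be expensive on large tables.

```sql
-- PostgreSQL sketch: hypothetical sample data with one duplicate cost and one null product
WITH tbl (id, product, cost) AS (
   VALUES (1, 'Phone' , 105)
        , (2, 'Camera',  82)
        , (3, 'Cup'   , 103)
        , (4, NULL    ,  82)
   )
SELECT (array_agg(DISTINCT id      ORDER BY id     ) FILTER (WHERE id      IS NOT NULL))[1:2] AS ids
     , (array_agg(DISTINCT product ORDER BY product) FILTER (WHERE product IS NOT NULL))[1:2] AS products
     , (array_agg(DISTINCT cost    ORDER BY cost   ) FILTER (WHERE cost    IS NOT NULL))[1:2] AS costs
FROM   tbl;
-- ids: {1,2} | products: {Camera,Cup} | costs: {82,103}
```

FILTER drops the nulls before aggregation, DISTINCT removes duplicates, and the [1:2] slice keeps the first two elements of each sorted array.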
