Suppose I have a table named t with two columns, foo and bar.
foo bar
=======
1 11
1 11
2 11
2 11
2 11
3 11
3 12
3 12
-------
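To reproduce the sample data, the table can be set up roughly as follows. This is only a sketch: it assumes a Hive or Spark SQL environment where INSERT ... VALUES is available, and it assumes both columns are BIGINT, which the post does not state explicitly.

```sql
-- Assumed schema: both columns are BIGINT.
CREATE TABLE t (foo BIGINT, bar BIGINT);

-- The eight sample rows shown in the table above.
INSERT INTO t VALUES
    (1, 11), (1, 11),
    (2, 11), (2, 11), (2, 11),
    (3, 11), (3, 12), (3, 12);
```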
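Now I want to count the occurrences of each distinct value of foo and of bar separately, and aggregate the counts into an ARRAY<MAP<BIGINT, BIGINT>>. In this example, foo = 1 occurs 2 times, foo = 2 occurs 3 times, foo = 3 occurs 3 times, bar = 11 occurs 6 times, and bar = 12 occurs 2 times. So the result table should look like this: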
name cnt
============================
"foo" [{1:2}, {2:3}, {3:3}]
"bar" [{11:6}, {12:2}]
----------------------------
My current approach looks something like this:
WITH t_foo AS (
    SELECT
        "foo" AS name,
        COLLECT_LIST(MAP(val, cnt)) AS cnt
    FROM (
        SELECT
            foo AS val,
            COUNT(*) AS cnt
        FROM
            t
        GROUP BY
            foo
    ) AS tt
),
t_bar AS (
    SELECT
        "bar" AS name,
        COLLECT_LIST(MAP(val, cnt)) AS cnt
    FROM (
        SELECT
            bar AS val,
            COUNT(*) AS cnt
        FROM
            t
        GROUP BY
            bar
    ) AS tt
)
SELECT * FROM t_foo
UNION ALL SELECT * FROM t_bar
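This works, but it looks quite repetitive. In fact, I have not only foo and bar but also a dozen or so other columns to process. Is there a smarter way to solve this?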
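To generalize this code fully you would need dynamic SQL, but that approach can be cumbersome and is open to SQL injection.
You can still do something without dynamic queries, though: compute the counts for all columns in a single CTE with window functions, then take the union of the COLLECT_LIST operations.
WITH cte AS (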
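    SELECT DISTINCT foo, COUNT(*) OVER (PARTITION BY foo) AS cnt_foo,
                    bar, COUNT(*) OVER (PARTITION BY bar) AS cnt_bar
    FROM t
)
-- cte holds one row per distinct (foo, bar) pair, so deduplicate per column
-- before collecting; otherwise {3:3} and {11:6} would be collected repeatedly.
SELECT "foo" AS name, COLLECT_LIST(MAP(foo, cnt_foo)) AS cnt
FROM (SELECT DISTINCT foo, cnt_foo FROM cte) AS f
UNION ALL
SELECT "bar" AS name, COLLECT_LIST(MAP(bar, cnt_bar)) AS cnt
FROM (SELECT DISTINCT bar, cnt_bar FROM cte) AS b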
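This should perform better than the original, but verify it on your engine: a CTE that is referenced several times is not necessarily computed only once.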
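If a dozen UNION ALL branches are still too repetitive, another option is to unpivot the columns into (name, val) rows first and aggregate once. The following is a sketch rather than a tested solution: it assumes a Hive or Spark SQL engine where the STACK table-generating function can be used in a LATERAL VIEW, and that all listed columns share the same type (BIGINT here). The aliases s, x, name, and val are arbitrary names chosen for illustration.

```sql
SELECT
    name,
    COLLECT_LIST(MAP(val, cnt)) AS cnt
FROM (
    -- One row per (column name, distinct value) with its number of occurrences.
    SELECT
        s.name,
        s.val,
        COUNT(*) AS cnt
    FROM t
    -- STACK(2, ...) turns each input row into 2 rows: ('foo', foo) and ('bar', bar).
    -- For more columns, raise the row count and append more 'name', column pairs.
    LATERAL VIEW STACK(2, 'foo', foo, 'bar', bar) s AS name, val
    GROUP BY s.name, s.val
) AS x
GROUP BY name
```

With this shape, adding a column means adding one pair to the STACK call instead of another subquery. Note that COLLECT_LIST does not guarantee the order of the maps inside each array.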