I have a Hive table that looks like this:
event_name | attendees_per_countries |
---|---|
a | {'US':5} |
b | {'US':4, 'UK': 3, 'CA': 2} |
c | {'UK':2, 'CA': 1} |
I want a new table like this:
country | no_of_people |
---|---|
US | 9 |
UK | 5 |
CA | 3 |
How can I write this query in Hive or Presto?
If the column type of attendees_per_countries is string, you can use the following:
WITH sample_data AS (
    SELECT
        event_name,
        -- strip the braces, then split into entries on ',' and into key/value on ':'
        str_to_map(
            regexp_replace(attendees_per_countries, '[{}]', ''),
            ',',
            ':'
        ) AS attendees_per_countries
    FROM
        raw_data
)
SELECT
    -- the keys still carry single quotes and spaces, so remove them
    regexp_replace(cm.key, "[' ]", "") AS country,
    SUM(cm.value) AS no_of_people
FROM sample_data
LATERAL VIEW explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key, "[' ]", "")
ORDER BY no_of_people DESC
However, if the column type of attendees_per_countries is already map, you can use the following:
SELECT
    regexp_replace(cm.key, "[' ]", "") AS country,
    SUM(cm.value) AS no_of_people
-- read the table directly; no string-to-map conversion step is needed here
FROM raw_data
LATERAL VIEW explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key, "[' ]", "")
ORDER BY no_of_people DESC
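For the map case, a small self-contained Hive sketch may help for trying this out. It assumes the column is a real map<string,int>, built here with Hive's map() function, so the keys carry no quotes and need no cleanup. The column aliases country and people are illustrative names, not part of the original table:

```sql
-- Sketch only: sample data with a genuine map<string,int> column
WITH raw_data AS (
    SELECT 'a' AS event_name, map('US', 5) AS attendees_per_countries
    UNION ALL
    SELECT 'b', map('US', 4, 'UK', 3, 'CA', 2)
    UNION ALL
    SELECT 'c', map('UK', 2, 'CA', 1)
)
SELECT
    cm.country,
    SUM(cm.people) AS no_of_people
FROM raw_data
-- explode() on a map emits one row per entry; name the key and value columns explicitly
LATERAL VIEW explode(attendees_per_countries) cm AS country, people
GROUP BY cm.country
ORDER BY no_of_people DESC;
```

Naming the exploded columns avoids relying on the default key and value names.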
Here is a complete reproducible example for the string case:
WITH raw_data AS (
    SELECT 'a' AS event_name, "{'US':5}" AS attendees_per_countries
    UNION ALL
    SELECT 'b', "{'US':4, 'UK': 3, 'CA': 2}"
    UNION ALL
    SELECT 'c', "{'UK':2, 'CA': 1}"
),
sample_data AS (
    SELECT
        event_name,
        str_to_map(
            regexp_replace(attendees_per_countries, '[{}]', ''),
            ',',
            ':'
        ) AS attendees_per_countries
    FROM
        raw_data
)
SELECT
    regexp_replace(cm.key, "[' ]", "") AS country,
    SUM(cm.value) AS no_of_people
FROM sample_data
LATERAL VIEW explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key, "[' ]", "")
ORDER BY no_of_people DESC
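To see why the outer query cleans cm.key, it can help to look at the intermediate map for a single row. The sketch below runs the same conversion on one literal string; the alias attendees_map is an illustrative name. Because only the braces are stripped, the keys should still contain the single quotes, and every entry after the first should keep its leading space:

```sql
-- Sketch only: inspect what str_to_map produces for one sample value
SELECT
    str_to_map(
        regexp_replace("{'US':4, 'UK': 3, 'CA': 2}", '[{}]', ''),
        ',',   -- delimiter between entries
        ':'    -- delimiter between key and value
    ) AS attendees_map;
```

Without the regexp_replace on the key, 'US' from event a and 'US' from event b would still group together, but a key such as " 'UK'" (with a leading space) and "'UK'" would end up in separate groups.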
Let me know if this works for you.
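If you have attendees_per_countries as a map, you can use map_values and then sum the values with array_sum or reduce (I need to use the latter because Athena does not support the former). If not, you can treat the data as JSON, cast it to MAP(VARCHAR, INTEGER), and then use the functions mentioned: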
WITH dataset(event_name, attendees_per_countries) AS (
    VALUES
        ('a', JSON '{"US":5}'),
        ('b', JSON '{"US":4, "UK": 3, "CA": 2}'),
        ('c', JSON '{"UK":2, "CA": 1}')
)
SELECT
    event_name,
    -- fold the map's values into a single total per event
    reduce(
        map_values(CAST(attendees_per_countries AS MAP(VARCHAR, INTEGER))),
        0,
        (agg, curr) -> agg + curr,
        s -> s
    ) AS number_of_people
FROM dataset
ORDER BY 2 DESC
Output:
event_name | number_of_people |
---|---|
b | 9 |
a | 5 |
c | 3 |
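The query above produces one total per event. For per-country totals as asked in the question, one option in Presto is to expand each map into rows with UNNEST and then aggregate. This is a minimal sketch on the same sample dataset, not a tested Athena query; the alias names t, country and people are illustrative:

```sql
-- Sketch only: per-country totals in Presto
WITH dataset(event_name, attendees_per_countries) AS (
    VALUES
        ('a', JSON '{"US":5}'),
        ('b', JSON '{"US":4, "UK": 3, "CA": 2}'),
        ('c', JSON '{"UK":2, "CA": 1}')
)
SELECT
    t.country,
    sum(t.people) AS number_of_people
FROM dataset
-- UNNEST on a MAP yields one row per entry, with a key column and a value column
CROSS JOIN UNNEST(CAST(attendees_per_countries AS MAP(VARCHAR, INTEGER))) AS t(country, people)
GROUP BY t.country
ORDER BY number_of_people DESC
```

For the sample data this should return US 9, UK 5 and CA 3.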