How to convert the following dictionary-format column into a different format in Hive or Presto?

Question

I have a Hive table like the one below:

event_name	attendees_per_countries
a	{'US':5}
b	{'US':4, 'UK':3, 'CA':2}
c	{'UK':2, 'CA':1}

I want a new table like this:

country	number_of_people
US	9
UK	5
CA	4

How can I write this query in Hive or Presto?

Answer 1

If the column type of attendees_per_countries is string, you can parse it into a map with str_to_map and then explode it:

WITH sample_data AS (
    select 
        event_name, 
        str_to_map(
            regexp_replace(attendees_per_countries,'[{}]',''),
            ',',
            ':'
        ) as attendees_per_countries 
    FROM
        raw_data
        
)
select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

However, if the column type of attendees_per_countries is already map, you can skip the parsing step and use the following:

select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

A full reproducible example is below:

with raw_data AS (
    select 'a' as event_name, "{'US':5}" as attendees_per_countries
    UNION ALL 
    select 'b', "{'US':4, 'UK': 3, 'CA': 2}"
    UNION ALL 
    select 'c', "{'UK':2, 'CA': 1}"
),
sample_data AS (
    select 
        event_name, 
        str_to_map(
            regexp_replace(attendees_per_countries,'[{}]',''),
            ',',
            ':'
        ) as attendees_per_countries 
    FROM
        raw_data
        
)
select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

Let me know if this works for you.

Answer 2

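If attendees_per_countries is a map, you can use map_values and then sum the values with array_sum or reduce (I had to use the latter because Athena does not support the former). If it is not a map, you can treat the data as JSON and cast it to MAP(VARCHAR, INTEGER), then use the functions mentioned: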

WITH dataset(event_name, attendees_per_countries) AS (
   VALUES 
('a',   JSON '{"US":5}'),
('b',   JSON '{"US":4, "UK": 3, "CA": 2}'),
('c',   JSON '{"UK":2, "CA": 1}')
 ) 
 
SELECT event_name as country,
       reduce(
               map_values(cast(attendees_per_countries as MAP(VARCHAR, INTEGER))),
               0,
               (agg, curr) -> agg + curr,
               s -> s
           )      as number_of_people
FROM dataset
order by 2 desc

Output:

country	number_of_people
b	9
a	5
c	3
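
The query above totals attendees per event (the event names end up under the country column), whereas the question asks for totals per country. A minimal Presto sketch of the per-country version follows. It assumes the column is a VARCHAR holding single-quoted dictionary text as in the question, so the quotes are swapped to make it valid JSON before casting; the names dataset, parsed, attendees, and t are illustrative only.

```sql
-- Sketch only: sample rows mirror the question; the column is assumed to be
-- a VARCHAR holding single-quoted, dictionary-style text.
WITH dataset(event_name, attendees_per_countries) AS (
    VALUES
        ('a', '{''US'':5}'),
        ('b', '{''US'':4, ''UK'': 3, ''CA'': 2}'),
        ('c', '{''UK'':2, ''CA'': 1}')
),
parsed AS (
    SELECT
        event_name,
        -- Swap single quotes for double quotes so the text is valid JSON,
        -- then cast the parsed JSON to a typed map.
        CAST(
            json_parse(replace(attendees_per_countries, '''', '"'))
            AS MAP(VARCHAR, INTEGER)
        ) AS attendees
    FROM dataset
)
SELECT
    t.country,
    SUM(t.people) AS number_of_people
FROM parsed
-- UNNEST on a map yields one row per entry, with key and value columns.
CROSS JOIN UNNEST(attendees) AS t(country, people)
GROUP BY t.country
ORDER BY number_of_people DESC
```

With these sample rows the totals are US 9, UK 5, and CA 3 (2 + 1). If the column is already a map, the parsed step can be dropped and the UNNEST applied to the column directly.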
