GROUP BY and SUM with MAX()

Question (votes: 0, answers: 1)

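I have a dataset with Year, Country, Gender, and Population columns. I need to find the country with the largest population in the most recent year.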

a = group data by Country;
b = foreach a generate flatten(group), MAX(data.Year);
-- Up to here I am able to get the country and the latest year
-- SUM on data.Population is giving errors

I need the result in the following order: Country, Year, and Population (for that year only).

hadoop apache-pig hortonworks-data-platform
1 Answer

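After getting the maximum year for each country, join that dataset back to the originally loaded relation, then group by country and year to get the total.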

Assume you have loaded the data into a relation named data. Join data with b on country and year.

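-- Adjust the schema to match your file; the question's data also has a gender column.
data = load 'data_file' using PigStorage(',') as (country:chararray,year:int,population:int);
a = group data by country;
-- Latest year per country (field names are case-sensitive: year, not Year)
b = foreach a generate flatten(group) as country, MAX(data.year) as year;
-- Keep only the rows that belong to each country's latest year
c = join data by (country,year), b by (country,year);
-- After a join, fields are disambiguated with ::, not with a dot
c1 = foreach c generate data::country as country, data::year as year, data::population as population;
-- Sum the population per country for that year
d = group c1 by (country,year);
e = foreach d generate flatten(group) as (country,year), SUM(c1.population) as population;
dump e;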
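
If the goal is the single country with the largest population, as the first sentence of the question suggests, a sort and a limit can be added at the end. The following is only a sketch under assumptions: the file name data_file, the column order (year, country, gender, population) and the types are guessed from the question's description, and the relation names are illustrative. Note that it ranks each country by the total for its own latest year, which may differ between countries.

```pig
-- Assumed layout of each input line: year,country,gender,population
data = load 'data_file' using PigStorage(',')
       as (year:int, country:chararray, gender:chararray, population:long);

-- Latest year available for each country
by_country = group data by country;
latest = foreach by_country generate group as country, MAX(data.year) as year;

-- Keep only the rows from each country's latest year
joined = join data by (country, year), latest by (country, year);
latest_rows = foreach joined generate data::country as country,
                                      data::year as year,
                                      data::population as population;

-- Total population per country for that year (sums over the gender rows)
grouped = group latest_rows by (country, year);
totals = foreach grouped generate flatten(group) as (country, year),
                                  SUM(latest_rows.population) as population;

-- Largest total first, then keep the top row
sorted = order totals by population desc;
top1 = limit sorted 1;
dump top1;
```

Dropping the last three statements and dumping totals instead gives the per-country result in the Country, Year, Population order asked for above.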