Hadoop Pig and nested JSON


I have a list of movies, each with the ratings given by users.

    {"_id":59607,"title":"King Corn (2007)",
     "genres":["Documentary"],
     "ratings":[ {"userId":1860,"rating":3},
                {"userId":9970,"rating":3.5},
                {"userId":16929,"rating":1.5},
                {"userId":23473,"rating":4},
                {"userId":23733,"rating":4},
                {"userId":27584,"rating":3},
                {"userId":28232,"rating":4},
                {"userId":29482,"rating":3},
                {"userId":40976,"rating":5},
                {"userId":44631,"rating":4},
                {"userId":47613,"rating":3},
                {"userId":49763,"rating":3},
                {"userId":58160,"rating":4.5},
                {"userId":62249,"rating":3},
                {"userId":65923,"rating":4},
                {"userId":67507,"rating":4},
                {"userId":68259,"rating":3.5},
                {"userId":70331,"rating":5},
                {"userId":71420,"rating":3.5}
        ]
    }

I need to count the number of ratings made by each user. This is my attempt to get at the ratings:

a = load '/movies_1m.json' using JsonLoader('id:int, title : chararray, genres : { ( genre : chararray ) }, ratings: { ( userId : int, rating: float) } ');

Then:

b = FOREACH a GENERATE FLATTEN(ratings);

DESCRIBE gives me the following:

b: {ratings::userId: int,ratings::rating: float}

To do the count, I only need to access the userId inside ratings, but that is exactly the step that fails. I tried this:

c = FOREACH b GENERATE COUNT(ratings);

It gives me an error.

I need to get something like this:

 {userId: int, rating: float}
json hadoop apache-pig

1 Answer

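You need to GROUP before you can COUNT, because COUNT is an aggregate that operates on a bag of tuples, and after FLATTEN there is no bag left to count.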

b = FOREACH a GENERATE FLATTEN(ratings);
gr = GROUP b by ratings::userId;
c = FOREACH gr GENERATE group,COUNT($1);
\d c

Output:

Note that no user appears more than once in the example, so every count is 1.

(1860,1)
(9970,1)
(16929,1)
(23473,1)
(23733,1)
(27584,1)
(28232,1)
(29482,1)
(40976,1)
(44631,1)
(47613,1)
(49763,1)
(58160,1)
(62249,1)
(65923,1)
(67507,1)
(68259,1)
(70331,1)
(71420,1)
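
As a variation on the script above, the flattened fields can be renamed so that the `ratings::` prefix is not needed, and the bag can be passed to COUNT by name rather than by position. This is only a sketch: the aliases `by_user`, `counts` and `ordered` and the field name `num_ratings` are made up for illustration, and it assumes the same input file and schema as in the question.

```pig
-- Same loader and schema as in the question.
a = LOAD '/movies_1m.json' USING JsonLoader('id:int, title:chararray, genres:{(genre:chararray)}, ratings:{(userId:int, rating:float)}');

-- One row per rating; AS renames the fields so no ratings:: prefix is needed.
b = FOREACH a GENERATE FLATTEN(ratings) AS (userId, rating);

-- Each row of by_user holds the key (group) and a bag named b
-- containing all rating rows for that user.
by_user = GROUP b BY userId;

-- COUNT takes a bag, which is why it works here
-- but not directly on the flattened relation.
counts = FOREACH by_user GENERATE group AS userId, COUNT(b) AS num_ratings;

-- Optional: list the most active users first.
ordered = ORDER counts BY num_ratings DESC;
DUMP ordered;
```

On the full data set, users who rated several movies should show counts greater than 1; with only the single sample record, the result is the same list of ones as above.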