Pig script to compute totals, percentages, and group by


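I am relatively new to Pig scripts. In the script below I derive error details grouped by error code and error name, along with their respective counts.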

A = LOAD 'traffic_error_details.txt' USING 
    PigStorage(',') AS (id:int, error_code:chararray, error_name:chararray, error_status:int);
B = FOREACH A GENERATE error_code AS errorCode, error_name AS errorName,
    error_status AS errorStatus;

C = GROUP B by ($0,$1,$2);
F = FOREACH C GENERATE group, COUNT(B) as count;
Dump F;

The above gives results like the following:

  1. INVALID_PARAM,REQUEST_ERROR,10
  2. INTERNAL_ERROR,SERVER_ERROR,15
  3. NOT_ALLOWED,ACCESS_ERROR,4
  4. UNKNOWN_ERR,UNKNOWN_ERROR,10
  5. Blue,Blue,ga

I want to display the percentage for each error, as shown below:

  1. INVALID_PARAM,REQUEST_ERROR,10,20%
  2. INTERNAL_ERROR,SERVER_ERROR,15,30%
  3. NOT_ALLOWED,ACCESS_ERROR,4,9%
  4. UNKNOWN_ERR,UNKNOWN_ERROR,10,20%
  5. Blue,Blue,Gray,21%

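The total number of requests considered here is 50, of which 21% succeeded. The rest is split across the error percentages. So how can I compute the total in the same script and in the same tuple, so that the percentage can be calculated as (count / total) * 100? Total refers to the count of all records in traffic_error_details.txt.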

apache-pig
1 Answer

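After you have the count for each error code, you need to perform a GROUP ALL to find the total number of errors and add that field to every row. You can then divide each error code's count by the total count to get the percentage. Make sure to cast the count variables from type long to type double to avoid any integer division problems.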

Here is the code:

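-- Load the raw records; each line is id,errorCode,errorName,errorStatus
A = LOAD 'traffic_error_details.txt' USING PigStorage(',') AS 
    (id:int, errorCode:chararray, errorName:chararray, errorStatus:int);
B = FOREACH A GENERATE errorCode, errorName, errorStatus;
-- Count the records in each (errorCode, errorName, errorStatus) group
C = GROUP B BY (errorCode, errorName, errorStatus);
D = FOREACH C GENERATE 
    FLATTEN(group) AS (errorCode, errorName, errorStatus),
    COUNT(B) AS num;
-- Put all per-group rows into one bag so the grand total can be attached to each row
E = GROUP D ALL;
F = FOREACH E GENERATE 
    FLATTEN(D) AS (errorCode, errorName, errorStatus, num),
    SUM(D.num) AS num_total;
-- Cast to double before dividing to avoid integer division, then scale to a percentage
G = FOREACH F GENERATE 
    errorCode, 
    errorName, 
    errorStatus,
    num,
    ((double)num / (double)num_total) * 100.0 AS percent;
DUMP G;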

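You will notice that I modified your code slightly. I grouped by (errorCode, errorName, errorStatus) instead of ($0,$1,$2). Referring to fields by name rather than by position is safer in case you modify the code later and the positions change.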
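To sanity-check the arithmetic outside Pig, the same data flow (count per group, one grand total, then count / total) can be mirrored in a few lines of Python. This is only a rough sketch under stated assumptions: the sample rows and the function name `error_percentages` are made up for illustration and are not part of the question's data or of any Pig API. The result is scaled by 100, following the question's formula (count / total) * 100.

```python
from collections import Counter

# Hypothetical sample rows: (id, error_code, error_name, error_status).
# These values are invented for illustration only.
records = [
    (1, "INVALID_PARAM", "REQUEST_ERROR", 400),
    (2, "INVALID_PARAM", "REQUEST_ERROR", 400),
    (3, "INTERNAL_ERROR", "SERVER_ERROR", 500),
    (4, "NOT_ALLOWED", "ACCESS_ERROR", 403),
]


def error_percentages(rows):
    """Return (error_code, error_name, error_status, count, percent) per group."""
    # Equivalent of GROUP BY (error_code, error_name, error_status) + COUNT
    counts = Counter((code, name, status) for _id, code, name, status in rows)
    # Equivalent of GROUP ALL + SUM(num): one total shared by every row
    total = sum(counts.values())
    # Pig needs an explicit (double) cast here; Python 3's "/" is already
    # floating-point division, so no cast is required.
    return [
        (code, name, status, num, 100.0 * num / total)
        for (code, name, status), num in sorted(counts.items())
    ]


for row in error_percentages(records):
    print(row)
```

Because every group is divided by the same total, the percentages should add up to 100, which is a quick way to check the Pig output as well.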
