How to group without duplicates - Apache Pig


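I need to find the order value (unit price multiplied by product quantity). However, my result shows the order_id repeated. How do I remove the duplicates so that I get the order_id followed by the order value? Any help is appreciated. Thanks!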

Code:

orderdetails = load '/user/bigdata/order_detail.tbl' using PigStorage('|') as

(ORDER_ID:int,PRODUCT_ID:int,CUSTOMER_ID:int,SALESPERSON_ID:int,UNIT_PRICE:float,QUANTITY:int,DISCOUNT:float);

ordervalue = FOREACH orderdetails GENERATE ORDER_ID, UNIT_PRICE*QUANTITY as VALUE;

order_filter = FILTER ordervalue BY (ORDER_ID > 10269) AND (ORDER_ID < 10280);

groupOrder = GROUP order_filter BY (ORDER_ID);

groupOrdersum = FOREACH groupOrder GENERATE (order_filter.ORDER_ID),SUM(order_filter.VALUE) as ORDERVALUE;

dump groupOrdersum;

Result:

({(10270),(10270)},1376.0)
({(10271)},48.0)
({(10272),(10272),(10272)},1455.9999694824219)
({(10273),(10273),(10273),(10273),(10273)},2142.399932861328)
({(10274),(10274)},538.5999908447266)
({(10275),(10275)},307.1999969482422)
({(10276),(10276)},420.0)
({(10277),(10277)},1200.8000183105469)
({(10278),(10278),(10278),(10278)},1488.7999877929688)
({(10279)},468.0)
apache-spark hadoop apache-pig
1 Answer

I think you need to change:

groupOrdersum = FOREACH groupOrder GENERATE (order_filter.ORDER_ID),SUM(order_filter.VALUE) as ORDERVALUE;

to:

groupOrdersum = FOREACH groupOrder GENERATE
    group AS ORDERID,
    SUM(order_filter.VALUE) as ORDERVALUE;

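Right now you are generating the bag of order IDs taken from the grouped rows (i.e. the values), not the actual grouping key. After GROUP, each tuple holds the key in a field named group plus a bag named order_filter containing every row that matched that key, so order_filter.ORDER_ID returns one ID per row in the bag.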
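For reference, here is a sketch of the whole script with that change applied. It reuses the path, schema, and aliases from the question. The DESCRIBE statement is an addition for inspection only and was not part of the original code.

```pig
-- Load the pipe-delimited order lines (path and schema as given in the question).
orderdetails = LOAD '/user/bigdata/order_detail.tbl' USING PigStorage('|') AS
    (ORDER_ID:int, PRODUCT_ID:int, CUSTOMER_ID:int, SALESPERSON_ID:int,
     UNIT_PRICE:float, QUANTITY:int, DISCOUNT:float);

-- Value of each order line: unit price times quantity.
ordervalue = FOREACH orderdetails GENERATE ORDER_ID, UNIT_PRICE * QUANTITY AS VALUE;

-- Keep only order IDs 10270 through 10279.
order_filter = FILTER ordervalue BY (ORDER_ID > 10269) AND (ORDER_ID < 10280);

-- One tuple per ORDER_ID: the key lands in `group`, the rows in a bag `order_filter`.
groupOrder = GROUP order_filter BY ORDER_ID;

-- Added for inspection: prints the schema so the `group` field and the bag are visible.
DESCRIBE groupOrder;

-- Project the key itself instead of the bag of repeated IDs, and sum the line values.
groupOrdersum = FOREACH groupOrder GENERATE
    group AS ORDERID,
    SUM(order_filter.VALUE) AS ORDERVALUE;

DUMP groupOrdersum;
```

The sums are computed exactly as before, so with the data from the question each output tuple should take the form (10270,1376.0) instead of ({(10270),(10270)},1376.0).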
