Pig - Map against a master table and retrieve two columns?

Question (votes: 0, answers: 1)

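I am experimenting with Pig on the openflights dataset (https://openflights.org/data.html). I am currently trying to map a query result that contains every unique possible flight route, i.e. the following table: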

+---------------+-------------+
| Start_Airport | End_Airport |
+---------------+-------------+
| YYZ           | NYC         |
| YBG           | YVR         |
| AEY           | GOH         |
+---------------+-------------+ 

I then want to join both values against a master table that holds the latitude and longitude of every airport, i.e.

+---------+----------+-----------+
| Airport | Latitude | Longitude |
+---------+----------+-----------+
| YYZ     |    -10.3 |      1.23 |
| YBG     |    -40.3 |      50.4 |
| AEY     |     30.3 |      30.3 |
+---------+----------+-----------+

How would I go about this? Essentially, I am trying to produce a final table that looks like this:

+----------------+----------+-----------+-------------+----------+-----------+
| Start_Airport  | Latitude | Longitude | End_Airport | Latitude | Longitude |
+----------------+----------+-----------+-------------+----------+-----------+
| YYZ            |    -10.3 |      1.23 | NYC         | blah     | blah      |
| YBG            |    -40.3 |      50.4 | YVR         | blah     | blah      |
| AEY            |     30.3 |      30.3 | GOH         | blah     | blah      |
+----------------+----------+-----------+-------------+----------+-----------+

I am currently trying the following, where c is the first table:

route_data = JOIN c by (start_airport, end_airport), airports_all by ($0, $0);

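My understanding is that this tells the query to join start_airport and end_airport against the matching airport code and then pull in the corresponding latitude and longitude.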

hadoop apache-pig
1 Answer
0 votes

route_data = JOIN c by (start_airport, end_airport), airports_all by ($0, $0);

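This behaves like the AND condition in a typical SQL join. Consider the following query and ask whether it would give the result you want:

select * from c t1 join airports_all t2 on t1.start_airport = t2.first_field and t1.end_airport = t2.first_field;

It returns a row only when start_airport and end_airport are the same airport, so it does not produce the table you are after.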
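To make the effect of the composite key concrete, here is a minimal sketch of a roughly equivalent formulation. The relation names routes and airports, the field names, and the relative file paths are assumptions for illustration and are not part of the original script.

```pig
-- Assumed inputs: comma-separated files in the working directory.
routes   = LOAD 'routes.txt' USING PigStorage(',')
           AS (start_airport:chararray, end_airport:chararray);
airports = LOAD 'airports_all.txt' USING PigStorage(',')
           AS (airport:chararray, latitude:double, longitude:double);

-- Matching both route columns against the same airport column can only
-- succeed when the two codes are identical; this filter spells that out.
same_code = FILTER routes BY start_airport == end_airport;
matched   = JOIN same_code BY start_airport, airports BY airport;
DUMP matched;
```

None of the three routes in the question starts and ends at the same airport, so for that data this should produce no rows.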

What you want can be achieved as follows:

cat > routes.txt
YYZ,NYC
YBG,YVR
AEY,GOH

cat > airports_all.txt
YYZ,-10.3,1.23
YBG,-40.3,50.4
AEY,30.3,30.3

Pig code:

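-- Load the routes and the airport master table.
tab1 = load '/home/ec2-user/routes.txt' using PigStorage(',') as (start_airport,end_airport);
describe tab1;
tab2 = load '/home/ec2-user/airports_all.txt' using PigStorage(',') as (Airport,Latitude,Longitude);
describe tab2;
-- First join: look up the coordinates of the start airport.
-- tab3 columns: $0 start_airport, $1 end_airport, $2 Airport, $3 Latitude, $4 Longitude
tab3 = JOIN tab1 by (start_airport), tab2 by (Airport);
describe tab3;
tab4 = foreach tab3 generate $0 as start_airport, $3 as start_Latitude, $4 as start_Longitude, $1 as end_airport;
describe tab4;
-- Second join: look up the coordinates of the end airport.
-- tab5 columns: $0..$3 from tab4, then $4 Airport, $5 Latitude, $6 Longitude
tab5 = JOIN tab4 by (end_airport), tab2 by (Airport);
describe tab5;
tab6 = foreach tab5 generate $0 as start_airport, $1 as start_Latitude, $2 as start_Longitude, $3 as end_airport, $5 as end_Latitude, $6 as end_Longitude;
describe tab6;
dump tab6;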
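One caveat: the joins in the script above are inner joins, so a route is dropped whenever one of its airports is missing from the master table. In the sample files, NYC, YVR and GOH do not appear in airports_all.txt, so the second join would leave nothing to dump. The following is a sketch of a variant that keeps such routes by using LEFT OUTER joins, and that refers to fields by name with the :: disambiguation operator instead of by position. The relation names routes, airports, with_start, start_cols, with_end and result, and the declared field types, are assumptions for illustration.

```pig
-- Same input files as in the script above; adjust the paths as needed.
routes   = LOAD '/home/ec2-user/routes.txt' USING PigStorage(',')
           AS (start_airport:chararray, end_airport:chararray);
airports = LOAD '/home/ec2-user/airports_all.txt' USING PigStorage(',')
           AS (airport:chararray, latitude:double, longitude:double);

-- First lookup: coordinates of the start airport (null if unknown).
with_start = JOIN routes BY start_airport LEFT OUTER, airports BY airport;
start_cols = FOREACH with_start GENERATE
    routes::start_airport AS start_airport,
    airports::latitude    AS start_latitude,
    airports::longitude   AS start_longitude,
    routes::end_airport   AS end_airport;

-- Second lookup: coordinates of the end airport (null if unknown).
with_end = JOIN start_cols BY end_airport LEFT OUTER, airports BY airport;
result = FOREACH with_end GENERATE
    start_cols::start_airport   AS start_airport,
    start_cols::start_latitude  AS start_latitude,
    start_cols::start_longitude AS start_longitude,
    start_cols::end_airport     AS end_airport,
    airports::latitude          AS end_latitude,
    airports::longitude         AS end_longitude;
DUMP result;
```

With the sample files, each of the three routes should be kept, with the start coordinates filled in and the end coordinates left null. If routes with unknown airports should be excluded instead, the inner joins of the original script are the simpler choice.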