Hadoop - Hive - Impala - Rewriting a query for performance

Question

I have 2 tables with the columns below.

table1

col1   col2   col3     val
11     221    38       10
null   90     null     989
78     90     null     77

table2

col1   col2   col3  
12     221    78
23     null   67 
78     90     null

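I want to join these two tables on col1 first and stop if the values match. If they do not match, join on col2 and stop if that matches; otherwise join on col3. If any of the columns matches, populate val, otherwise leave it null, and put the name of whichever column matched into the matchingcol column. So the output should look like this: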

col1   col2   col3     val     matchingcol
11     221    38       10      col2
null   90     null     null    null
78     90     null     77      col1

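I was able to do this with the query below, but performance is very slow. Please let me know if there is a better way to write it to get faster performance.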

select *
from table1 t1 left join
     table2 t2_1
     on t2_1.col1 = t1.col1 left join
     table2 t2_2
     on t2_2.col2 = t1.col2 and t2_1.col1 is null
     left join table2 t2_3 on t2_3.col3 = t1.col3 and t2_2.col2 is null

PS: I asked the same question earlier but did not get a better answer.

Answer 1

You are describing:

select t1.col1, t1.col2, t1.col3, 
       (case when t2_1.col1 is not null or t2_2.col2 is not null or t2_3.col3 is not null then t1.val end) as val,
       (case when t2_1.col1 is not null then 'col1'
             when t2_2.col2 is not null then 'col2'
             when t2_3.col3 is not null then 'col3'
        end) as matching
from table1 t1 left join
     table2 t2_1
     on t2_1.col1 = t1.col1 left join
     table2 t2_2
     on t2_2.col2 = t1.col2 and t2_1.col1 is null left join
     table2 t2_3
     on t2_3.col3 = t1.col3 and t2_1.col1 is null and t2_2.col2 is null;

This is probably the best approach.
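If it helps, the logic can be sanity-checked away from the cluster by running the same cascading left-join shape against the sample rows from the question. The sketch below is only that: it assumes an in-memory SQLite database as a stand-in for Hive/Impala (so it says nothing about performance), the function name `run_cascading_join` is made up for illustration, and the match checks are written against each join's own key column, with the col3 join also guarded on col1 not having matched.

```python
import sqlite3

# Sample rows copied from the question; None stands for SQL NULL.
TABLE1_ROWS = [(11, 221, 38, 10), (None, 90, None, 989), (78, 90, None, 77)]
TABLE2_ROWS = [(12, 221, 78), (23, None, 67), (78, 90, None)]

# Cascading left joins: each later join is only allowed to match
# when every earlier join found nothing.
CASCADING_JOIN = """
select t1.col1, t1.col2, t1.col3,
       case when t2_1.col1 is not null or t2_2.col2 is not null
                 or t2_3.col3 is not null then t1.val end as val,
       case when t2_1.col1 is not null then 'col1'
            when t2_2.col2 is not null then 'col2'
            when t2_3.col3 is not null then 'col3'
       end as matchingcol
from table1 t1
left join table2 t2_1 on t2_1.col1 = t1.col1
left join table2 t2_2 on t2_2.col2 = t1.col2 and t2_1.col1 is null
left join table2 t2_3 on t2_3.col3 = t1.col3
                     and t2_1.col1 is null and t2_2.col2 is null
order by t1.val
"""


def run_cascading_join():
    """Load the sample data into SQLite (an assumption, not Hive) and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table table1 (col1 int, col2 int, col3 int, val int)")
    conn.execute("create table table2 (col1 int, col2 int, col3 int)")
    conn.executemany("insert into table1 values (?, ?, ?, ?)", TABLE1_ROWS)
    conn.executemany("insert into table2 values (?, ?, ?)", TABLE2_ROWS)
    rows = conn.execute(CASCADING_JOIN).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_cascading_join():
        print(row)
```

On this sample data the row (null, 90, null, 989) comes back with matchingcol = 'col2', because 90 does exist in table2.col2. That is not what the expected output in the question shows, so the intended rule for that row may need clarifying before tuning anything.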

Answer 2

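If you rewrite the query as a series of INNER JOINs combined with UNION ALL, and then rank the candidates within each (col1, ..., colN) partition, you might get better performance (at the cost of using extra resources). Something like: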

select x.col1, x.col2, x.col3, x.val, x.matchingcol
from (
  select col1, col2, col3, val, matchingcol,
         row_number() over (partition by col1, col2, col3 order by preference) bestmatch
  from (
    select t1.col1 col1, t1.col2 col2, t1.col3 col3, t1.val val, 
           'col1' matchingcol, 1 preference
    from table1 t1 inner join table2 t2
    on t1.col1 = t2.col1
    union all
    select t1.col1 col1, t1.col2 col2, t1.col3 col3, t1.val val, 
           'col2' matchingcol, 2 preference
    from table1 t1 inner join table2 t2
    on t1.col2 = t2.col2
    union all
    select t1.col1 col1, t1.col2 col2, t1.col3 col3, t1.val val, 
           'col3' matchingcol, 3 preference
    from table1 t1 inner join table2 t2
    on t1.col3 = t2.col3
    union all
    select t1.col1 col1, t1.col2 col2, t1.col3 col3, cast(null as int) val, 
           cast(null as string) matchingcol, 4 preference
    from table1 t1 
  ) q
) x
where x.bestmatch = 1

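I think it may do better because all branches of the UNION execute in parallel, and a single final shuffle should beat the multiple sequential shuffles that your original query incurs. Of course, other factors can affect the final result, such as resource availability, data volume, data shape, storage format, and so on.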
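To check that the UNION ALL plus row_number() rewrite picks the right match per row, it can be run against the sample data from the question. This is a rough sketch under stated assumptions: it uses an in-memory SQLite database (version 3.25 or later, for window functions) rather than Hive or Impala, the function name `run_union_ranked` is hypothetical, and it only checks results, not the shuffle behaviour discussed above.

```python
import sqlite3

# Sample rows copied from the question; None stands for SQL NULL.
TABLE1_ROWS = [(11, 221, 38, 10), (None, 90, None, 989), (78, 90, None, 77)]
TABLE2_ROWS = [(12, 221, 78), (23, None, 67), (78, 90, None)]

# One inner join per candidate column, plus a "no match" fallback branch;
# row_number() then keeps the lowest-preference candidate per table1 key.
UNION_RANKED = """
select x.col1, x.col2, x.col3, x.val, x.matchingcol
from (
  select col1, col2, col3, val, matchingcol,
         row_number() over (partition by col1, col2, col3
                            order by preference) bestmatch
  from (
    select t1.col1 col1, t1.col2 col2, t1.col3 col3, t1.val val,
           'col1' matchingcol, 1 preference
    from table1 t1 inner join table2 t2 on t1.col1 = t2.col1
    union all
    select t1.col1, t1.col2, t1.col3, t1.val, 'col2', 2
    from table1 t1 inner join table2 t2 on t1.col2 = t2.col2
    union all
    select t1.col1, t1.col2, t1.col3, t1.val, 'col3', 3
    from table1 t1 inner join table2 t2 on t1.col3 = t2.col3
    union all
    select t1.col1, t1.col2, t1.col3, cast(null as int), cast(null as text), 4
    from table1 t1
  ) q
) x
where x.bestmatch = 1
order by x.val
"""


def run_union_ranked():
    """Load the sample data into SQLite (an assumption, not Hive) and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table table1 (col1 int, col2 int, col3 int, val int)")
    conn.execute("create table table2 (col1 int, col2 int, col3 int)")
    conn.executemany("insert into table1 values (?, ?, ?, ?)", TABLE1_ROWS)
    conn.executemany("insert into table2 values (?, ?, ?)", TABLE2_ROWS)
    rows = conn.execute(UNION_RANKED).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_union_ranked():
        print(row)
```

One thing to keep in mind with this shape: because the ranking partitions by (col1, col2, col3) and keeps only bestmatch = 1, duplicate rows in table1 collapse into one, and a table1 row that matches several table2 rows still yields a single output row.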
