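Production Hadoop query takes a lot of time

Question

Current state

We have a query that runs for more than 2 hours. Checking its progress shows that it spends a lot of time in the join with table T5 and in the final stage of the query. Is there any way to simplify this query? I cannot use an aggregate function in place of rank(), because the order by used is somewhat complex.

What we have already tried

We converted the subqueries into case statements in the select clause, which reduced the execution time, but not significantly. We also simplified the correlated queries for T3, T4 and T6.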

SELECT * FROM 
        (SELECT T2.f1, T2.f2 .... T5.f19, T5.f20, 
                   case when T1.trxn_id is null then T2.crt_ts
                        when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
                        when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts
                    end as crt_ts , 
                    row_number() over ( partition by T2.w_trxn_id,
                                            if(T1.trxn_id is null, 'NULL', T1.trxn_id)
                                            order by T2.business_effective_ts desc,
                                            case when T1.trxn_id is null then T2.crt_ts
                                            when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
                                            when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts
                                            when T1.trxn_id is not null and T5.acct_trxn_id is null then T2.crt_ts end desc
                                        ) as rnk
                FROM(SELECT * FROM T3 WHERE title_name = 'CAPTURE' and tr_dt IN (SELECT tr_dt FROM DT_LKP))
                T2
                LEFT JOIN (SELECT * FROM T6 WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)) 
                T1 ON T2.w_trxn_id = T1.w_trxn_id AND T2.business_effective_ts = T1.business_effective_ts
                LEFT JOIN (SELECT f1, f3. ... f20 FROM T4 WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)) 
                T5 ON T1.trxn_id = T5.acct_trxn_id
                WHERE if(T1.trxn_id is null, 'NULL', T1.trxn_id) = if(T5.acct_trxn_id is null, 'NULL', T5.acct_trxn_id)
        ) FNL WHERE rnk = 1

Answer 1

Not sure how much this will help you. There is a rather strange WHERE clause:

WHERE if(T1.trxn_id is null, 'NULL', T1.trxn_id) = if(T5.acct_trxn_id is null, 'NULL', T5.acct_trxn_id)

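This is probably intended to join NULLs as well as normal values. If so, it does not work: the join condition is T5 ON T1.trxn_id = T5.acct_trxn_id, which means NULLs are not joined, and the WHERE is then applied as a filter after the join.

If T5 is not joined, T5.acct_trxn_id is converted to the string 'NULL' in the WHERE and compared with a NOT NULL T1.trxn_id value, so the row is most probably filtered out; in that case the join behaves like an INNER JOIN. If T1.trxn_id (the driving table) is NULL, it is converted to the string 'NULL' and compared with what is always the string 'NULL' (because T5 is not joined anyway according to the ON clause), so such a row passes (I have not tested this, though). The logic looks strange, and I think it either does not work as intended or turns the join into an INNER JOIN. If you want to join everything including NULLs, move this WHERE condition into the JOIN ON clause.

Joining on NULLs replaced with the string 'NULL' will multiply rows and cause duplicates if many rows have NULL.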
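
The filtering behaviour described above can be reproduced on a few inline rows. This is a minimal sketch, not the original query: the CTEs t1 and t5 and their sample values are made up for illustration, and it assumes a Hive version that accepts SELECT without FROM and UNION ALL inside a CTE.

```sql
-- Hypothetical sample data: t1 plays the role of the driving side, t5 the joined side.
WITH t1 AS (
  SELECT 'a' AS trxn_id UNION ALL
  SELECT 'b' AS trxn_id UNION ALL
  SELECT CAST(NULL AS STRING) AS trxn_id
),
t5 AS (
  SELECT 'a' AS acct_trxn_id
)
-- Condition in WHERE: evaluated after the LEFT JOIN, so it filters joined rows.
SELECT t1.trxn_id, t5.acct_trxn_id
FROM t1
LEFT JOIN t5 ON t1.trxn_id = t5.acct_trxn_id
WHERE NVL(t1.trxn_id, 'NULL') = NVL(t5.acct_trxn_id, 'NULL');

-- Same condition moved into ON: unmatched t1 rows are kept by the LEFT JOIN.
WITH t1 AS (
  SELECT 'a' AS trxn_id UNION ALL
  SELECT 'b' AS trxn_id UNION ALL
  SELECT CAST(NULL AS STRING) AS trxn_id
),
t5 AS (
  SELECT 'a' AS acct_trxn_id
)
SELECT t1.trxn_id, t5.acct_trxn_id
FROM t1
LEFT JOIN t5 ON NVL(t1.trxn_id, 'NULL') = NVL(t5.acct_trxn_id, 'NULL');
```

With the WHERE version, the row 'b' (not null, no match in t5) should be dropped, while the matched row 'a' and the NULL row should pass. With the ON version, all three t1 rows should be returned.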

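When investigating poor JOIN performance, check two things:

Join keys are not duplicated, or the duplication is expected.
Join keys (and the partition by columns in row_number) are not skewed; see https://stackoverflow.com/a/53333652/2700344 and also https://stackoverflow.com/a/51061613/2700344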
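
Both checks can be approximated with one counting query per join key. The sketch below assumes T4 exposes the acct_trxn_id and tr_dt columns used in the question; the LIMIT of 20 is an arbitrary choice.

```sql
-- Keys that occur more than once on the T5 side of the join, largest groups first.
-- A few keys with very large counts indicate skew; many keys with cnt > 1 indicate duplication.
SELECT acct_trxn_id, COUNT(*) AS cnt
FROM T4
WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)
GROUP BY acct_trxn_id
HAVING COUNT(*) > 1
ORDER BY cnt DESC
LIMIT 20;
```

The same pattern can be applied to T6.trxn_id and to the row_number partition columns (T2.w_trxn_id, T1.trxn_id). A large group of NULL keys is worth looking for in particular.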

If everything looks fine, tune reducer parallelism: decrease hive.exec.reducers.bytes.per.reducer so that more reducers run.
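
As a sketch, the setting can be changed for the session before running the query. The value below (64 MB) is only an illustrative assumption; a suitable value depends on the data volume and the cluster.

```sql
-- Fewer bytes per reducer means Hive plans more reducers for the same input size.
SET hive.exec.reducers.bytes.per.reducer=67108864;
```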

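Also reduce DT_LKP as much as possible, even if you only know that it contains some dates that definitely are not, or should not be, in the actual tables; filter it with a CTE if possible.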
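
One way to do this is sketched below. The CTE name dt_filtered and the date bound are hypothetical, and the comparison assumes tr_dt is stored as a 'yyyy-MM-dd' string.

```sql
-- Narrow the lookup once, then reuse the smaller set in each table filter.
WITH dt_filtered AS (
  SELECT DISTINCT tr_dt
  FROM DT_LKP
  WHERE tr_dt >= '2019-01-01'   -- hypothetical bound; use whatever is known about the data
)
SELECT *
FROM T3
WHERE title_name = 'CAPTURE'
  AND tr_dt IN (SELECT tr_dt FROM dt_filtered);
```

The same dt_filtered reference would replace DT_LKP in the T6 and T4 subqueries of the full query.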

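The logic can also be simplified (this will not improve performance, but it simplifies the code). In the CASE in the select, these two branches: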

when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts

are equivalent to

else greatest(T2.crt_ts, T5.crt_ts)

If T5.crt_ts is null, the case statement returns null, and greatest() also returns null.
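
The equivalence can be checked on a few inline rows. This is a sketch with made-up timestamps held as strings; the NULL behaviour of greatest() has changed between Hive versions, so the last row is worth verifying on the version in use.

```sql
-- Hypothetical sample rows: T2 later, T5 later, and T5 missing.
WITH s AS (
  SELECT '2019-01-02 00:00:00' AS t2_crt_ts, '2019-01-01 00:00:00' AS t5_crt_ts UNION ALL
  SELECT '2019-01-01 00:00:00' AS t2_crt_ts, '2019-01-03 00:00:00' AS t5_crt_ts UNION ALL
  SELECT '2019-01-01 00:00:00' AS t2_crt_ts, CAST(NULL AS STRING)  AS t5_crt_ts
)
SELECT t2_crt_ts,
       t5_crt_ts,
       -- The two original branches, without the null checks on the ids.
       CASE WHEN t2_crt_ts >= t5_crt_ts THEN t2_crt_ts
            WHEN t2_crt_ts <  t5_crt_ts THEN t5_crt_ts
       END AS via_case,
       -- The proposed replacement.
       greatest(t2_crt_ts, t5_crt_ts) AS via_greatest
FROM s;
```

If the two computed columns agree on every row, including the row where t5_crt_ts is NULL, the replacement is safe for the select list.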

The CASE statement in row_number simplifies to:

case when (T1.trxn_id is null) or (T5.acct_trxn_id is null) then T2.crt_ts
     else greatest(T2.crt_ts, T5.crt_ts)
 end

Likewise, if(T1.trxn_id is null, 'NULL', T1.trxn_id) is equivalent to NVL(T1.trxn_id,'NULL')

Of course these are just suggestions; I have not tested them.
