Executing a Hive query with IN clause parameters in parallel

Question

I have a Hive query like the one below:

select a.x as column from table1 a where a.y in (<long comma-separated list of parameters>)
union all
select b.x as column from table2 b where b.y in (<long comma-separated list of parameters>)

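I have set hive.exec.parallel to true, which lets the two queries on either side of the union all run in parallel.

However, my IN clause has many comma-separated values, and each value is processed in its own job, one after the other. In effect, the values are handled sequentially.

Is there a Hive parameter that, when enabled, would fetch the data for the IN clause parameters in parallel?

My current workaround is to run the select query several times with = instead of once with an IN clause.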

Answer 1

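There is no need to read the same data many times in separate queries to get better parallelism. Tune the mapper and reducer parallelism of the single query instead.

First, enable PPD (predicate pushdown), vectorization, CBO, and Tez: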

SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.execution.engine=tez;
SET hive.stats.fetch.column.stats=true;
SET hive.tez.auto.reducer.parallelism=true;

Example settings for mappers on Tez:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set tez.grouping.max-size=32000000;
set tez.grouping.min-size=32000;
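
As a rough illustration of how the grouping limits bound the number of Tez mappers, here is a minimal sketch. The 3.2 GB table size and the IN values are hypothetical; the table and column names come from the question, with the alias renamed to x_value for the example. Tez picks the final group size somewhere between the two limits, so for splittable input these figures are bounds rather than exact counts.

```sql
-- Assumption: table1 holds about 3.2 GB (3,200,000,000 bytes) of splittable input.
-- Upper limit on a split group: 32,000,000 bytes
--   => at least 3,200,000,000 / 32,000,000 = 100 mappers for this table.
-- Lower limit on a split group: 32,000 bytes
--   => at most 3,200,000,000 / 32,000 = 100,000 mappers.
SET hive.execution.engine=tez;
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET tez.grouping.max-size=32000000;
SET tez.grouping.min-size=32000;

-- The IN list is a row filter: every mapper checks its rows against
-- the whole list during a single scan of its split.
SELECT a.x AS x_value FROM table1 a WHERE a.y IN ('v1', 'v2', 'v3')
UNION ALL
SELECT b.x AS x_value FROM table2 b WHERE b.y IN ('v1', 'v2', 'v3');
```

Lowering tez.grouping.max-size raises the minimum mapper count, which is usually the first knob to try for a filter-only query like this one.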

Example settings for mappers if you decide to run on MR instead of Tez:

set mapreduce.input.fileinputformat.split.minsize=32000; 
set mapreduce.input.fileinputformat.split.maxsize=32000000;

Example settings for reducers:

set hive.exec.reducers.bytes.per.reducer=32000000; --decrease this to increase the number of reducers, increase to reduce parallelism
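
To make the effect of this setting concrete, here is a rough worked estimate. The 3.2 GB input figure is hypothetical, and the result is only Hive's initial estimate: it is capped by hive.exec.reducers.max and may be adjusted at run time when hive.tez.auto.reducer.parallelism is enabled. A query that only filters and unions, like the one in the question, typically has no reduce stage, so this matters for queries that aggregate, join, or sort.

```sql
-- Assumption: Hive estimates about 3,200,000,000 bytes of input for the stage.
-- 3,200,000,000 / 32,000,000 = 100 reducers (initial estimate)
SET hive.exec.reducers.bytes.per.reducer=32000000;

-- Halving the value roughly doubles the estimate:
-- 3,200,000,000 / 16,000,000 = 200 reducers
SET hive.exec.reducers.bytes.per.reducer=16000000;
```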

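Experiment with these settings. The success criterion is that you get more mappers/reducers and that your map and reduce stages run faster.

Read this article for a better understanding of how to tune Tez: https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html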
