Cannot connect to Hive using HiveOperator in an Airflow 2.1.2 DAG

Question · 0 votes · 2 answers

I have been struggling to run a Hive query from a HiveOperator task. Hive and Airflow are installed in Docker containers; I can query Hive tables from Python code inside the Airflow container, and also successfully through the Hive CLI. But when I run the Airflow DAG, I get an error saying the hive/beeline file cannot be found.

DAG:

import airflow
from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator

dag_hive = DAG(
    dag_id="hive_script",
    schedule_interval="* * * * *",
    start_date=airflow.utils.dates.days_ago(1),
)

hql_query = """
CREATE TABLE IF NOT EXISTS mydb.test_af(
`test` int);
insert into mydb.test_af values (1);
"""

hive_task = HiveOperator(
    hql=hql_query,
    task_id="hive_script_task",
    hive_cli_conn_id="hive_local",
    dag=dag_hive,
)

if __name__ == "__main__":
    dag_hive.cli()

Log:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
    result = task_copy.execute(context=context)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/apache/hive/operators/hive.py", line 156, in execute
    self.hook.run_cli(hql=self.hql, schema=self.schema, hive_conf=self.hiveconfs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/apache/hive/hooks/hive.py", line 249, in run_cli
    hive_cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=tmp_dir, close_fds=True
  File "/usr/local/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'beeline': 'beeline'
[2021-08-19 12:22:04,291] {taskinstance.py:1551} INFO - Marking task as FAILED. dag_id=***_script, task_id=***_script_task, execution_date=20210819T122100, start_date=20210819T122204, end_date=20210819T122204
[2021-08-19 12:22:04,323] {local_task_job.py:149} INFO - Task exited with return code 1
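The `FileNotFoundError` above means the hook shelled out to a `beeline` binary that is not on the container's PATH. A quick way to confirm this (a minimal sketch using only the standard library, nothing Airflow-specific):

```python
import shutil

# shutil.which returns the absolute path of an executable found on PATH,
# or None. If this prints None inside the Airflow worker container, the
# HiveOperator's subprocess call to `beeline` fails exactly as in the
# traceback above.
print(shutil.which("beeline"))
```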

It would be great if someone could help me. Thanks in advance...

python docker hadoop hive airflow
2 Answers

1 vote

You need to install beeline in your Apache Airflow image. It depends on which Airflow image you use, but Airflow's "reference" image only includes the most common providers, and Hive is not among them. You should extend or customize the image to add beeline so that it is available on the PATH inside the Airflow image.

You can read more about extending/customizing the Airflow image at https://airflow.apache.org/docs/docker-stack/build.html#adding-new-apt-package
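A quick way to check whether the image actually ships the Hive provider (a minimal sketch using only the standard library; the module name is the provider's import path):

```python
import importlib.util

def module_available(name: str) -> bool:
    """True if the named module/package can be found in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # A parent package (e.g. airflow itself) is missing entirely.
        return False

# The reference apache/airflow image ships only the most common providers,
# so this prints False unless the Hive provider was added to the image.
print(module_available("airflow.providers.apache.hive"))
```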

0 votes

Here is my Dockerfile, based on the reply here and others. You need to download the Hadoop and Hive libraries, unpack them, and update the Dockerfile below with the correct versions. Download the "bin.tar.gz" files, not the "src.tar.gz" files, from:

https://hadoop.apache.org/releases.html
https://hive.apache.org/general/downloads/

Unpack them with:

tar -xvzf <filename>

After building the image and starting Airflow, you should be able to connect to your Hive instance. I had to add this to the "Extra" box, but that depends on your Hive installation: {"auth_mechanism": "CUSTOM"}

FROM apache/airflow:2.6.2

# Install OpenJDK-11
USER root

RUN apt-get update && \
  apt-get install -y openjdk-11-jdk && \
  apt-get install -y ant && \
  apt-get clean

# Need to install these packages for the "pip install" of the Hive provider below
RUN apt-get install -y --no-install-recommends g++
RUN apt-get install -y --no-install-recommends libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit

USER airflow

# Set JAVA_HOME -- useful for the docker command line
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

# Hadoop
COPY hadoop-3.3.6 /hadoop-3.3.6
ENV HADOOP_HOME=/hadoop-3.3.6

# Hive
COPY apache-hive-3.1.3-bin /apache-hive-3.1.3-bin

# beeline must be on the PATH, or the HiveOperator's CLI hook fails with
# the FileNotFoundError from the question
ENV PATH="${PATH}:/apache-hive-3.1.3-bin/bin:/hadoop-3.3.6/bin"

# Upgrade pip, as we were getting errors running pip install
RUN python -m pip install --upgrade pip --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org

# AIRFLOW_VERSION is set by the base image; pinning it keeps pip from
# up- or downgrading Airflow while installing the provider
RUN pip install "apache-airflow==${AIRFLOW_VERSION}" --no-cache-dir --progress-bar off apache-airflow-providers-apache-hive

# Run this to confirm the files we copied exist
RUN ls -l /

# Run this to confirm env vars are set correctly
RUN export

# Confirm java works
RUN java -XshowSettings:properties -version 2>&1

# Confirm beeline works
RUN /apache-hive-3.1.3-bin/bin/beeline --version 2>&1
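For completeness, the `hive_local` connection from the question resolves to a HiveServer2 JDBC URL of roughly the following shape. This is a sketch: host, port, and schema are illustrative placeholders, not values from the original setup; the "Extra" JSON is the value mentioned above.

```python
import json

# Placeholder values -- substitute your own HiveServer2 host/port/schema.
host, port, schema = "hiveserver2", 10000, "mydb"
jdbc_url = f"jdbc:hive2://{host}:{port}/{schema}"

# The contents of the connection's "Extra" box, as JSON:
extra = json.dumps({"auth_mechanism": "CUSTOM"})

print(jdbc_url)  # jdbc:hive2://hiveserver2:10000/mydb
print(extra)
```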