I'm running into a problem when trying to run a HiveOperator task in Apache Airflow with the Kubernetes Executor.
I have a Dockerfile in which I install the necessary dependencies, including apache-airflow-providers-apache-hive==6.4.1:
Dockerfile
FROM apache/airflow:2.8.2
COPY requirements.txt /
RUN pip install --no-cache-dir "apache-airflow==${AIRFLOW_VERSION}" -r /requirements.txt
RUN umask 0002; \
mkdir -p /tmp
In my Airflow task, I define a HiveOperator as follows:
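from datetime import timedelta

from airflow.providers.apache.hive.operators.hive import HiveOperator

hive_select = HiveOperator(
    task_id='hive_select',
    hive_cli_conn_id='hive_conn',
    hql="select * from table LIMIT 10",
    execution_timeout=timedelta(minutes=30),
)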
However, when I try to execute this task, I get the following error:
[2024-02-29, 09:05:15 UTC] {hive.py:275} INFO - hive -hiveconf airflow.ctx.dag_id=hive_con_test -hiveconf airflow.ctx.task_id=hive_select -hiveconf airflow.ctx.execution_date=2024-02-29T09:02:14.534414+00:00 -hiveconf airflow.ctx.try_number=4 -hiveconf airflow.ctx.dag_run_id=manual__2024-02-29T09:02:14.534414+00:00 -hiveconf airflow.ctx.dag_owner=airflow -hiveconf airflow.ctx.dag_email= -hiveconf mapred.job.name=Airflow HiveOperator task for hive-con-test-hive-select-g7oo5n2s.hive_con_test.hive_select.2024-02-29T09:02:14.534414+00:00 -f /tmp/airflow_hiveop_wg4va81k/tmpwyi3r0ow
[2024-02-29, 09:05:15 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
return execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/apache/hive/operators/hive.py", line 172, in execute
self.hook.run_cli(hql=self.hql, schema=self.schema, hive_conf=self.hiveconfs)
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/apache/hive/hooks/hive.py", line 276, in run_cli
sub_process: Any = subprocess.Popen(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/local/lib/python3.11/subprocess.py", line 1953, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'hive'
[2024-02-29, 09:05:15 UTC] {taskinstance.py:1149} INFO - Marking task as FAILED. dag_id=hive_con_test, task_id=hive_select, execution_date=20240229T090214, start_date=20240229T090514, end_date=20240229T090515
[2024-02-29, 09:05:15 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 214 for task hive_select ([Errno 13] Permission denied: 'hive'; 28)
[2024-02-29, 09:05:15 UTC] {local_task_job_runner.py:234} INFO - Task exited with return code 1
[2024-02-29, 09:05:15 UTC] {taskinstance.py:3309} INFO - 0 downstream tasks scheduled from follow-on schedule check
PermissionError: [Errno 13] Permission denied: 'hive'
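It seems to be a permission issue related to executing the Hive CLI. I tried setting permissions on the /tmp/ folder, as some resources suggested, but I'm not sure I did it correctly.
Any insight into how to resolve this permission issue and run the HiveOperator task successfully would be greatly appreciated.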
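For anyone debugging the same error: the traceback shows the hook launching the bare command `hive` through subprocess.Popen, so the failure occurs when the container tries to start that binary, which may be unrelated to the permissions on /tmp. Below is a minimal diagnostic sketch, assuming it is run with the same image and user as the task pod. The helper name diagnose_cli is hypothetical and not part of Airflow; it only reports whether a file named hive is on PATH and whether the current user may execute it.

```python
import os
import shutil


def diagnose_cli(name="hive", path=None):
    """Report whether `name` can be found and executed via PATH.

    Hypothetical helper for illustration; not part of Airflow.
    """
    # Fall back to the process PATH when no explicit search path is given.
    path = os.environ.get("PATH", "") if path is None else path

    # shutil.which only returns entries that exist AND are executable.
    resolved = shutil.which(name, path=path)

    # Also list files with that name that exist but may lack the execute bit.
    candidates = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            candidates.append((candidate, os.access(candidate, os.X_OK)))

    return {"resolved": resolved, "candidates": candidates}


if __name__ == "__main__":
    print(diagnose_cli("hive"))
```

An empty candidates list suggests that no file named hive is visible on PATH, in which case the image probably lacks a Hive client altogether. A candidate flagged False suggests the file exists but the current user has no execute permission on it.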