What is the difference between the spark-py image created with docker-image-tool.sh and the image on Docker Hub?


Spark publishes spark-py Docker images on Docker Hub:

https://hub.docker.com/r/apache/spark-py/tags

But the Running on Kubernetes documentation says you need to build the image yourself with the Docker image tool:

https://spark.apache.org/docs/latest/running-on-kubernetes.html

./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

Is there any difference between the image created with the Docker image tool (assuming a fresh installation) and the Docker image on Docker Hub?

Tags: apache-spark, kubernetes, pyspark, docker-image
1 Answer

I had the same question and did some research.

Here is a copy of kubernetes/dockerfiles/spark/bindings/python/Dockerfile as of today (March 20, 2023):

ARG base_img

FROM $base_img
WORKDIR /

# Reset to root to run installation tasks
USER 0

RUN mkdir ${SPARK_HOME}/python
RUN apt-get update && \
    apt install -y python3 python3-pip && \
    pip3 install --upgrade pip setuptools && \
    # Removed the .cache to save space
    rm -rf /root/.cache && rm -rf /var/cache/apt/* && rm -rf /var/lib/apt/lists/*

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
ARG spark_uid=185
USER ${spark_uid}

base_img points to kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.
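When you run docker-image-tool.sh, it first builds that base JVM image and then builds the Python binding image on top of it, passing the base tag in through the base_img build argument. A rough sketch of the equivalent docker build commands, run from the root of an unpacked Spark distribution (the repo name and tag are placeholders):

docker build -t <repo>/spark:my-tag \
    -f kubernetes/dockerfiles/spark/Dockerfile .

docker build -t <repo>/spark-py:my-tag \
    --build-arg base_img=<repo>/spark:my-tag \
    -f kubernetes/dockerfiles/spark/bindings/python/Dockerfile .

Here is that base Dockerfile: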

ARG java_image_tag=17-jre

FROM eclipse-temurin:${java_image_tag}

ARG spark_uid=185

# Before building the docker image, first build and make a Spark distribution following
# the instructions in https://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    apt-get update && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/* && rm -rf /var/lib/apt/lists/*

COPY jars /opt/spark/jars
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY kubernetes/dockerfiles/spark/decom.sh /opt/
COPY examples /opt/spark/examples
COPY kubernetes/tests /opt/spark/tests
COPY data /opt/spark/data

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
RUN chmod a+x /opt/decom.sh

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
USER ${spark_uid}

Next, compare this with the layers of the current apache/spark-py:latest image as of today (March 20, 2023), which you can inspect on the Docker Hub tag page linked above.
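If you prefer to inspect the published image locally instead of on the Docker Hub page, one way (assuming Docker is installed) is to pull it and list its layers:

docker pull apache/spark-py:latest
docker history --no-trunc apache/spark-py:latest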

The first thing you will notice is that the custom build uses Java 17, while the official Docker image uses Java 11.
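A quick way to verify the Java version in each image is to override the entrypoint and run java -version directly; my-repo/spark-py:my-tag below is just a placeholder for your custom tag:

docker run --rm --entrypoint java apache/spark-py:latest -version
docker run --rm --entrypoint java my-repo/spark-py:my-tag -version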

There may be more differences. If you find any, feel free to edit this answer!

Overall, the custom build gives us more freedom. For example, if we don't need the examples, we can remove this line:

COPY examples /opt/spark/examples
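The same applies to build arguments. Since the base Dockerfile exposes java_image_tag as an ARG, you could for instance build against a different Java base image (a sketch; run ./bin/docker-image-tool.sh -h to confirm which flags your Spark version supports):

./bin/docker-image-tool.sh -r <repo> -t my-tag \
    -b java_image_tag=11-jre \
    -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build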
