How do I use a custom Python version as a new kernel in JupyterLab on Amazon EMR?


I am using Amazon EMR 7.x, which ships with Python 3.9 by default.

I added a custom Python 3.11, built from source following an existing guide.

I run the following as a bootstrap script:

#!/usr/bin/env bash
set -e

PYTHON_VERSION=3.11.7
sudo yum --assumeyes install \
  bzip2-devel \
  expat-devel \
  gcc \
  libffi-devel \
  make \
  systemtap-sdt-devel \
  tar \
  zlib-devel
curl --silent --fail --show-error --location "https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tar.xz" | tar -x -J -v
cd "Python-${PYTHON_VERSION}"
export CFLAGS="-march=native"
./configure \
  --enable-loadable-sqlite-extensions \
  --with-dtrace \
  --with-lto \
  --enable-optimizations \
  --with-system-expat \
  --prefix="/usr/local/python${PYTHON_VERSION}"
sudo make altinstall
sudo "/usr/local/python${PYTHON_VERSION}/bin/python${PYTHON_VERSION%.*}" -m pip install --upgrade pip

echo "# Install my Amazon EMR cluster-scoped dependencies"
sudo curl --silent --fail --show-error --location --remote-name --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.4_2.12/1.5.0/sedona-spark-shaded-3.4_2.12-1.5.0.jar
sudo curl --silent --fail --show-error --location --remote-name --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.5.0-28.2/geotools-wrapper-1.5.0-28.2.jar
"/usr/local/python${PYTHON_VERSION}/bin/python${PYTHON_VERSION%.*}" -m pip install \
  "apache-sedona[spark]==1.5.0"

I have a step that verifies the Python version:

import sys
from pyspark.sql import SparkSession

SparkSession.builder.getOrCreate()
print(sys.version_info)
# sys.version_info(major=3, minor=11, micro=7, releaselevel='final', serial=0)
assert (sys.version_info.major, sys.version_info.minor) == (3, 11)

This step succeeds. If I change the assertion to compare against (3, 9), it fails, so I know the custom Python is really being used there.

When I SSH into the EMR primary node, I can see the folder /usr/local/python3.11.7:

[hadoop@ip-172-31-177-28 ~]$ cd /usr/local
[hadoop@ip-172-31-177-28 local]$ ls
bin  etc  games  include  lib  lib64  libexec  man  python3.11.7  sbin  share  src

However, in JupyterLab, when I select the PySpark kernel, the script below shows that I am using Python 3.9:

import sys
print(sys.version_info)
# sys.version_info(major=3, minor=9, micro=16, releaselevel='final', serial=0)

If I open a terminal in JupyterLab on this EMR cluster, /usr/local does not contain the python3.11.7 folder:

[notebook@ip-10-131-38-159 /]$ cd /usr/local/
[notebook@ip-10-131-38-159 local]$ ls
bin  etc  games  include  lib  lib64  libexec  sbin  share  src

So I suspect that this JupyterLab runs as a Docker service, with its own filesystem separate from the host.
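One way to confirm that suspicion from inside a notebook cell is to print where the running interpreter lives and whether the host-side install directory is visible at all. This is a minimal sketch, not part of the original post: `describe_interpreter` is a hypothetical helper, and the only path it checks is the /usr/local/python3.11.7 folder mentioned above.

```python
import os
import sys


def describe_interpreter(candidate_dirs):
    """Report the running interpreter and which of the given directories exist."""
    return {
        "executable": sys.executable,            # path of the Python binary in use
        "prefix": sys.prefix,                    # installation or environment root
        "version": tuple(sys.version_info[:3]),  # (major, minor, micro)
        "visible_dirs": {path: os.path.isdir(path) for path in candidate_dirs},
    }


if __name__ == "__main__":
    # Path taken from the bootstrap script above; adjust if your prefix differs.
    info = describe_interpreter(["/usr/local/python3.11.7"])
    for key, value in info.items():
        print(f"{key}: {value}")
```

If the directory is reported as missing while it exists on the primary node, the kernel is most likely running against a different filesystem than the one the bootstrap script modified.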

How can I add Python 3.11 to JupyterLab? Thanks!

amazon-web-services amazon-emr jupyter-lab
1 Answer

I found that JupyterLab uses its own, separate Python. I first need to create a new conda environment with Python 3.11 for JupyterLab, and then register it as a new kernel.
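For context, registering a kernel essentially means placing a kernel.json file in a directory that Jupyter searches; the file tells Jupyter which interpreter to launch. The sketch below builds roughly the kind of spec that `python -m ipykernel install` records. It is an illustration under assumptions: `build_kernel_spec` and `write_kernel_spec` are hypothetical helpers, the real command may write additional fields and resources, and the example writes to a temporary directory rather than a real kernels directory.

```python
import json
import tempfile
from pathlib import Path


def build_kernel_spec(python_path, display_name):
    """Build kernel.json contents that launch ipykernel with the given interpreter."""
    return {
        "argv": [python_path, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }


def write_kernel_spec(kernels_dir, name, spec):
    """Write the spec to <kernels_dir>/<name>/kernel.json and return the file path."""
    kernel_dir = Path(kernels_dir) / name
    kernel_dir.mkdir(parents=True, exist_ok=True)
    spec_path = kernel_dir / "kernel.json"
    spec_path.write_text(json.dumps(spec, indent=1))
    return spec_path


if __name__ == "__main__":
    # A throwaway directory, so nothing on the real system is modified.
    with tempfile.TemporaryDirectory() as tmp:
        spec = build_kernel_spec(
            "/emr/notebook-env/envs/python3.11.7/bin/python",  # conda env path from the script
            "Python 3.11.7",
        )
        print(write_kernel_spec(tmp, "python3.11.7", spec).read_text())
```

The first element of `argv` is what decides which Python a kernel runs, which is why the interpreter installed under /usr/local on the host has no effect on the existing PySpark kernel.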

Here is my updated bootstrap script:

#!/usr/bin/env bash
set -e

PYTHON_VERSION=3.11.7
sudo yum --assumeyes install \
  bzip2-devel \
  expat-devel \
  gcc \
  libffi-devel \
  make \
  systemtap-sdt-devel \
  tar \
  zlib-devel
curl --silent --fail --show-error --location "https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tar.xz" | tar -x -J -v
cd "Python-${PYTHON_VERSION}"
export CFLAGS="-march=native"
./configure \
  --enable-loadable-sqlite-extensions \
  --with-dtrace \
  --with-lto \
  --enable-optimizations \
  --with-system-expat \
  --prefix="/usr/local/python${PYTHON_VERSION}"
sudo make altinstall
sudo "/usr/local/python${PYTHON_VERSION}/bin/python${PYTHON_VERSION%.*}" -m pip install --upgrade pip

echo "# Install my Amazon EMR cluster-scoped dependencies"
sudo curl --silent --fail --show-error --location --remote-name --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.4_2.12/1.5.0/sedona-spark-shaded-3.4_2.12-1.5.0.jar
sudo curl --silent --fail --show-error --location --remote-name --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.5.0-28.2/geotools-wrapper-1.5.0-28.2.jar
"/usr/local/python${PYTHON_VERSION}/bin/python${PYTHON_VERSION%.*}" -m pip install \
  "apache-sedona[spark]==1.5.0"

echo "# Install my JupyterLab-scoped dependencies"
sudo /emr/notebook-env/bin/conda create --name="python${PYTHON_VERSION}" python=${PYTHON_VERSION} --yes
sudo "/emr/notebook-env/envs/python${PYTHON_VERSION}/bin/python" -m pip install \
  "apache-sedona[spark]==1.5.0" \
  attrs==23.1.0 \
  descartes==1.1.0 \
  ipykernel==6.28.0 \
  matplotlib==3.8.2 \
  pandas==2.1.4 \
  shapely==2.0.2

echo "# Add JupyterLab kernel"
sudo "/emr/notebook-env/envs/python${PYTHON_VERSION}/bin/python" -m ipykernel install --name="python${PYTHON_VERSION}"

Now I have a new Python 3.11 kernel in JupyterLab.
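To double-check the registration without relying on the launcher UI, you can scan the kernel directories and see which interpreter each kernel points to. This is a rough sketch: `find_kernel_specs` is a hypothetical helper, and the directories listed in the usage part are common Jupyter data locations that I am assuming, not paths confirmed for EMR. Running `jupyter --paths` in the JupyterLab terminal should show the actual search path.

```python
import json
import sys
from pathlib import Path


def find_kernel_specs(search_dirs):
    """Map kernel name to interpreter (argv[0]) for each kernel.json under the given directories."""
    found = {}
    for base in search_dirs:
        base_path = Path(base)
        if not base_path.is_dir():
            continue  # skip locations that do not exist on this machine
        for spec_file in sorted(base_path.glob("*/kernel.json")):
            spec = json.loads(spec_file.read_text())
            argv = spec.get("argv") or [None]
            found[spec_file.parent.name] = argv[0]
    return found


if __name__ == "__main__":
    # Assumed locations; replace them with the "data" entries from `jupyter --paths`.
    candidates = [
        "/usr/local/share/jupyter/kernels",
        "/usr/share/jupyter/kernels",
        str(Path(sys.prefix) / "share" / "jupyter" / "kernels"),
    ]
    for name, interpreter in find_kernel_specs(candidates).items():
        print(f"{name}: {interpreter}")
```

If the new kernel appears with an interpreter under /emr/notebook-env/envs/python3.11.7, selecting it in JupyterLab and running the sys.version_info check from the question should report 3.11.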
