I'm new to Python and PySpark. I have a project where we're using PySpark on Google Colab, and as of today I can't seem to install Spark anymore. I'd really appreciate any help!
I'm using this code:
!wget -q https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
!tar -xzf spark-3.5.0-bin-hadoop3.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
import findspark
findspark.init()
I'm now getting this error:
tar (child): spark-3.5.0-bin-hadoop3.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
158 try:
--> 159 py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
160 except IndexError:
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
1 frames
<ipython-input-20-a05079ca17b7> in <cell line: 12>()
10
11 import findspark
---> 12 findspark.init()
/usr/local/lib/python3.10/dist-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
159 py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
160 except IndexError:
--> 161 raise Exception(
162 "Unable to find py4j in {}, your SPARK_HOME may not be configured correctly".format(
163 spark_python
Exception: Unable to find py4j in /content/spark-3.5.0-bin-hadoop3/python, your SPARK_HOME may not be configured correctly
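The tar error is the symptom, not the cause: `wget -q` suppresses error output, so when the download URL returns a 404 the tarball simply never arrives, and tar then fails on a missing file. A quick existence check before extracting makes the real failure visible (`tarball_ok` is a hypothetical helper, just a sketch):

```python
import os

def tarball_ok(tgz_path):
    # wget -q hides the 404 that occurs once a release is removed from
    # the download mirror, so no file is created; checking before
    # running tar surfaces the real problem instead of the tar error.
    return os.path.exists(tgz_path) and os.path.getsize(tgz_path) > 0
```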
After more research, I found that there is a new release and the URL I had been using suddenly became invalid. Here is the new code that works for me:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
!tar xf spark-3.5.1-bin-hadoop3.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"
import findspark
findspark.init()
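As a sanity check before calling findspark.init(), you can replicate the check it performs (visible in the traceback above): SPARK_HOME must contain a `python/lib/py4j-*.zip`. This is a minimal sketch with a hypothetical helper name, not part of findspark's API:

```python
import os
from glob import glob

def spark_home_ok(spark_home):
    # Mirrors the glob from the traceback: findspark raises
    # "Unable to find py4j" when this pattern matches nothing.
    return bool(glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))
```

If this returns False, the extraction either failed or SPARK_HOME points at the wrong directory (e.g. a version mismatch between the wget URL and the path you set).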