After running the following:
devtools::install_github('apache/[email protected]', subdir='R/pkg', force = TRUE)
library(SparkR)
I ran this to convert my data to a Spark DataFrame:
as.DataFrame(value1)
However, I got the following error:
Error in getSparkSession() : SparkSession not initialized
So I ran this:
sparkR.session()
and got this prompt:
Will you download and install (or reuse if it exists) Spark package under the cache [/home/analytics/.cache/spark]? (y/n):
If I answer n, I get this:
Error in sparkCheckInstall(sparkHome, master, deployMode) :
Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
But if I answer y, I get this long message:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): cannot open URL 'https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz': HTTP status was '404 Not Found'
To use backup site...
Downloading spark-3.3.0 for Hadoop 2.7 from:
- http://www-us.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'http://www-us.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): URL 'http://www-us.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz': status was 'Couldn't resolve host name'
- Unable to download from default mirror site: http://www-us.apache.org/dist/spark
Error in robustDownloadTar(mirrorUrl, version, hadoopVersion, packageName, :
Unable to download Spark spark-3.3.0 for Hadoop 2.7. Please check network connection, Hadoop version, or provide other mirror sites.
How do I get rid of this error?
From my understanding, you also need the Spark package installed on your machine.
Spark can be installed using these links: Download Spark 3.3.0, Download Hadoop 3.0.0, and Java OpenJDK 11.0.13 LTS.
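As an alternative to a manual download, SparkR's own install.spark function can be pointed at an explicit mirror and Hadoop build. A minimal sketch, under the assumption that the Apache archive mirror carries a matching build for your Spark version (the hadoopVersion string "3" and the mirror URL are assumptions to verify, not confirmed values; the 404 in the question is likely because no Hadoop 2.7 build exists for spark-3.3.0):

```r
library(SparkR)

# Sketch only: request an explicit Hadoop build from an explicit mirror.
# Both arguments are assumptions -- confirm the matching
# spark-3.3.0-bin-hadoopX.tgz actually exists under the mirror first.
install.spark(hadoopVersion = "3",
              mirrorUrl = "https://archive.apache.org/dist/spark")
```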
Set the system environment variable SPARK_HOME to the directory of the previously downloaded Spark 3.3.0, and similarly set HADOOP_HOME and JAVA_HOME.
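On Linux/macOS, these variables can be set in the shell before launching R. A sketch, using hypothetical install paths (substitute the directories you actually extracted to):

```shell
# Hypothetical paths -- replace with your actual extraction directories.
export SPARK_HOME=/opt/spark-3.3.0-bin-hadoop3
export HADOOP_HOME=/opt/hadoop-3.0.0
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk

# Quick check that the variables are set for child processes such as R.
echo "$SPARK_HOME"
```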
Then run the R script below to load the SparkR library, after updating <spark-lib-path> to the directory where the downloaded Spark archive was extracted:
library(SparkR, lib.loc = .libPaths(c(file.path('<spark-lib-path>', 'R', 'lib'), .libPaths())))
These steps worked for me when I tried them earlier, using Spark 3.1.2 with Hadoop 2.7.4.
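Putting the pieces together, a minimal end-to-end sketch (the <spark-lib-path> placeholder stands for your extracted Spark directory; passing sparkHome explicitly is an alternative to setting the SPARK_HOME variable):

```r
# <spark-lib-path> is a placeholder -- point it at the extracted Spark directory.
library(SparkR, lib.loc = c(file.path('<spark-lib-path>', 'R', 'lib')))

# Start the session against the local install; sparkHome is optional when
# SPARK_HOME is already set in the environment.
sparkR.session(master = "local[*]", sparkHome = '<spark-lib-path>')

# The original conversion should now work, e.g. on a built-in dataset:
df <- as.DataFrame(faithful)
head(df)
```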