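We have an EMR Studio configured with a default S3 bucket and file path, s3://OurBucketName/Subdirectory/work, and we have created a Workspace in it that is attached to an EC2 cluster running emr-6.10.0 with the following applications installed: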
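We can view, read, and write files from the Workspace's (bash) terminal, which appears to hold, under /home/notebook/work, a copy of everything in the S3 bucket at OurBucketName/Subdirectory/work. That said, we cannot read or write files from any console or notebook.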
We have tried a variety of file paths: relative (~ProjectName/data/filename.csv), absolute (/home/notebook/ProjectName/data/filename.csv), S3 (s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv), the share link (https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv), and the download link (https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>).
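The target file definitely exists: it is visible in the file browser on the left, and it can be opened, read, and modified from the terminal or from any script run there.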
offices <- read.csv("~ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")
run from a SparkR console/notebook, returns
[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '~ProjectName/data/filename.csv': No such file or directory
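One detail that may be worth checking in the relative-path attempt: under the usual shell convention, ~ProjectName means "the home directory of a user called ProjectName", whereas ~/ProjectName is a directory under the current user's home. The sketch below illustrates that convention with Python's os.path.expanduser. It is not a claim about what R does internally, and it assumes no OS user named ProjectName exists on the machine.

```python
import os.path

# "~/..." expands relative to the current user's home directory.
with_slash = os.path.expanduser("~/ProjectName/data/filename.csv")

# "~ProjectName/..." asks for the home directory of a user named ProjectName.
# Assumption: no such user exists, so the string comes back unchanged.
without_slash = os.path.expanduser("~ProjectName/data/filename.csv")

print(with_slash)
print(without_slash)
```

If the second form comes back unexpanded, the reader is left looking for a literal directory named ~ProjectName relative to the working directory.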
offices <- read.csv("/home/notebook/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")
run from a SparkR console/notebook, returns
[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '/home/notebook/ProjectName/data/filename.csv': No such file or directory
offices <- read.csv("s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")
run from a SparkR console/notebook, returns
[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file 's3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv': No such file or directory
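The "No such file or directory" message suggests the s3:// string was treated as a local file name rather than as an object-store address. S3-aware readers generally address an object by bucket and key, so a first sanity check is to confirm how the URI splits. This is a minimal sketch; split_s3_uri is a hypothetical helper, and whether the cluster's role can actually read that key is not something the post tells us.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key). Hypothetical helper for illustration."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # netloc is the bucket; the path (minus its leading slash) is the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv"
)
print(bucket)
print(key)
```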
offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")
and
offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>", header = TRUE, sep = ",", quote = "\"",dec = ".")
both run without error but read what appears to be a blank HTML page rather than the CSV, so that
summary(offices)
run from a SparkR console/notebook, returns
X..DOCTYPE.html. Length:28 Class :character Mode :character
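The column name X..DOCTYPE.html. is consistent with a first line of <!DOCTYPE html> having been used as the CSV header, which would mean the URL returned a web page (for example a login or viewer page) instead of the file. A minimal sketch of a guard that could be run on the downloaded text before parsing; looks_like_html is a hypothetical helper and the sample strings are made up for illustration.

```python
def looks_like_html(text):
    """Return True if the payload starts like an HTML document rather than CSV."""
    head = text.lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# Made-up payloads standing in for what a request might return.
html_payload = "<!DOCTYPE html>\n<html><head></head><body></body></html>"
csv_payload = "office,city\nHQ,Seattle\n"

print(looks_like_html(html_payload))
print(looks_like_html(csv_payload))
```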
It looks like the associated (Python, PySpark, Spark, or SparkR) kernel is running in a container on one of the mnt drives, because getwd() from a SparkR console/notebook returns /mnt1/yarn/usercache/livy/appcache/application_1678485106748_0005/container_1678485106748_0005_01_000001
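To compare the two environments directly, a small diagnostic can be run once in the Workspace terminal and once in a notebook cell with a Python kernel. This is only a sketch: describe_runtime is a hypothetical helper, and the candidate paths are the ones quoted in this post. If the host name or working directory differs, or the paths exist in one place but not the other, that would support the idea that the kernel does not see the terminal's file system.

```python
import os
import socket


def describe_runtime(paths):
    """Report where this interpreter runs and which candidate paths it can see."""
    return {
        "hostname": socket.gethostname(),
        "cwd": os.getcwd(),
        "exists": {p: os.path.exists(p) for p in paths},
    }


# Candidate locations taken from the post; adjust as needed.
report = describe_runtime(
    [
        "/home/notebook/work",
        "/home/notebook/ProjectName/data/filename.csv",
    ]
)
for name, value in report.items():
    print(name, value)
```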