Unable to read S3 files from an AWS EMR Studio notebook or console

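We have an EMR Studio set up with a default S3 bucket and file path of s3://OurBucketName/Subdirectory/work, and within it we created a Workspace attached to an EC2 cluster running emr-6.10.0 with the following applications installed: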

  • Hadoop 3.3.3
  • Hive 3.1.3
  • Hue 4.10.0
  • JupyterEnterpriseGateway 2.6.0
  • JupyterHub 1.5.0
  • MXNet 1.9.1
  • Pig 0.17.0
  • Presto 0.278
  • Spark 3.3.1
  • TensorFlow 2.11.0
  • Zeppelin 0.10.1

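We can view, read, and write files from within the Workspace's (bash) terminal, which appears to hold, at /home/notebook/work, a copy of everything in the S3 bucket under OurBucketName/Subdirectory/work. That said, we cannot read or write those files from any console or notebook.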

We have tried several kinds of file path:

  • relative: ~ProjectName/data/filename.csv
  • absolute: /home/notebook/ProjectName/data/filename.csv
  • S3: s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv
  • share link: https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv
  • download link: https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>

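The target file definitely exists: it is visible in the file browser on the left, and it can be opened, read, and modified from within the terminal or by any script executed from it.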

offices <- read.csv("~ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")
run from a SparkR console/notebook returns
[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '~ProjectName/data/filename.csv': No such file or directory

offices <- read.csv("/home/notebook/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")
run from a SparkR console/notebook returns
[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '/home/notebook/ProjectName/data/filename.csv': No such file or directory

offices <- read.csv("s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")
run from a SparkR console/notebook returns
[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file 's3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv': No such file or directory

offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")
offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>", header = TRUE, sep = ",", quote = "\"",dec = ".")
Both calls run without error but read in what is just a blank HTML page, so that
summary(offices)
run from a SparkR console/notebook returns
 X..DOCTYPE.html.
 Length:28
 Class :character
 Mode  :character

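It looks as though the associated (Python, PySpark, Spark, or SparkR) kernel is running inside some container on one of the mnt drives, since calling getwd() from a SparkR console/notebook returns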
/mnt1/yarn/usercache/livy/appcache/application_1678485106748_0005/container_1678485106748_0005_01_000001
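
If the kernel really does run in a YARN container on the cluster rather than on the Workspace host, local paths such as /home/notebook/work would not exist there, and one option might be to address the file by its S3 URI through a Spark reader instead of base R's read.csv. The sketch below is in Python, for the PySpark kernel, and covers only the path translation. It assumes the mirroring described above (/home/notebook/work corresponding to s3://OurBucketName/Subdirectory/work); workspace_path_to_s3_uri is a hypothetical helper, not part of any EMR API.

```python
from pathlib import PurePosixPath

# Assumption taken from the question: the terminal's /home/notebook/work
# mirrors this S3 prefix. Adjust both values for a real Studio.
LOCAL_ROOT = PurePosixPath("/home/notebook/work")
S3_PREFIX = "s3://OurBucketName/Subdirectory/work"


def workspace_path_to_s3_uri(local_path: str) -> str:
    """Map a path under LOCAL_ROOT to the matching S3 URI (hypothetical helper)."""
    # relative_to raises ValueError when local_path is not under LOCAL_ROOT.
    relative = PurePosixPath(local_path).relative_to(LOCAL_ROOT)
    return f"{S3_PREFIX}/{relative.as_posix()}"


uri = workspace_path_to_s3_uri("/home/notebook/work/ProjectName/data/filename.csv")
print(uri)  # s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv
```

In a PySpark kernel, assuming the usual predefined spark session, the resulting URI could then be passed to spark.read.csv(uri, header=True). Whether the cluster's role is allowed to read that prefix is a separate question that this sketch does not address.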

amazon-s3 amazon-emr aws-emr-studio