Spark SparkFiles.get() 返回驱动程序路径而不是工作程序路径

Question

我正在通过外部可执行文件管道传输 RDD 的分区。我使用sparkContext.addFiles()，以便工作人员可以使用可执行文件。

当我尝试运行代码时，我收到可执行文件的 FileNotFound 异常，并且当我期望它返回工作位置时，运行 SparkFiles.getfile() 似乎会返回驱动程序位置。

spark.sparkContext.addFile("/dbfs/FileStore/Custom_Executable")
files_rdd  = spark.sparkContext.parallelize(files_list)

print(f'spark file path:{SparkFiles.get("Custom_Executable")}')

path_with_params =  SparkFiles.get("Custom_Executable") + " the-function-name  --to 
company1 --from  company2 -input - -output -"

print(f'path with params: {path_with_params}')

pipe_rdd = files_rdd.pipe(path_with_params,  env={'SOME_ENV_VAR': env_var_val})

print(pipe_tokenised_rdd.collect())

这个输出..

spark file path:/local_disk0/spark-c69e5328-9da3-4c76-85b8-a977e470909d/userFiles- 
e8a37109-046c-4909-8dd2-95bde5c9f3e3/Custom_Executable
exe path: /local_disk0/spark-c69e5328-9da3-4c76-85b8-a977e470909d/userFiles-e8a37109- 
046c-4909-8dd2-95bde5c9f3e3/Custom_Executable the-function-name  --to company1 --from  
company2 -input - -output -
Output
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 
0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 12) 
(10.113.4.168 executor 0): org.apache.spark.api.python.PythonException: 
&#39;FileNotFoundError: [Errno 2] No such file or directory: &#39;/local_disk0/spark- 
c69e5328-9da3-4c76-85b8-a977e470909d/userFiles-e8a37109-046c-4909-8dd2- 
95bde5c9f3e3/Custom_Executable&#39;&#39;. Full traceback below:

为什么管道 Sparkfiles.get() 返回驱动程序路径而不是工作路径？

Answer 1

经过多次测试，我注意到Spark文档说：

SparkFiles.get(file_name)
返回工作路径。

但它没有，而是返回驱动程序路径。但是，

sparkContext.addFile()

方法将文件添加到集群中的所有节点。

我不知道为什么会这样。也许这是 Spark 中的一个错误（我使用的是 3.5.0 版本）。

Spark SparkFiles.get() 返回驱动程序路径而不是工作程序路径

问题描述投票：0回答：1

1个回答

最新问题

Spark SparkFiles.get() 返回驱动程序路径而不是工作程序路径

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1