Spark RDD.pipe FileNotFoundError: [WinError 2] The system cannot find the file specified


My goal is to call an external (dotnet) process from pyspark via RDD.pipe. Since that failed, I wanted to test piping through a simple command instead:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
result_rdd = spark.sparkContext.parallelize(['1', '2', '', '3']).pipe(command).collect()

However, I get the following error message:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) ( executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\projectpath\.venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 686, in main
  File "C:\projectpath\.venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 676, in process
  File "C:\projectpath\.venv\lib\site-packages\pyspark\rdd.py", line 540, in func
    return f(iterator)
  File "C:\projectpath\.venv\lib\site-packages\pyspark\rdd.py", line 1117, in func
    pipe = Popen(shlex.split(command), env=env, stdin=PIPE, stdout=PIPE)
  File "C:\Users\username\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\username\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
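
For reference, the last frame of the traceback shows how RDD.pipe launches the child process: Popen(shlex.split(command), env=env, stdin=PIPE, stdout=PIPE). A stripped-down reproduction of that call outside Spark (the command string below is purely hypothetical) can help isolate whether the failure comes from Spark or from Popen itself:

import shlex
from subprocess import PIPE, Popen

command = "findstr ."  # hypothetical Windows command, stands in for the real external process

# Mirrors the call pyspark makes inside pipe(); env={} is what pipe falls back to
# when no env argument is given.
proc = Popen(shlex.split(command), env={}, stdin=PIPE, stdout=PIPE)
out, _ = proc.communicate(b"1\n2\n\n3\n")
print(out.decode())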

My local setup has JAVA_HOME and HADOOP_HOME defined; I am using the pyspark package and have not defined SPARK_HOME.
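
In case it helps to rule out the environment itself, here is a quick sketch (assuming the standard variable names) that prints what the Python process actually sees:

import os

# SPARK_HOME is expected to be unset in this setup; the others should point at valid installs.
for name in ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME", "PATH"):
    print(name, "=", os.environ.get(name))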

apache-spark hadoop pyspark pipe
1 Answer

0 votes

Update: I found a workaround that works for me. Looking at the pyspark implementation, I saw that when no env argument is given, an empty dict is passed as the env parameter to Popen, and doing the same with Popen directly produced the same error. Simply passing a dict with some value in it solves the problem:

pipe(command, env={"1": "2"})
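
Put together with the snippet from the question, a minimal sketch of the workaround (the command string is still a hypothetical stand-in for the real external process) would look like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("test").getOrCreate()

command = "findstr ."  # hypothetical placeholder for the external (dotnet) command

# Any non-empty dict works; the point is not to hand Popen an empty environment.
result = (
    spark.sparkContext
    .parallelize(['1', '2', '', '3'])
    .pipe(command, env={"1": "2"})
    .collect()
)
print(result)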