PySpark throws an exception on the first collect and take actions

Problem description | Votes: 0 | Answers: 1

I am new to PySpark and following a beginner-friendly course. The first step is to import the data and split each line on the comma delimiter (","). After splitting the data, any collect or take action throws an error. Please help.

import os
import sys
from pyspark import SparkContext
sc = SparkContext(master='local',appName='test_advent')
print(sc)
path = r'C:\Users\Documents\spark\Adventure\Adventure_RDD'
file = os.path.join(path,'Employee.csv')
Employee_RDD = sc.textFile(file)
Employee_RDD = Employee_RDD.map(lambda x:x.split(','))
print(Employee_RDD.collect())

The error is as follows:

Py4JJavaError                             Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 print(Employee_RDD.collect())

File ~\Anaconda3\Anaconda33\lib\site-packages\pyspark\rdd.py:2836, in RDD.take(self, num)
   2833         taken += 1
   2835 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 2836 res = self.context.runJob(self, takeUpToNumLeft, p)
   2838 items += res
   2839 partsScanned += numPartsToTry

File ~\Anaconda3\Anaconda33\lib\site-packages\pyspark\context.py:2319, in SparkContext.runJob(self, rdd, partitionFunc, partitions, allowLocal)
   2317 mappedRDD = rdd.mapPartitions(partitionFunc)
   2318 assert self._jvm is not None
-> 2319 sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
   2320 return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))

File ~\Anaconda3\Anaconda33\lib\site-packages\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File ~\Anaconda3\Anaconda33\lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

**Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 6) (INNF7WQTN3.zone1.scb.net executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified**
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
    .. 17 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)
Caused by: java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
python apache-spark pyspark
1 Answer

0 votes

The problem is not in your RDD code but in how Spark launches its Python workers. The error says Spark cannot find the Python executable it expects ("python3"): when an action such as collect() or take() runs, the executor tries to start a Python worker process, and by default it looks for a command named python3. On Windows, a standard Python or Anaconda installation provides python.exe but no python3 command, so the worker fails with "CreateProcess error=2, The system cannot find the file specified". The fix is to point Spark at an interpreter that actually exists, via the PYSPARK_PYTHON (and PYSPARK_DRIVER_PYTHON) environment variables, before the SparkContext is created; see the sketch below.
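A minimal sketch of one common fix, assuming everything runs locally with the same interpreter that runs the notebook (sys.executable); paths and names are taken from the question:

import os
import sys
from pyspark import SparkContext

# Tell Spark to launch its workers with the interpreter running this script,
# instead of the default "python3", which Windows usually does not provide.
# These must be set before the SparkContext is created.
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

sc = SparkContext(master='local', appName='test_advent')

path = r'C:\Users\Documents\spark\Adventure\Adventure_RDD'
file = os.path.join(path, 'Employee.csv')

# Read the file, split each line on commas, and trigger an action.
Employee_RDD = sc.textFile(file).map(lambda x: x.split(','))
print(Employee_RDD.take(5))  # or .collect() for the whole file

Alternatively, the same variables can be set as Windows system environment variables or in Spark's conf\spark-env.cmd; either way they must point at a Python executable that exists on the machine running the executors.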
