PySpark: DataFrame to .rdd throws an error

Problem description

I am trying to convert a DF to an RDD, but it throws an error: 'JavaPackage' object is not callable.

I followed a tutorial on this and have tried several different approaches to get it working. I believe it may be a configuration issue.

(On Windows: spark-3.3.2-bin-hadoop2 with winutils.exe, running in a Jupyter Notebook)
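For reference, a mismatch between the pip-installed pyspark package and the Spark distribution that SPARK_HOME points to is a common cause of "'JavaPackage' object is not callable". A minimal sketch to compare the two reported versions, assuming a SparkSession can still be created:

import os
import pyspark
from pyspark.sql import SparkSession

# Version of the pip-installed pyspark package (what the notebook imports)
print("pyspark package version:", pyspark.__version__)

# Version of the JVM-side Spark that SPARK_HOME points to
spark = SparkSession.builder.master("local[*]").getOrCreate()
print("Spark JVM version:", spark.version)
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))

If the two versions differ (for example a 3.5.x pyspark package against the 3.3.2 distribution), aligning them is the first thing to try.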

Code

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol='Final Feature Vector', labelCol='slice Type')
rf = rf.fit(train)

pred_train_df = rf.transform(train).withColumnRenamed('rawPrediction', 'Pred Slice Type')
pred_test_df = rf.transform(test).withColumnRenamed('rawPrediction', 'Pred Slice Type')
pred_test_df.show(5)

predictions_and_actuals = pred_test_df[["Pred Slice Type", 'slice Type']]

predictions_and_actuals_rdd = predictions_and_actuals.rdd
predictions_and_actuals_rdd.take(2)  # Error occurs here
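As a sanity check, the same DataFrame → .rdd → take() path can be exercised on a throwaway DataFrame that has nothing to do with the model; if this minimal sketch fails with the same error, the problem is environmental rather than anything in the pipeline above (the data and column names here are only illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Tiny throwaway DataFrame, unrelated to the training data
tiny_df = spark.createDataFrame([(0.0, 1.0), (1.0, 0.0)], ["pred", "label"])

# Same conversion and action as above: DataFrame -> RDD -> take()
print(tiny_df.rdd.take(2))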

Traceback

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[23], line 6
      4 predictions_and_actuals = pred_test_df[["Pred Slice Type", 'slice Type']]
      5 predictions_and_actuals_rdd = predictions_and_actuals.rdd
----> 6 predictions_and_actuals_rdd.take(2)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pyspark\rdd.py:2836, in RDD.take(self, num)
   2833         taken += 1
   2835 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 2836 res = self.context.runJob(self, takeUpToNumLeft, p)
   2838 items += res
   2839 partsScanned += numPartsToTry

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pyspark\context.py:2319, in SparkContext.runJob(self, rdd, partitionFunc, partitions, allowLocal)
   2317 mappedRDD = rdd.mapPartitions(partitionFunc)
   2318 assert self._jvm is not None
-> 2319 sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
   2320 return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pyspark\rdd.py:5441, in PipelinedRDD._jrdd(self)
   5438 else:
   5439     profiler = None
-> 5441 wrapped_func = _wrap_function(
   5442     self.ctx, self.func, self._prev_jrdd_deserializer, self._jrdd_deserializer, profiler
   5443 )
   5445 assert self.ctx._jvm is not None
   5446 python_rdd = self.ctx._jvm.PythonRDD(
   5447     self._prev_jrdd.rdd(), wrapped_func, self.preservesPartitioning, self.is_barrier
   5448 )

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pyspark\rdd.py:5243, in _wrap_function(sc, func, deserializer, serializer, profiler)
   5241 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
   5242 assert sc._jvm is not None
-> 5243 return sc._jvm.SimplePythonFunction(
   5244     bytearray(pickled_command),
   5245     env,
   5246     includes,
   5247     sc.pythonExec,
   5248     sc.pythonVer,
   5249     broadcast_vars,
   5250     sc._javaAccumulator,
   5251 )

TypeError: 'JavaPackage' object is not callable
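The call that fails is sc._jvm.SimplePythonFunction(...). With py4j, a class that is missing on the JVM side resolves to a JavaPackage instead of a JavaClass, which produces exactly this "not callable" error. A small probe (a sketch, assuming an active SparkSession named spark) can show whether the class exists in the running JVM:

# Probe whether the JVM class the traceback needs is actually available.
# If this prints JavaPackage rather than JavaClass, the class is missing on
# the JVM side, which points to a pyspark / Spark version mismatch.
jvm = spark.sparkContext._jvm
print(type(jvm.org.apache.spark.api.python.SimplePythonFunction))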

A sample of the train table can be provided if needed.

The result I expect is the DataFrame converted to an RDD.
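For reference, once the environment issue is resolved, .rdd yields an RDD of Row objects; a minimal sketch (continuing from the code above, using the question's column names) of turning those rows into plain (prediction, label) tuples:

# Hypothetical continuation once .rdd works: Row objects -> plain tuples
pairs_rdd = predictions_and_actuals.rdd.map(
    lambda row: (row["Pred Slice Type"], row["slice Type"])
)
print(pairs_rdd.take(2))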

apache-spark pyspark apache-spark-sql
1 Answer

I am getting the same error with RDDs in pyspark.
