I am trying to convert a DataFrame to an RDD, but it throws an error: 'JavaPackage' object is not callable.
I followed a tutorial on this and have tried several different approaches to get it working. I believe it may be a configuration issue.
(On Windows: spark-3.3.2-bin-hadoop2 with winutils.exe, running in a Jupyter Notebook)
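Since I suspect configuration, one check worth recording (a minimal sketch; the idea that the pip-installed pyspark must match the Spark binaries on disk is my assumption, not something I have confirmed):

import os
import pyspark

print(pyspark.__version__)            # I would expect 3.3.2, matching spark-3.3.2-bin-hadoop2
print(os.environ.get('SPARK_HOME'))   # should point at the spark-3.3.2-bin-hadoop2 folder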
Code
from pyspark.ml.classification import RandomForestClassifier

# Train a random forest on the assembled feature vector
rf = RandomForestClassifier(featuresCol='Final Feature Vector', labelCol='slice Type')
rf = rf.fit(train)

# Score the train and test sets, renaming the raw prediction column
pred_train_df = rf.transform(train).withColumnRenamed('rawPrediction', 'Pred Slice Type')
pred_test_df = rf.transform(test).withColumnRenamed('rawPrediction', 'Pred Slice Type')
pred_test_df.show(5)

# Select the prediction and label columns, then convert to an RDD
predictions_and_actuals = pred_test_df[["Pred Slice Type", 'slice Type']]
predictions_and_actuals_rdd = predictions_and_actuals.rdd
predictions_and_actuals_rdd.take(2)  # Error occurs here
Traceback
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[23], line 6
4 predictions_and_actuals = pred_test_df[["Pred Slice Type", 'slice Type']]
5 predictions_and_actuals_rdd = predictions_and_actuals.rdd
----> 6 predictions_and_actuals_rdd.take(2)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pyspark\rdd.py:2836, in RDD.take(self, num)
2833 taken += 1
2835 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 2836 res = self.context.runJob(self, takeUpToNumLeft, p)
2838 items += res
2839 partsScanned += numPartsToTry
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pyspark\context.py:2319, in SparkContext.runJob(self, rdd, partitionFunc, partitions, allowLocal)
2317 mappedRDD = rdd.mapPartitions(partitionFunc)
2318 assert self._jvm is not None
-> 2319 sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
2320 return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pyspark\rdd.py:5441, in PipelinedRDD._jrdd(self)
5438 else:
5439 profiler = None
-> 5441 wrapped_func = _wrap_function(
5442 self.ctx, self.func, self._prev_jrdd_deserializer, self._jrdd_deserializer, profiler
5443 )
5445 assert self.ctx._jvm is not None
5446 python_rdd = self.ctx._jvm.PythonRDD(
5447 self._prev_jrdd.rdd(), wrapped_func, self.preservesPartitioning, self.is_barrier
5448 )
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pyspark\rdd.py:5243, in _wrap_function(sc, func, deserializer, serializer, profiler)
5241 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
5242 assert sc._jvm is not None
-> 5243 return sc._jvm.SimplePythonFunction(
5244 bytearray(pickled_command),
5245 env,
5246 includes,
5247 sc.pythonExec,
5248 sc.pythonVer,
5249 broadcast_vars,
5250 sc._javaAccumulator,
5251 )
TypeError: 'JavaPackage' object is not callable
What I expected was to convert the DataFrame to an RDD.
I get the same error with RDDs in PySpark in general.
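For reference, a minimal reproduction I would use to confirm the error is environment-wide rather than specific to this pipeline (assuming an active SparkSession, as in my notebook):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])
df.rdd.take(2)  # raises the same 'JavaPackage' object is not callable for me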