In our PySpark job there is a case where we join a larger DataFrame with a relatively small one. I believe Spark is using a broadcast join, and we are running into the following error:
org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 8 GB
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:103)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I tried to disable the broadcast join by setting 'spark.sql.autoBroadcastJoinThreshold': '-1' as part of spark-submit:
/usr/bin/spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 /home/hadoop/scripts/job.py
I tried printing the value of spark.sql.autoBroadcastJoinThreshold with:
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
and it returns -1. However, even with this change I still get the error:
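For reference, the same setting can also be applied programmatically when building the SparkSession (a minimal config sketch; the app name is a placeholder). One thing worth checking is whether the query contains an explicit broadcast() hint, since a hint forces a broadcast join regardless of the threshold:

```python
from pyspark.sql import SparkSession

# -1 disables automatic broadcast joins; this is a runtime SQL conf,
# so it can also be changed later via spark.conf.set(...).
spark = (
    SparkSession.builder
    .appName("disable-broadcast-join")  # placeholder app name
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)

# Caveat: an explicit broadcast hint still forces a broadcast join
# even when the threshold is -1, e.g.:
#   from pyspark.sql.functions import broadcast
#   big_df.join(broadcast(small_df), "id")
```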
org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 8 GB
The Spark version is Spark 2.3.0.
Any help is appreciated.