On a job cluster with the following configuration:
Driver: Standard_E8ds_v5
Workers: Standard_E8ds_v5
30 workers
11.3 LTS Photon (includes Apache Spark 3.3.0, Scala 2.12)
roughly 5% of the time we hit the error
Futures timed out after [5 seconds]
with the stack trace shown at the bottom. I'm hoping the stack trace is enough for someone to tell me which Spark configuration to adjust to extend this 5-second timeout.
The job's notebook does this:

from concurrent.futures import ThreadPoolExecutor

def RunChild(s):
    # Run the ProcessChild notebook with no timeout (0), passing the scenario in
    return dbutils.notebook.run("./ProcessChild", 0, {"param": s})

scenarios = [ some array with 107 items]
with ThreadPoolExecutor(max_workers=20) as executor:
    final = executor.map(RunChild, scenarios)
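Not a config fix, but since the failures are intermittent (roughly 5% of runs), one stopgap is to retry each child run a few times before giving up. This is a minimal sketch under stated assumptions: run_with_retry, flaky, and the backoff values are all illustrative names I've introduced, and run_fn stands in for the dbutils.notebook.run call above, which only exists inside Databricks.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_with_retry(run_fn, arg, retries=3, backoff_s=10):
    """Call run_fn(arg); on failure, back off and retry up to `retries` times.

    run_fn stands in for the dbutils.notebook.run call; it is parameterized
    here only so this sketch can run outside Databricks.
    """
    last_exc = None
    for attempt in range(retries):
        try:
            return run_fn(arg)
        except Exception as exc:  # e.g. the "Futures timed out" failure
            last_exc = exc
            time.sleep(backoff_s * (attempt + 1))  # linear backoff between tries
    raise last_exc

# Usage with a stand-in runner that fails once, then succeeds:
calls = {"n": 0}
def flaky(s):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("Futures timed out after [5 seconds]")
    return f"done:{s}"

with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(
        lambda s: run_with_retry(flaky, s, backoff_s=0), ["scenario-1"]))
```

Note that executor.map returns results lazily, so materializing it (as with list() here) is what surfaces any exception raised by a child run.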
The ProcessChild notebook fails intermittently, each time at a different spot in the Spark code, with the following error and stack trace:
java.util.concurrent.TimeoutException: Futures timed out after [5 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at scala.concurrent.Await$.$anonfun$result$1(package.scala:223)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:57)
at scala.concurrent.Await$.result(package.scala:146)
at com.databricks.backend.daemon.driver.JupyterDriverLocal$RequestStatus.waitForReply(JupyterDriverLocal.scala:209)
at com.databricks.backend.daemon.driver.JupyterDriverLocal.repl(JupyterDriverLocal.scala:971)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$23(DriverLocal.scala:725)
at com.databricks.unity.EmptyHandle$.runWith(UCSHandle.scala:103)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$20(DriverLocal.scala:708)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:398)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:147)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:396)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:393)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:62)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:441)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:426)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:62)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:685)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:622)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:614)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:533)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:568)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:438)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:381)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:232)
at java.lang.Thread.run(Thread.java:750)
The cluster is certainly busy with all the parallel threads and operations, and we would like to know which Spark configuration can extend the 5-second timeout.
We hit the same problem, and what's more frustrating is that in our case everything ran perfectly well on 10.4 LTS. Simply upgrading the runtime and re-triggering the job caused the Futures timed out after [5 seconds] errors.
For us, we were able to increase the broadcast join timeout from -1000 to 300000 ms (5 minutes) with:

spark.conf.get("spark.sql.broadcastTimeout")
spark.conf.set("spark.sql.broadcastTimeout", '300000ms')
This has to be a temporary fix (at least for us, haha), but hopefully it gets you out of the immediate bind. We also ended up increasing the cluster size to help with memory.
Given that this was caused by broadcast joins, as part of our testing we also tried disabling automatic broadcast joins to see whether that helped (it did, but inconsistently):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
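If you would rather not set these per notebook, the same settings can be baked into the job cluster's Spark config instead, so every run picks them up. This is a sketch assuming a Databricks cluster-level Spark config; the values shown are illustrative, not recommendations:

```
spark.sql.broadcastTimeout 600s
spark.sql.autoBroadcastJoinThreshold -1
```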
Reference: https://spark.apache.org/docs/latest/sql-performance-tuning.html
Clearing the notebook's state worked for me.
If there are any unneeded display calls or other unnecessary actions, remove them and then clear the state. That did the trick for me.