Spark cluster heartbeat timeouts and YarnScheduler errors


I checked other related posts on Stack Overflow, but none of them seem directly relevant, since the Spark versions they refer to are more than five years old, while I am using Spark 3.3.3.

I am running an Apache Spark cluster with YARN as the master, and I use JupyterLab as the IDE. I start the Spark session with the following:

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.3'). \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Kafka and Spark Integration'). \
    master('yarn'). \
    getOrCreate()

The plan is to stream-read 24 JSON files from HDFS, each about 80 MB, then write the stream back out, partitioning the data by three columns and storing it as Parquet in another HDFS folder.
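
The file_df streaming DataFrame itself is not shown above; here is a minimal sketch of how it could be built over the HDFS JSON folder (the schema fields and the input path are placeholders, not details from the post):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder schema -- the real field list is not given in the post.
file_schema = StructType([
    StructField('id', StringType()),
    StructField('payload', StringType()),
    StructField('created_year', LongType()),
    StructField('created_month', LongType()),
    StructField('created_dayofmonth', LongType()),
])

# Streaming file sources need an explicit schema; the input folder here is hypothetical.
file_df = spark. \
    readStream. \
    schema(file_schema). \
    json(f'/user/{username}/file_df/input')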

This is the write command I use:

file_df. \
    writeStream. \
    partitionBy('created_year', 'created_month', 'created_dayofmonth'). \
    format('parquet'). \
    option("checkpointLocation", f"/user/{username}/file_df/streaming/checkpoint/file_df"). \
    option("path", f"/user/{username}/file_df/streaming/data/files_parq"). \
    trigger(once=True). \
    start()
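
As a side note, start() returns a StreamingQuery handle (the object echoed in the output below). A minimal sketch of blocking on that handle, so that the trigger-once run finishes and any failure surfaces as an exception in the notebook (assuming this is the only active query on the session):

query = spark.streams.active[0]  # assumes the query started above is the only active one
query.awaitTermination()         # blocks until the run completes; re-raises a streaming failure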

Then, when I run it in the notebook, I get this output:

23/09/13 21:34:55 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.

<pyspark.sql.streaming.StreamingQuery at 0x7fd22851adc0>

[Stage 3:===============>                                          (5 + 2) / 19]

23/09/13 21:42:56 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 120168 ms exceeds timeout 120000 ms
23/09/13 21:42:56 ERROR YarnScheduler: Lost executor 2 on sparkde.camp.300123.internal: Executor heartbeat timed out after 120168 ms
23/09/13 21:42:56 WARN TaskSetManager: Lost task 1.0 in stage 3.0 (TID 41) (sparkde.camp.300123.internal executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 120168 ms

[Stage 3:===============>                                          (5 + 1) / 19]

23/09/13 21:42:58 WARN TransportChannelHandler: Exception in connection from /10.172.0.3:41888
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:258)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)

[Stage 3:========================>                                 (8 + 2) / 19]

23/09/13 21:45:55 ERROR YarnScheduler: Lost executor 3 on sparkde.camp.300123.internal: Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143. 
[2023-09-13 21:45:55.151]Killed by external signal
.
23/09/13 21:45:55 WARN TaskSetManager: Lost task 1.1 in stage 3.0 (TID 47) (sparkde.camp.300123.internal executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143. 
[2023-09-13 21:45:55.151]Killed by external signal
.
23/09/13 21:45:55 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 3 for reason Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143. 
[2023-09-13 21:45:55.151]Killed by external signal

I noticed that the stages kept progressing, but eventually the job stopped with the errors shown below:

[Stage 3:=======================================>                 (13 + 1) / 19]

23/09/13 21:54:46 WARN TaskSetManager: Lost task 14.0 in stage 3.0 (TID 58) (sparkde.camp.300123.internal executor 6): TaskKilled (Stage cancelled)
23/09/13 22:03:05 WARN SparkContext: Executor 1 might already have stopped and can not request thread dump from it.

**Here is a screenshot of the Spark job**

Please help.

apache-spark hadoop hadoop-yarn

1 Answer

It turned out I needed to increase the memory allocated to YARN, so I updated these property values in yarn-site.xml:
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>21000</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>15048</value>
</property>

**So I just updated those values; before that I had been running with only 1 GB :)**
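
For completeness: the YARN properties above only raise the cluster-side limits; how much memory each executor actually requests is still configured on the Spark side. A rough sketch of passing such settings through the same builder used in the question (the values are illustrative, not taken from this answer):

# Illustrative values only: each executor request (spark.executor.memory plus
# spark.executor.memoryOverhead) has to fit within yarn.scheduler.maximum-allocation-mb.
spark = SparkSession. \
    builder. \
    config('spark.executor.memory', '4g'). \
    config('spark.executor.memoryOverhead', '512m'). \
    config('spark.executor.cores', '2'). \
    master('yarn'). \
    getOrCreate()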

After a while I hit the same error again, but this time it came from my VM side: I need to allocate more RAM to the machine, since its current configuration is 1 vCPU and 8 GB of RAM.
