I checked other related posts on Stack Overflow, but none of them seem directly relevant: they refer to Spark versions more than five years old, while I am using Spark 3.3.3.
I am running an Apache Spark cluster with YARN as the master, using JupyterLab as the IDE. I start the Spark session with the following:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.3'). \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Kafka and Spark Integration'). \
    master('yarn'). \
    getOrCreate()
The plan is to stream-read 24 JSON files from HDFS, each about 80 MB, then write the stream back out, partitioning the data by three columns and storing it as Parquet in another HDFS folder.
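For context, file_df would have been created with a streaming read along these lines. This is only a sketch: the actual readStream call is not shown in the post, so the input path and all schema fields other than the three partition columns are assumptions.

```python
# Sketch of how file_df could be defined; streaming file sources require
# an explicit schema (given here as a DDL string). Only created_year,
# created_month and created_dayofmonth are known from the partitionBy
# call -- the other fields are hypothetical.
json_schema = (
    "id STRING, payload STRING, "
    "created_year INT, created_month INT, created_dayofmonth INT"
)

# file_df = spark.readStream. \
#     schema(json_schema). \
#     json(f'/user/{username}/file_df/input')
```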
This is the command I used:
file_df. \
    writeStream. \
    partitionBy('created_year', 'created_month', 'created_dayofmonth'). \
    format('parquet'). \
    option("checkpointLocation", f"/user/{username}/file_df/streaming/checkpoint/file_df"). \
    option("path", f"/user/{username}/file_df/streaming/data/files_parq"). \
    trigger(once=True). \
    start()
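For reference, partitionBy on those three columns lays the Parquet files out in Hive-style key=value directories under the output path. A small pure-Python illustration of the layout (the date values are examples):

```python
# Hive-style partition directory that partitionBy('created_year',
# 'created_month', 'created_dayofmonth') produces under the output path.
def partition_dir(base, year, month, day):
    return (f"{base}/created_year={year}"
            f"/created_month={month}"
            f"/created_dayofmonth={day}")

print(partition_dir("/user/<username>/file_df/streaming/data/files_parq",
                    2023, 9, 13))
# → /user/<username>/file_df/streaming/data/files_parq/created_year=2023/created_month=9/created_dayofmonth=13
```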
Then, when I run this in the notebook, I get this output:
23/09/13 21:34:55 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
<pyspark.sql.streaming.StreamingQuery at 0x7fd22851adc0>
[Stage 3:===============> (5 + 2) / 19]
23/09/13 21:42:56 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 120168 ms exceeds timeout 120000 ms
23/09/13 21:42:56 ERROR YarnScheduler: Lost executor 2 on sparkde.camp.300123.internal: Executor heartbeat timed out after 120168 ms
23/09/13 21:42:56 WARN TaskSetManager: Lost task 1.0 in stage 3.0 (TID 41) (sparkde.camp.300123.internal executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 120168 ms
[Stage 3:===============> (5 + 1) / 19]
23/09/13 21:42:58 WARN TransportChannelHandler: Exception in connection from /10.172.0.3:41888
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:258)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:750)
[Stage 3:========================> (8 + 2) / 19]
23/09/13 21:45:55 ERROR YarnScheduler: Lost executor 3 on sparkde.camp.300123.internal: Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143.
[2023-09-13 21:45:55.151]Killed by external signal
.
23/09/13 21:45:55 WARN TaskSetManager: Lost task 1.1 in stage 3.0 (TID 47) (sparkde.camp.300123.internal executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143.
[2023-09-13 21:45:55.151]Killed by external signal
.
23/09/13 21:45:55 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 3 for reason Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143.
[2023-09-13 21:45:55.151]Killed by external signal
I noticed the stages kept progressing, but eventually the job stopped with the errors shown:
[Stage 3:=======================================> (13 + 1) / 19]
23/09/13 21:54:46 WARN TaskSetManager: Lost task 14.0 in stage 3.0 (TID 58) (sparkde.camp.300123.internal executor 6): TaskKilled (Stage cancelled)
23/09/13 22:03:05 WARN SparkContext: Executor 1 might already have stopped and can not request thread dump from it.
**Here is a screenshot of the Spark job**
Please help.
It turned out I needed to increase the memory allocated to YARN, which meant updating these property values in yarn-site.xml:
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>21000</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>15048</value>
</property>
**So I just updated these values; previously I was running with only 1 GB :)**
After a while I hit the same error again, but this time the problem was my VM: I needed to allocate more RAM to it, since its configuration was 1 vCPU and 8 GB of RAM.
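On top of the yarn-site.xml limits, per-executor memory can also be set when the session is built. A sketch with illustrative values (not from the post): the sum of spark.executor.memory and spark.executor.memoryOverhead must fit under yarn.scheduler.maximum-allocation-mb (15048 MB above), or YARN will refuse or kill the container.

```python
# Illustrative memory settings (assumed values, not from the post).
memory_conf = {
    "spark.executor.memory": "4g",
    "spark.executor.memoryOverhead": "1g",
    "spark.driver.memory": "2g",
}

# Applied when building the session, e.g.:
# builder = SparkSession.builder
# for key, value in memory_conf.items():
#     builder = builder.config(key, value)
# spark = builder.master('yarn').getOrCreate()

# Sanity check: executor memory + overhead must stay below the
# YARN per-container cap of 15048 MB.
total_mb = (4 + 1) * 1024
assert total_mb < 15048
```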