Spark job writing a dataframe to HDFS is aborted at FileFormatWriter.scala:196

Problem description (0 votes, 1 answer)

I am trying to store a dataframe to HDFS using the following Spark Scala code.

All columns in the dataframe are nullable = true.

    import org.apache.spark.sql.SaveMode

    Intermediate_data_final.coalesce(100).write
      .option("header", value = true)
      .option("compression", "bzip2")
      .mode(SaveMode.Append)
      .csv(path)

But I am getting this error:

2019-08-08T17:22:21.108+0000: [GC (Allocation Failure) [PSYoungGen: 979968K->34277K(1014272K)] 1027111K->169140K(1473536K), 0.0759544 secs] [Times: user=0.61 sys=0.18, real=0.07 secs]
2019-08-08T17:22:32.032+0000: [GC (Allocation Failure) [PSYoungGen: 1014245K->34301K(840192K)] 1149108K->263054K(1299456K), 0.0540687 secs] [Times: user=0.49 sys=0.13, real=0.05 secs]
Job aborted.
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)

Can someone help me with this?

scala apache-spark apache-spark-sql file-format
1 Answer
0 votes

Not a solution to your exact problem, I'm afraid, but in case anyone runs into this with pyspark: I managed to resolve it by separating the query execution and the write execution into separate commands.

df.select(foo).filter(bar).map(baz).write.parquet(out_path)

This would fail with that error message (on a 3.5 GB dataframe), but the following works fine:

x = df.select(foo).filter(bar).map(baz)
x.write.parquet(out_path)
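
For the Scala code in the question, a minimal sketch of the same idea follows; it is an assumption rather than a verified fix. Spark transformations are lazy, so merely assigning the dataframe to a val does not by itself run the query, which is why this sketch uses cache() plus a count() action to force the query to materialize before the write begins.

    // A sketch only (untested assumption), using the names from the question.
    // cache() marks the dataframe for caching; count() is an action that
    // actually runs the query, so the full computation is materialized
    // before the write stage starts.
    val materialized = Intermediate_data_final.cache()
    materialized.count()

    materialized.coalesce(100).write
      .option("header", value = true)
      .option("compression", "bzip2")
      .mode(SaveMode.Append)
      .csv(path)

The trade-off in this sketch is executor memory: caching keeps the computed partitions resident so that the write stage re-reads them instead of recomputing the whole query.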