Spark creating an extra partition column in S3

Problem description

I am writing a dataframe to S3 as shown below. Target location: s3://test/folder

    val targetDf=spark.read.parquet("targetLocation")
    val df1=spark.sql("select * from sourceDf")
    val df2=spark.sql(select * from targetDf)
/*    
for loop over a date range to dedup and write the data to s3
union dfs and run a dedup logic, have omitted dedup code and for loop
*/
    val df3=spark.sql("select * from df1 union all select * from df2")
    df3.write.partitionBy(data_id, schedule_dt).parquet("targetLocation")

Spark is creating an extra partition column while writing, as shown below:

Exception in thread "main" java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

Partition column name list #0: data_id, schedule_dt
Partition column name list #1: data_id, schedule_dt, schedule_dt

The EMR optimizer class is enabled during the write, and I am using Spark 2.4.3. Please let me know what might be causing this error.
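For context, Spark raises this assertion when partition discovery finds directories whose partition paths disagree, e.g. one branch laid out as data_id=.../schedule_dt=... and another with a nested schedule_dt=... level. Below is a minimal diagnostic sketch (using the s3://test/folder target from above; the check itself is my own, not part of the original job) that walks the target prefix and flags any object whose path contains the schedule_dt= directory more than once:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical diagnostic: recursively list the target prefix and print any
    // path in which the schedule_dt= partition directory appears twice.
    val target = new Path("s3://test/folder")
    val fs = target.getFileSystem(spark.sparkContext.hadoopConfiguration)

    val files = fs.listFiles(target, true) // recursive listing
    while (files.hasNext) {
      val p = files.next().getPath.toString
      if (p.split("/").count(_.startsWith("schedule_dt=")) > 1)
        println(s"Nested partition directory: $p")
    }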

Thanks, Abhineet

scala apache-spark apache-spark-sql amazon-emr
1 Answer
You should have one more column in addition to the partition columns. Could you please try

val df3=df1.union(df2)

instead of

val df3=spark.sql("select data_id,schedule_dt from df1 union all select data_id,schedule_dt from df2")
