Union with an empty DataFrame returns no data

Question

I want to append DataFrames to an initially empty DataFrame inside a loop, and write the result to a location at the end.

My code:

val myMap = Map(1001 -> "rollNo='12'", 1002 -> "rollNo='13'")
val myHiveTableData = spark.table(<table_name>)
val allOtherIngestedData = spark.createDataFrame(sc.emptyRDD[Row], myHiveTableData.schema)
myMap.keys.foreach { i =>
  val filteredDataDf = myHiveTableData.where(myMap(i))
  val othersDf = myHiveTableData.except(filteredDataDf)
  allOtherIngestedData.union(othersDf)
  filteredDataDf.write.format("parquet")................... // Writing to a location in Parquet
}

allOtherIngestedData.write. ..................... //Writing to a Location in Parquet

However, allOtherIngestedData ends up with no data. If I run allOtherIngestedData.count, it returns Long = 0.

So how do I append to an empty DataFrame?

The same behaviour can be observed here:

val rawDataHiveDf = spark.table(allInputs("inputHiveTableName"))
val allOthersDf : DataFrame = spark.createDataFrame(sc.emptyRDD[Row],rawDataHiveDf.schema)
allOthersDf.union(rawDataHiveDf)
allOthersDf.count

Output:

rawDataHiveDf: org.apache.spark.sql.DataFrame = [eventclassversion: string, serialnumber: string ... 33 more fields]
allOthersDf: org.apache.spark.sql.DataFrame = [eventclassversion: string, serialnumber: string ... 33 more fields]
res46: Long = 0

Scala version = 2.11

Apache Spark = 2.4.3

Answer 1

This works fine on a sample DataFrame, as long as the result of the union is actually used.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit

val df = spark.range(2).withColumn("name", lit("foo"))
df.show(false)
df.printSchema()
/**
  * +---+----+
  * |id |name|
  * +---+----+
  * |0  |foo |
  * |1  |foo |
  * +---+----+
  *
  * root
  * |-- id: long (nullable = false)
  * |-- name: string (nullable = false)
  */

// Empty DataFrame with the same schema as df
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], df.schema)
emptyDF.show(false)
/**
  * +---+----+
  * |id |name|
  * +---+----+
  * +---+----+
  */

// The union returns a new DataFrame; here it is shown directly
emptyDF.unionByName(df)
  .show(false)
/**
  * +---+----+
  * |id |name|
  * +---+----+
  * |0  |foo |
  * |1  |foo |
  * +---+----+
  */

Answer 2

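The result of the union has to be stored in a separate DataFrame. DataFrames are immutable, so calling union on its own does not update allOthersDf; it returns a new DataFrame, which your code discards. You can do the following: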

val combinedDF = allOthersDf.union(rawDataHiveDf)

Alternatively, if you specifically want to update allOthersDf, declare it as a var instead of a val and then reassign it:

allOthersDf = allOthersDf.union(rawDataHiveDf)

This still creates a new DataFrame, as always, but allOthersDf now refers to that new DataFrame.
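Applied to the loop from the question, the same idea can be sketched without a var by folding over the filter conditions, so that each iteration receives the DataFrame accumulated so far. This is only a minimal sketch: the object name UnionInLoop, the helper unionOthers, and the small in-memory DataFrame are hypothetical stand-ins for the Hive table and are not part of the original code.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object UnionInLoop {
  // Hypothetical helper: for each condition, take the rows of `source` that do
  // NOT match it, and union all of those results into one DataFrame.
  def unionOthers(source: DataFrame, conditions: Seq[String]): DataFrame = {
    val spark = source.sparkSession
    // Empty DataFrame with the same schema, used as the starting accumulator.
    val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], source.schema)

    // union returns a new DataFrame each time; foldLeft passes that new value
    // on to the next iteration instead of discarding it.
    conditions.foldLeft(empty) { (acc, condition) =>
      val filtered = source.where(condition)
      acc.union(source.except(filtered))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-in-loop")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data standing in for spark.table(<table_name>).
    val source = Seq(("12", "a"), ("13", "b"), ("14", "c")).toDF("rollNo", "name")
    val conditions = Seq("rollNo = '12'", "rollNo = '13'")

    val allOthers = unionOthers(source, conditions)
    println(allOthers.count()) // no longer 0

    spark.stop()
  }
}
```

Note that union keeps duplicates, so a row that is excluded by none of the conditions (rollNo 14 in the sample data) appears once per iteration. If that is not wanted, calling distinct on the final result is one option.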
