How to split an original DataFrame by columns and merge it back together

Question  Votes: 0  Answers: 1

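I have a DataFrame (say OriginalDf) with 230 columns and 10 rows. I need to split it by column count (for example, 150), so that df1 has 150 columns and df2 has 80 columns. When I merge the pieces back, the columns are merged correctly, but I see 20 rows. I am using Scala and Spark (3.2.0). Please suggest a solution.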

import org.apache.spark.sql.functions.col

// Specify the number of columns in each split
val columnsPerSplit = 100

// Get the total number of columns in the original DataFrame
val totalColumns = originalDF.columns.length

// Calculate the number of splits required
val numSplits = (totalColumns.toDouble / columnsPerSplit).ceil.toInt

// Split the original DataFrame into multiple DataFrames based on the specified number of columns
val splitDataFrames = (0 until numSplits).map { splitIndex =>
  val startColIndex = splitIndex * columnsPerSplit
  val endColIndex = Math.min((splitIndex + 1) * columnsPerSplit, totalColumns)

  // Select the columns for the current split
  val selectedColumns = originalDF.columns.slice(startColIndex, endColIndex).map(col)

  // Create a new DataFrame with the selected columns
  originalDF.select(selectedColumns: _*)
}

// Merge all the split DataFrames back together
val mergedDF = splitDataFrames.reduce((df1, df2) => df1.unionByName(df2, true))
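
As a sanity check on the arithmetic, here is a small plain-Scala sketch that needs no Spark. `ColumnChunks` and its methods are hypothetical names introduced only for illustration. It shows how many columns each chunk receives, and why a row-wise union multiplies the row count: `unionByName` appends the rows of each chunk instead of placing the chunks side by side.

```scala
object ColumnChunks {
  // Sizes of the column groups produced when `totalColumns` columns
  // are cut into chunks of at most `columnsPerSplit` columns each.
  def chunkSizes(totalColumns: Int, columnsPerSplit: Int): List[Int] =
    (1 to totalColumns).grouped(columnsPerSplit).map(_.length).toList

  // Row count after a row-wise union of `chunks` DataFrames that each
  // still contain all `rows` rows of the original.
  def rowsAfterUnion(rows: Int, chunks: Int): Int = rows * chunks
}

// ColumnChunks.chunkSizes(230, 150)   == List(150, 80)
// ColumnChunks.chunkSizes(230, 100)   == List(100, 100, 30)
// ColumnChunks.rowsAfterUnion(10, 2)  == 20
```

With the 150-column split from the description, this gives two chunks and 10 × 2 = 20 rows after the union, which matches the observed row count. With `columnsPerSplit = 100`, as written in the snippet, there would be three chunks instead.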


Expected: the number of rows should be the same in originalDF and mergedDF.

For example (shown in PySpark):

Actual:

df1 = spark.createDataFrame([[0, 1, 2]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[3, 4, 5]], ["col3", "col4", "col5"])
df1.unionByName(df2, allowMissingColumns=True).show()
+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|
+----+----+----+----+----+----+
|   0|   1|   2|NULL|NULL|NULL|
|NULL|NULL|NULL|   3|   4|   5|
+----+----+----+----+----+----+


Expected:

df1 = spark.createDataFrame([[0, 1, 2]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[3, 4, 5]], ["col3", "col4", "col5"])
df1.unionByName(df2, allowMissingColumns=True).show()
+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|
+----+----+----+----+----+----+
|   0|   1|   2|   3|   4|   5|
+----+----+----+----+----+----+
dataframe scala apache-spark apache-spark-sql
1 Answer
0
votes

Use the monotonically_increasing_id function to add a unique id column to the original DataFrame. You can then join the split DataFrames back together on this column.

import org.apache.spark.sql.functions.monotonically_increasing_id

// Add a unique id to every row so the two halves can be matched up again
val inputDF = df.withColumn("id", monotonically_increasing_id())
val splitAt = 15 // You can change this as per your need.

// Drop the trailing "id" column (init), then split the remaining column names
val splitColumns = inputDF.columns.init.splitAt(splitAt)

val leftColumns = splitColumns._1 ++ Seq("id")

// leftColumns are col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9, col_10, col_11, col_12, col_13, col_14, col_15, id

val rightColumns = splitColumns._2 ++ Seq("id")

// rightColumns are col_16, col_17, col_18, col_19, col_20, col_21, col_22, col_23, id

// Splitting the original DataFrame
val leftDF = inputDF.selectExpr(leftColumns: _*)
val rightDF = inputDF.selectExpr(rightColumns: _*)

// Merging the DataFrames back using a join on "id"
// (the Seq-based overload is used because it is available in Spark 3.2.0)
val mergedDF = leftDF.join(rightDF, Seq("id"), "inner")
inputDF.show(false)
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+---+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|col_11|col_12|col_13|col_14|col_15|col_16|col_17|col_18|col_19|col_20|col_21|col_22|col_23|id |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+---+
|A1   |A2   |A3   |A4   |A5   |A6   |A7   |A8   |A9   |A10   |A11   |A12   |A13   |A14   |A15   |A16   |A17   |A18   |A19   |A20   |A21   |A22   |A23   |0  |
|B1   |B2   |B3   |B4   |B5   |B6   |B7   |B8   |B9   |B10   |B11   |B12   |B13   |B14   |B15   |B16   |B17   |B18   |B19   |B20   |B21   |B22   |B23   |1  |
|C1   |C2   |C3   |C4   |C5   |C6   |C7   |C8   |C9   |C10   |C11   |C12   |C13   |C14   |C15   |C16   |C17   |C18   |C19   |C20   |C21   |C22   |C23   |2  |
|D1   |D2   |D3   |D4   |D5   |D6   |D7   |D8   |D9   |D10   |D11   |D12   |D13   |D14   |D15   |D16   |D17   |D18   |D19   |D20   |D21   |D22   |D23   |3  |
|E1   |E2   |E3   |E4   |E5   |E6   |E7   |E8   |E9   |E10   |E11   |E12   |E13   |E14   |E15   |E16   |E17   |E18   |E19   |E20   |E21   |E22   |E23   |4  |
|F1   |F2   |F3   |F4   |F5   |F6   |F7   |F8   |F9   |F10   |F11   |F12   |F13   |F14   |F15   |F16   |F17   |F18   |F19   |F20   |F21   |F22   |F23   |5  |
|G1   |G2   |G3   |G4   |G5   |G6   |G7   |G8   |G9   |G10   |G11   |G12   |G13   |G14   |G15   |G16   |G17   |G18   |G19   |G20   |G21   |G22   |G23   |6  |
|H1   |H2   |H3   |H4   |H5   |H6   |H7   |H8   |H9   |H10   |H11   |H12   |H13   |H14   |H15   |H16   |H17   |H18   |H19   |H20   |H21   |H22   |H23   |7  |
|I1   |I2   |I3   |I4   |I5   |I6   |I7   |I8   |I9   |I10   |I11   |I12   |I13   |I14   |I15   |I16   |I17   |I18   |I19   |I20   |I21   |I22   |I23   |8  |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+---+
leftDF.show(false)

+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+---+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|col_11|col_12|col_13|col_14|col_15|id |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+---+
|A1   |A2   |A3   |A4   |A5   |A6   |A7   |A8   |A9   |A10   |A11   |A12   |A13   |A14   |A15   |0  |
|B1   |B2   |B3   |B4   |B5   |B6   |B7   |B8   |B9   |B10   |B11   |B12   |B13   |B14   |B15   |1  |
|C1   |C2   |C3   |C4   |C5   |C6   |C7   |C8   |C9   |C10   |C11   |C12   |C13   |C14   |C15   |2  |
|D1   |D2   |D3   |D4   |D5   |D6   |D7   |D8   |D9   |D10   |D11   |D12   |D13   |D14   |D15   |3  |
|E1   |E2   |E3   |E4   |E5   |E6   |E7   |E8   |E9   |E10   |E11   |E12   |E13   |E14   |E15   |4  |
|F1   |F2   |F3   |F4   |F5   |F6   |F7   |F8   |F9   |F10   |F11   |F12   |F13   |F14   |F15   |5  |
|G1   |G2   |G3   |G4   |G5   |G6   |G7   |G8   |G9   |G10   |G11   |G12   |G13   |G14   |G15   |6  |
|H1   |H2   |H3   |H4   |H5   |H6   |H7   |H8   |H9   |H10   |H11   |H12   |H13   |H14   |H15   |7  |
|I1   |I2   |I3   |I4   |I5   |I6   |I7   |I8   |I9   |I10   |I11   |I12   |I13   |I14   |I15   |8  |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+---+
rightDF.show(false)

+------+------+------+------+------+------+------+------+---+
|col_16|col_17|col_18|col_19|col_20|col_21|col_22|col_23|id |
+------+------+------+------+------+------+------+------+---+
|A16   |A17   |A18   |A19   |A20   |A21   |A22   |A23   |0  |
|B16   |B17   |B18   |B19   |B20   |B21   |B22   |B23   |1  |
|C16   |C17   |C18   |C19   |C20   |C21   |C22   |C23   |2  |
|D16   |D17   |D18   |D19   |D20   |D21   |D22   |D23   |3  |
|E16   |E17   |E18   |E19   |E20   |E21   |E22   |E23   |4  |
|F16   |F17   |F18   |F19   |F20   |F21   |F22   |F23   |5  |
|G16   |G17   |G18   |G19   |G20   |G21   |G22   |G23   |6  |
|H16   |H17   |H18   |H19   |H20   |H21   |H22   |H23   |7  |
|I16   |I17   |I18   |I19   |I20   |I21   |I22   |I23   |8  |
+------+------+------+------+------+------+------+------+---+
mergedDF.show(false)
+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|id |col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|col_11|col_12|col_13|col_14|col_15|col_16|col_17|col_18|col_19|col_20|col_21|col_22|col_23|
+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|0  |A1   |A2   |A3   |A4   |A5   |A6   |A7   |A8   |A9   |A10   |A11   |A12   |A13   |A14   |A15   |A16   |A17   |A18   |A19   |A20   |A21   |A22   |A23   |
|1  |B1   |B2   |B3   |B4   |B5   |B6   |B7   |B8   |B9   |B10   |B11   |B12   |B13   |B14   |B15   |B16   |B17   |B18   |B19   |B20   |B21   |B22   |B23   |
|2  |C1   |C2   |C3   |C4   |C5   |C6   |C7   |C8   |C9   |C10   |C11   |C12   |C13   |C14   |C15   |C16   |C17   |C18   |C19   |C20   |C21   |C22   |C23   |
|3  |D1   |D2   |D3   |D4   |D5   |D6   |D7   |D8   |D9   |D10   |D11   |D12   |D13   |D14   |D15   |D16   |D17   |D18   |D19   |D20   |D21   |D22   |D23   |
|4  |E1   |E2   |E3   |E4   |E5   |E6   |E7   |E8   |E9   |E10   |E11   |E12   |E13   |E14   |E15   |E16   |E17   |E18   |E19   |E20   |E21   |E22   |E23   |
|5  |F1   |F2   |F3   |F4   |F5   |F6   |F7   |F8   |F9   |F10   |F11   |F12   |F13   |F14   |F15   |F16   |F17   |F18   |F19   |F20   |F21   |F22   |F23   |
|6  |G1   |G2   |G3   |G4   |G5   |G6   |G7   |G8   |G9   |G10   |G11   |G12   |G13   |G14   |G15   |G16   |G17   |G18   |G19   |G20   |G21   |G22   |G23   |
|7  |H1   |H2   |H3   |H4   |H5   |H6   |H7   |H8   |H9   |H10   |H11   |H12   |H13   |H14   |H15   |H16   |H17   |H18   |H19   |H20   |H21   |H22   |H23   |
|8  |I1   |I2   |I3   |I4   |I5   |I6   |I7   |I8   |I9   |I10   |I11   |I12   |I13   |I14   |I15   |I16   |I17   |I18   |I19   |I20   |I21   |I22   |I23   |
+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
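
The code above handles a two-way split. Below is a minimal sketch of how the same idea might extend to any number of column chunks, as in the question's `columnsPerSplit` loop. The names `SplitAndRejoin`, `RowId`, `withRowId`, `splitByColumns`, and `rejoin` are hypothetical, and the toy 3-row, 5-column DataFrame only stands in for the real data. Because `monotonically_increasing_id` is documented as non-deterministic, the sketch persists the keyed DataFrame before splitting; treat that as a precaution under the assumption that the cached data is not evicted, not as a guarantee.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

object SplitAndRejoin {
  // Hypothetical name for the temporary row key; choose one that cannot clash with a real column.
  val RowId = "__row_id"

  // Adds the row key and persists the result so every chunk is cut from the same keyed rows.
  def withRowId(df: DataFrame): DataFrame =
    df.withColumn(RowId, monotonically_increasing_id()).persist()

  // Cuts the keyed DataFrame into chunks of at most `columnsPerSplit` data columns,
  // each chunk also carrying the row key.
  def splitByColumns(keyed: DataFrame, columnsPerSplit: Int): List[DataFrame] =
    keyed.columns
      .filterNot(_ == RowId)
      .grouped(columnsPerSplit)
      .map(chunk => keyed.select((RowId +: chunk).map(col): _*))
      .toList

  // Joins the chunks back side by side on the row key, then drops the key.
  def rejoin(parts: List[DataFrame]): DataFrame =
    parts.reduce((left, right) => left.join(right, Seq(RowId), "inner")).drop(RowId)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-and-rejoin")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the 230-column DataFrame from the question.
    val originalDF = Seq(
      ("A1", "A2", "A3", "A4", "A5"),
      ("B1", "B2", "B3", "B4", "B5"),
      ("C1", "C2", "C3", "C4", "C5")
    ).toDF("col_1", "col_2", "col_3", "col_4", "col_5")

    val keyed = withRowId(originalDF)
    val parts = splitByColumns(keyed, columnsPerSplit = 2)
    val mergedDF = rejoin(parts)

    println(s"chunks: ${parts.length}, rows before: ${originalDF.count()}, rows after: ${mergedDF.count()}")
    mergedDF.show(false)

    keyed.unpersist()
    spark.stop()
  }
}
```

Each chunk can be transformed independently before `rejoin`, as long as the key column is kept and no rows are added or removed. The join does not guarantee the original row order, so add an explicit sort on the key before dropping it if order matters.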