Avoiding a join in a Spark Scala DataFrame


I have to run calculations over JSON files stored in an Azure Blob Storage folder. I am developing with Apache Spark on Azure HDInsight.

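Each folder name carries a number that reflects the tracking order. When a folder with a higher number exists, I must read its JSON files and discard the folders with lower numbers. For example, given folders named 20200501-1 and 20200501-2, I must read 20200501-2.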
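Taken literally, the folder rule above can be sketched in plain Scala, without Spark. The object and method names are hypothetical, and the sketch assumes every relevant folder is named as an 8-digit date, a dash, and a numeric version. It compares versions as numbers, so 20200501-10 would win over 20200501-9.

```scala
// Hypothetical helper: for each date prefix, keep the folder with the highest version.
object LatestFolders {
  // Assumed naming scheme, e.g. "20200501-2": 8-digit date, dash, version number.
  private val FolderName = """(\d{8})-(\d+)""".r

  def latestPerDate(folders: Seq[String]): Seq[String] =
    folders
      // Names that do not follow the assumed scheme are ignored.
      .collect { case name @ FolderName(date, version) => (date, version.toInt, name) }
      .groupBy(_._1)                          // group by date prefix
      .values
      .map(group => group.maxBy(_._2)._3)     // highest numeric version wins
      .toSeq
      .sorted
}
```

Note that this selects whole folders. The per-item transformation further down in the question keeps rows from a lower-numbered folder when an item has no newer version, which is a different rule.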

The solution I found in Apache Spark is to read the path and add it to the DataFrame as a column, like this:

val visits = session.read.schema(schema).json(pathData).withColumn("path", input_file_name())

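Using this path column, I apply some transformations. However, they involve a join and a group-by, so the Spark job takes a long time when I run it on the cluster with a large dataset. Is there a different transformation I could use, or a way to improve my approach?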

My transformation works on a DataFrame like this one (after the column has been added):

  val visits = Seq(
    ("ITEM4449", 33, "https://[email protected]/20200514-1/somename.json"),
    ("ITEM4450", 16, "https://[email protected]/20200514-1/somename.json"),
    ("ITEM1111", 88, "https://[email protected]/20200514-2/somename.json"),
    ("ITEM4453", 64, "https://[email protected]/20200514-1/somename.json"),
    ("ITEM1111", 12, "https://[email protected]/20200514-1/somename.json")).
    toDF("itemId", "visits", "path")

I apply this transformation:

  def discardByTrackingCode(rawDataFrame: DataFrame): DataFrame = {
    val visitWithColumn = rawDataFrame.
      withColumn("tracking_version",
        expr("substring(path, 38, 1)"))
    val itemVersionDf = visitWithColumn.
      withColumn("item_version",
        concat(col("ItemId"), lit("_"), col("tracking_version")))
    val versionToTakeDf = itemVersionDf.
      groupBy(col("ItemId").as("item_id_delete")).
      agg(max("item_version").as("item_version"))
    val itemReport = itemVersionDf.join(versionToTakeDf, Seq("item_version"))
    val finalDf = itemReport.select("ItemId", "Visits", "item_version")
    finalDf
  }

and get the following, correct, DataFrame:

+--------+------+------------+
|ItemId  |Visits|item_version|
+--------+------+------------+
|ITEM4449|33    |ITEM4449_1  |
|ITEM4450|16    |ITEM4450_1  |
|ITEM1111|88    |ITEM1111_2  |
|ITEM4453|64    |ITEM4453_1  |
+--------+------+------------+

Is there a more efficient way to implement this function? Also, would it be possible (or better) to find the folders using the Hadoop FileSystem class?
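On the FileSystem idea, a minimal sketch could look like the following. It assumes that `root` is a plain directory (not a glob) whose sub-folders follow the `<yyyyMMdd>-<version>` naming scheme, and that whole folders may be dropped. `FolderPruning` and `latestFolders` are hypothetical names; the Hadoop calls used are `Path.getFileSystem`, `FileSystem.listStatus`, and `FileStatus.isDirectory`.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: list sub-folders and keep the highest version per date.
object FolderPruning {
  // Assumed folder naming scheme: 8-digit date, dash, version number.
  private val FolderName = """(\d{8})-(\d+)""".r

  def latestFolders(root: String, conf: Configuration): Seq[String] = {
    val rootPath = new Path(root)
    val fs = rootPath.getFileSystem(conf)   // resolves the file system from the URI scheme
    fs.listStatus(rootPath).toSeq
      .filter(_.isDirectory)                // skip plain files
      .map(_.getPath)
      .map(p => (p.getName, p.toString))
      .collect { case (FolderName(date, version), full) => (date, version.toInt, full) }
      .groupBy(_._1)
      .values
      .map(group => group.maxBy(_._2)._3)   // full path of the newest folder per date
      .toSeq
      .sorted
  }
}
```

With the names from the question, the result could then be passed to the reader as `session.read.schema(schema).json(FolderPruning.latestFolders(pathData, session.sparkContext.hadoopConfiguration): _*)`. This only reproduces the expected output if older folders can be ignored entirely. In the sample data ITEM4449 exists only in folder 20200514-1 and is still kept, so pruning folders before reading would change the result there.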

scala apache-spark hadoop hdinsight
1 Answer

You can try using a Window expression:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val window = Window.partitionBy("itemidnumber").orderBy(desc("fileversion"))

val visits = Seq(
    ("ITEM4449", 33, "https://[email protected]/20200514-1/somename.json"),
    ("ITEM4450", 16, "https://[email protected]/20200514-1/somename.json"),
    ("ITEM1111", 88, "https://[email protected]/20200514-2/somename.json"),
    ("ITEM4453", 64, "https://[email protected]/20200514-1/somename.json"),
    ("ITEM1111", 12, "https://[email protected]/20200514-1/somename.json"))
    .toDF("itemId", "visits", "path")
    .withColumn("itemidnumber", expr("substring(itemId, 5, 4)"))
    .withColumn("fileversion", expr("substring(path, 38, 1)"))
    .withColumn("tracking_version", expr("concat(itemidnumber, substring(path, 38, 1))"))
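    // Rank rows within each item, newest file version first, and keep only the top row
    .withColumn("row_number", row_number().over(window))
    .filter($"row_number" === 1)

// display is a Databricks notebook helper; elsewhere use visits.show(false)
display(visits)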

Output:

(Screenshot: Databricks Community output)
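Another way to avoid the join, offered as a sketch rather than a measured improvement, is a single aggregation over a struct. Spark compares structs field by field, so `max(struct(version, visits))` returns the visits of the row with the highest version for each item. The names `LatestVisits`, `latestPerItem`, and `version` are hypothetical, and the regular expression assumes a path layout of `.../<yyyyMMdd>-<version>/<file>`. Extracting the version with a pattern and casting it to an integer also avoids the fixed offset in `substring(path, 38, 1)` and handles versions with more than one digit.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max, regexp_extract, struct}

// Hypothetical helper: one row per item, taken from its highest-versioned folder.
object LatestVisits {
  // Assumed path layout: .../<yyyyMMdd>-<version>/<file>.json
  private val VersionPattern = """/\d{8}-(\d+)/"""

  def latestPerItem(raw: DataFrame): DataFrame =
    raw
      // Capture the version digits and compare them as numbers, not strings.
      .withColumn("version", regexp_extract(col("path"), VersionPattern, 1).cast("int"))
      .groupBy(col("itemId"))
      // The struct is ordered by version first, then visits.
      .agg(max(struct(col("version"), col("visits"))).as("latest"))
      .select(
        col("itemId"),
        col("latest.visits").as("visits"),
        col("latest.version").as("version"))
}
```

Like the window version, this needs one shuffle on `itemId`. It differs in how ties are handled: if an item has several rows with the same version, this sketch keeps the one with the most visits, whereas the original join would return all of them.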
