Spark Scala: column-level mismatches between two DataFrames

Question (votes: 0, answers: 1)

I have two DataFrames:


val df1 = Seq((1, "1","6"), (2, "10","8"), (3, "6","4")).toDF("id", "value1","value2")
val df2 = Seq((1, "1","6"), (2, "5","4"), (4, "3","1")).toDF("id", "value1","value2")

I want to find the column-level differences between them. The output should look like this:

id,value1_df1,value1_df2,diff_value1,value2_df1,value2_df2,diff_value2
1 ,1         ,1         ,0          ,6         ,6         ,0
2 ,10        ,5         ,5          ,8         ,4         ,4
3 ,6         ,3         ,3          ,4         ,1         ,3

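In the same way, I have 100 columns and want to compute the difference between each pair of same-named columns in the two DataFrames, so the column list has to be dynamic.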

scala apache-spark difference
1 Answer

If I understand you correctly, what you are trying to do is a join between the two DataFrames:

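import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("Simple Application")
  .config("spark.master", "local")
  .getOrCreate()

import spark.implicits._

// The original key is named id1 here so it does not clash with the generated row id below.
val df1 = spark.createDataFrame(Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4"))).toDF("id1", "value1", "value2")
val df2 = spark.createDataFrame(Seq((1, "1", "6"), (2, "5", "4"), (4, "3", "1"))).toDF("id1", "value1", "value2")

// Tag each row with a generated id and join on it, which pairs the rows by position.
// monotonically_increasing_id() only guarantees unique, increasing ids, not consecutive ones;
// the ids line up here because each small DataFrame sits in a single partition.
val d1 = df1.withColumn("id", monotonically_increasing_id())
val d2 = df2.withColumn("id", monotonically_increasing_id())
val res = d1.join(d2, "id")
res.show()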

Output:

+---+---+------+------+---+------+------+
| id|id1|value1|value2|id1|value1|value2|
+---+---+------+------+---+------+------+
|  0|  1|     1|     6|  1|     1|     6|
|  1|  2|    10|     8|  2|     5|     4|
|  2|  3|     6|     4|  4|     3|     1|
+---+---+------+------+---+------+------+

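As for the column names, you should have a list of names that you can iterate over, but the question does not give enough information to build it. You can find more on working with column names here.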
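One way to drive the comparison from a list of names is sketched below. It is an illustration rather than a definitive solution, and it rests on several assumptions: the helper `columnDiffs` and the `row_id` column are hypothetical names, the list of value columns is taken from `df1.columns`, both DataFrames have the same number of rows in a single partition (so the generated ids pair rows by position, as in the join above), and the string values can be cast to `int`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

object ColumnDiffSketch {
  // Hypothetical helper: pairs rows by position and emits, for every name in
  // valueCols, the df1 value, the df2 value, and their difference (df1 - df2).
  def columnDiffs(df1: DataFrame, df2: DataFrame, idCol: String, valueCols: Seq[String]): DataFrame = {
    // Suffix the value columns so the joined result has unambiguous names.
    val left = df1
      .select(col(idCol) +: valueCols.map(c => col(c).as(s"${c}_df1")): _*)
      .withColumn("row_id", monotonically_increasing_id())
    val right = df2
      .select(valueCols.map(c => col(c).as(s"${c}_df2")): _*)
      .withColumn("row_id", monotonically_increasing_id())

    // Assumption: the values are numeric strings, so they are cast to int before subtracting.
    val outCols = valueCols.flatMap { c =>
      val a = col(s"${c}_df1")
      val b = col(s"${c}_df2")
      Seq(a, b, (a.cast("int") - b.cast("int")).as(s"diff_$c"))
    }

    left.join(right, "row_id")
      .orderBy("row_id")
      .select(col(idCol) +: outCols: _*)
  }

  def main(args: Array[String]): Unit = {
    // local[1] keeps each small DataFrame in one partition, which the positional pairing relies on.
    val spark = SparkSession.builder.appName("ColumnDiffSketch").master("local[1]").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
    val df2 = Seq((1, "1", "6"), (2, "5", "4"), (4, "3", "1")).toDF("id", "value1", "value2")

    // Build the column list dynamically: every column except the key.
    val valueCols = df1.columns.filter(_ != "id").toSeq

    columnDiffs(df1, df2, "id", valueCols).show()
    spark.stop()
  }
}
```

Because the list comes from `df1.columns`, the same call should work unchanged for 100 columns, as long as both DataFrames share those column names. If the rows ought to be matched by the real `id` rather than by position, joining on that key instead of the generated `row_id` would be the safer choice.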
