I have two DataFrames:
val df1 = Seq((1, "1","6"), (2, "10","8"), (3, "6","4")).toDF("id", "value1","value2")
val df2 = Seq((1, "1","6"), (2, "5","4"), (4, "3","1")).toDF("id", "value1","value2")
I want to find the differences at the column level. The output should look like this:
id,value1_df1,value1_df2,diff_value1,value2_df1,value2_df2,diff_value2
1,1,1,0,6,6,0
2,10,5,5,8,4,4
3,6,3,3,4,1,3
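In reality I have 100 columns and want to compute the difference between the same column in the two DataFrames, so the column list has to be dynamic.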
If I understand you correctly, what you are trying to achieve is called a join between two DataFrames:
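import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for the example
val spark = SparkSession.builder
  .appName("Simple Application")
  .config("spark.master", "local")
  .getOrCreate()
import spark.implicits._

val df1 = spark.createDataFrame(Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4"))).toDF("id1", "value1", "value2")
val df2 = spark.createDataFrame(Seq((1, "1", "6"), (2, "5", "4"), (4, "3", "1"))).toDF("id1", "value1", "value2")

// Add a generated row key to each side. monotonically_increasing_id() is only guaranteed
// to be unique and increasing, not consecutive, so the keys line up only when each
// DataFrame has a single partition, as in this small local example.
val d1 = df1.withColumn("id", monotonically_increasing_id())
val d2 = df2.withColumn("id", monotonically_increasing_id())

// Inner join on the generated key
val res = d1.join(d2, "id")
res.show()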
Output:
+---+---+------+------+---+------+------+
| id|id1|value1|value2|id1|value1|value2|
+---+---+------+------+---+------+------+
|  0|  1|     1|     6|  1|     1|     6|
|  1|  2|    10|     8|  2|     5|     4|
|  2|  3|     6|     4|  4|     3|     1|
+---+---+------+------+---+------+------+
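As for the column names, you need a list of names to iterate over, but the question does not give enough information to build it. You can find more on working with column names here.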
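One possible way to handle the dynamic columns is sketched below. It is not taken from the answer above. It assumes both DataFrames share the same column names, that a key column such as `id` exists on both sides, and that the value columns hold strings that can be cast to integers. The object `ColumnDiff` and the helper `columnDiffs` are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnDiff {
  // Hypothetical helper: suffixes every non-key column with _df1 / _df2,
  // inner-joins on the key, and adds one diff_<column> per value column.
  def columnDiffs(left: DataFrame, right: DataFrame, key: String): DataFrame = {
    // Take the column list from the left side; assumes the right side has the same names.
    val valueCols = left.columns.filter(_ != key).toSeq

    val l = valueCols.foldLeft(left)((df, c) => df.withColumnRenamed(c, s"${c}_df1"))
    val r = valueCols.foldLeft(right)((df, c) => df.withColumnRenamed(c, s"${c}_df2"))

    // Inner join: keys present on only one side are dropped.
    val joined = l.join(r, Seq(key))

    // For each value column emit: left value, right value, integer difference.
    val selected = col(key) +: valueCols.flatMap { c =>
      Seq(
        col(s"${c}_df1"),
        col(s"${c}_df2"),
        (col(s"${c}_df1").cast("int") - col(s"${c}_df2").cast("int")).as(s"diff_$c")
      )
    }
    joined.select(selected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ColumnDiff")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
    val df2 = Seq((1, "1", "6"), (2, "5", "4"), (4, "3", "1")).toDF("id", "value1", "value2")

    columnDiffs(df1, df2, "id").orderBy("id").show()
    spark.stop()
  }
}
```

Because this sketch joins on the real `id` column, only ids present in both DataFrames appear in the result. To pair rows by position instead, as the join above does, the generated key column could be passed as `key`, with the same single-partition caveat.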