Here is my problem.
I have a list of names:
[
["John", "5"]
["Bill", "7"]
["Bill", "7,8"]
["Harry", "0, 1,2"]
["Harry", "2,3"]
["Harry", "3,6"]
["Harry", "4"]
]
I need to merge the strings that share a common number, using Spark Scala or Java MapReduce, to produce the following result:
[
["John", "5"]
["Bill", "7,8"]
["Harry", "0,1,2,3,6"]
["Harry", "4"]
]
Is there a good way to solve this kind of problem?
Thanks!
Use reduceByKey to combine the lists that have the same key, then map again to remove duplicates and turn the result back into a string. With an RDD:
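// rdd is assumed to be an RDD[(String, String)], e.g. ("Harry", "0,1,2")
rdd
  .map(a => (a._1, a._2.split(',').map(_.trim)))  // split into tokens, dropping stray spaces
  .reduceByKey(_ ++ _)                            // concatenate all tokens for the same name
  .map(a => (a._1, a._2.distinct.mkString(",")))  // remove duplicates and rejoin as a string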
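With a DataFrame it is a little more involved, because there is no good way to groupBy and concatenate lists directly, but this produces the same result: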
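import org.apache.spark.sql.functions._

dataframe
  .withColumn("value", split(col("value"), ",").cast("array<integer>"))  // string to array of ints
  .withColumn("value", explode(col("value")))                            // one row per number
  .groupBy(col("name"))
  .agg(collect_set(col("value")) as "value")                             // distinct numbers per name
  .withColumn("value", concat_ws(",", col("value")))                     // back to a comma-separated string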
For both solutions, the input is:
+-----+-----+
| name|value|
+-----+-----+
| John|    5|
| Bill|    7|
| Bill|  7,8|
|Harry|0,1,2|
|Harry|  2,3|
|Harry|  3,6|
|Harry|    4|
+-----+-----+
Output:
+-----+-----------+
| name|      value|
+-----+-----------+
| Bill|        7,8|
| John|          5|
|Harry|0,1,2,6,3,4|
+-----+-----------+
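Note that this output puts every value for a name into one row, so Harry's 4 ends up together with 0,1,2,3,6, whereas the question asks for ["Harry", "4"] to stay separate because it shares no number with the other Harry rows. If that behaviour is needed, the per-name merge has to combine only the sets that overlap. The following is a minimal sketch in plain Scala collections, not part of the original answer; the names MergeOverlapping, mergeGroups, parse and solve are hypothetical, and it assumes every value is a comma-separated list of integers.

```scala
// Hypothetical helper (not from the original answer): merges sets that share
// at least one element, so only overlapping rows for a name are combined.
object MergeOverlapping {
  // Fold each incoming set into the list of disjoint groups built so far.
  def mergeGroups(sets: Seq[Set[Int]]): List[Set[Int]] =
    sets.foldLeft(List.empty[Set[Int]]) { (groups, s) =>
      // Groups that touch s are fused with it; the rest are kept unchanged.
      val (overlapping, disjoint) = groups.partition(g => (g intersect s).nonEmpty)
      overlapping.foldLeft(s)(_ union _) :: disjoint
    }

  // Parse "0, 1,2" into Set(0, 1, 2), tolerating spaces around the commas.
  def parse(value: String): Set[Int] =
    value.split(',').map(_.trim).filter(_.nonEmpty).map(_.toInt).toSet

  // Group rows by name, merge overlapping sets, and render each group as a
  // sorted comma-separated string. Row order is not preserved, hence the Set.
  def solve(rows: Seq[(String, String)]): Set[(String, String)] =
    rows
      .groupBy(_._1)
      .toSeq
      .flatMap { case (name, group) =>
        mergeGroups(group.map(r => parse(r._2)))
          .map(g => (name, g.toSeq.sorted.mkString(",")))
      }
      .toSet
}
```

In Spark, the same helper could presumably be applied per key, for example by mapping each value through parse, calling groupByKey, and then using flatMapValues with mergeGroups. This assumes the values for a single name fit in memory on one executor.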