Filtering an RDD to return pairs

Question · votes: -1 · answers: 1

My call to test_rdd.cartesian(test_rdd) returns an RDD of pairs, like this:

[((1, 0), (1, 0)),
 ((1, 0), (2, 0)),
 ((1, 0), (3, 0)),
 ((2, 0), (1, 0)),
 ((2, 0), (2, 0)),
 ((2, 0), (3, 0)),
 ((3, 0), (1, 0)),
 ((3, 0), (2, 0)),
 ((3, 0), (3, 0))]

I need to remove the entries where both elements are equal (e.g. ((1, 0), (1, 0))).

Since I'm just getting started with RDDs and Spark, I may be missing something very basic.

Could you point me in the right direction?

apache-spark pyspark rdd
1 Answer
0
votes

You can try the following Scala code:

val array = Array(((1, 0), (1, 0)),
 ((1, 0), (2, 0)),
 ((1, 0), (3, 0)),
 ((2, 0), (1, 0)),
 ((2, 0), (2, 0)),
 ((2, 0), (3, 0)),
 ((3, 0), (1, 0)),
 ((3, 0), (2, 0)),
 ((3, 0), (3, 0)))

val rdd = sc.parallelize(array) // create the RDD
val filteredRDD = rdd.filter(row => row._1 != row._2) // keep pairs whose two elements differ
filteredRDD.collect() // call an action to materialize the result

Result:

Array[((Int, Int), (Int, Int))] = Array(((1,0),(2,0)), ((1,0),(3,0)), ((2,0),(1,0)), ((2,0),(3,0)), ((3,0),(1,0)), ((3,0),(2,0)))

For PySpark, you can use the following code:

array = [((1, 0), (1, 0)), ((1, 0), (2, 0)), ((1, 0), (3, 0)), ((2, 0), (1, 0)), ((2, 0), (2, 0)), ((2, 0), (3, 0)), ((3, 0), (1, 0)), ((3, 0), (2, 0)), ((3, 0), (3, 0))]

rdd = sc.parallelize(array)  # create the RDD
filteredRDD = rdd.filter(lambda row: row[0] != row[1])  # keep pairs whose two elements differ
filteredRDD.collect()  # call an action to materialize the result

Result:

[((1, 0), (2, 0)),
 ((1, 0), (3, 0)),
 ((2, 0), (1, 0)),
 ((2, 0), (3, 0)),
 ((3, 0), (1, 0)),
 ((3, 0), (2, 0))]
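If you want to sanity-check the filtering predicate without a Spark cluster, the same logic can be sketched in plain Python, using itertools.product as a local stand-in for RDD.cartesian (the values list below is an assumption matching the question's data):

```python
from itertools import product

# Stand-in for the contents of test_rdd (assumed from the question's output).
values = [(1, 0), (2, 0), (3, 0)]

# itertools.product(values, values) mimics test_rdd.cartesian(test_rdd) locally.
pairs = list(product(values, values))

# Same predicate as the rdd.filter calls above: drop pairs whose elements are equal.
filtered = [(a, b) for (a, b) in pairs if a != b]

print(filtered)
# [((1, 0), (2, 0)), ((1, 0), (3, 0)), ((2, 0), (1, 0)),
#  ((2, 0), (3, 0)), ((3, 0), (1, 0)), ((3, 0), (2, 0))]
```

Once the predicate behaves as expected locally, the identical lambda can be passed to rdd.filter.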