I have generated an adjacency list as [key, value] pairs in a PySpark RDD from the following edge input file:
7 10
7 8
7 4
8 9
8 5
9 5
9 10
10 6
4 5
5 6
4 6
1 4
1 3
2 3
2 6
3 4
3 6
rdd1 = sc.textFile("/cc/data/data_cc.txt")
rdd2 = rdd1.map(lambda value : value.split()).flatMap(lambda value : [[value[0] , value[1]],[value[1], value[0]]]).reduceByKey(lambda x,y : x+","+y).map(lambda x :[x[0], x[1].split(",")])
rdd2.collect()
[['4', ['6', '1', '3', '7', '5']], ['1', ['4', '3']], ['10', ['7', '9', '6']], ['8', ['7', '9', '5']], ['9', ['8', '5', '10']], ['7', ['10', '8', '4']], ['5', ['8', '9', '4', '6']], ['6', ['10', '5', '4', '2', '3']], ['3', ['1', '2', '4', '6']], ['2', ['3', '6']]]
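To sanity-check the adjacency list without a Spark cluster, the same construction can be sketched in plain Python. This is only a local illustration under the assumption that the edge file fits in memory; `build_adjacency` and `EDGES` are hypothetical names, not part of the code above.

```python
from collections import defaultdict

# Edge list copied from the input file shown above.
EDGES = """7 10
7 8
7 4
8 9
8 5
9 5
9 10
10 6
4 5
5 6
4 6
1 4
1 3
2 3
2 6
3 4
3 6"""

def build_adjacency(edge_text):
    """Build an undirected adjacency list {node: [neighbors]} from 'u v' lines."""
    adjacency = defaultdict(list)
    for line in edge_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        u, v = parts
        # Record the edge in both directions, as the flatMap above does.
        adjacency[u].append(v)
        adjacency[v].append(u)
    return dict(adjacency)

adjacency = build_adjacency(EDGES)
```

The neighbor order may differ from the Spark output, since `reduceByKey` does not guarantee an order, so compare the lists as sets or after sorting.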
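Now I want to convert these key-value pairs into a 2-hop projection.
For example, for key = '4' the value is the list ['6', '1', '3', '7', '5'].
To get its 2-hop projection, I have to replace '6' with ['10', '5', '4', '2', '3'], '1' with ['4', '3'], and so on. The 2-hop projection then looks like this:
['4', [['10', '5', '4', '2', '3'], ['4', '3'], ['1', '2', '4', '6'], ['10', '8', '4'], ['8', '9', '4', '6']]]
Similarly, I have to do this for all key-value pairs.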
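To make the target shape concrete, here is a minimal plain-Python sketch of the 2-hop projection described above, assuming the whole adjacency list fits in a dictionary. `two_hop_projection` is a hypothetical helper name; it only illustrates the expected result and is not a distributed solution.

```python
# Adjacency list copied from the rdd2.collect() output above.
adjacency = {
    '4': ['6', '1', '3', '7', '5'],
    '1': ['4', '3'],
    '10': ['7', '9', '6'],
    '8': ['7', '9', '5'],
    '9': ['8', '5', '10'],
    '7': ['10', '8', '4'],
    '5': ['8', '9', '4', '6'],
    '6': ['10', '5', '4', '2', '3'],
    '3': ['1', '2', '4', '6'],
    '2': ['3', '6'],
}

def two_hop_projection(adjacency):
    """Replace every neighbor by that neighbor's own neighbor list."""
    return {
        key: [adjacency[neighbor] for neighbor in neighbors]
        for key, neighbors in adjacency.items()
    }

projection = two_hop_projection(adjacency)
print(projection['4'])
```

For key '4' this prints the five neighbor lists in the same order as the example above.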
rdd3 = rdd2.flatMap(lambda x : [[(x[0],x[1][k]),x[1]] for k in range(len(x[1]))])
rdd4 = rdd2.flatMap(lambda x : [[(x[1][k],x[0]),x[1]] for k in range(len(x[1]))])
rdd5 = rdd3.join(rdd4)
rdd6 = rdd5.map(lambda x: [x[0][0] , [x[0][1],x[1][1]]]).reduceByKey(lambda x,y : x + y)
These four lines work fine. I get the following output:
[('1', ['4', ['7', '5', '6', '1', '3'], '3', ['1', '2', '4', '6']]), ('10', ['9', ['8', '5', '10'], '7', ['10', '8', '4'], '6', ['5', '4', '2', '3', '10']]), ('2', ['3', ['1', '2', '4', '6'], '6', ['5', '4', '2', '3', '10']]), ('3', ['1', ['4', '3'], '2', ['3', '6'], '6', ['5', '4', '2', '3', '10'], '4', ['7', '5', '6', '1', '3']]), ('4', ['1', ['4', '3'], '5', ['6', '8', '9', '4'], '6', ['5', '4', '2', '3', '10'], '3', ['1', '2', '4', '6'], '7', ['10', '8', '4']]), ('5', ['4', ['7', '5', '6', '1', '3'], '6', ['5', '4', '2', '3', '10'], '8', ['7', '9', '5'], '9', ['8', '5', '10']]), ('6', ['5', ['6', '8', '9', '4'], '2', ['3', '6'], '3', ['1', '2', '4', '6'], '4', ['7', '5', '6', '1', '3'], '10', ['7', '9', '6']]), ('7', ['8', ['7', '9', '5'], '10', ['7', '9', '6'], '4', ['7', '5', '6', '1', '3']]), ('8', ['7', ['10', '8', '4'], '5', ['6', '8', '9', '4'], '9', ['8', '5', '10']]), ('9', ['10', ['7', '9', '6'], '5', ['6', '8', '9', '4'], '8', ['7', '9', '5']])]
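In the output above, each value is a flat list that alternates a neighbor with that neighbor's own adjacency list. If the nested form from the key '4' example is wanted, one possible post-processing step is to regroup the flat list into pairs. The sketch below does this in plain Python on the value for key '1'; `pair_up` is a hypothetical helper name, and it assumes the strict neighbor/list alternation seen in the output.

```python
def pair_up(flat):
    """Turn [n1, list1, n2, list2, ...] into [(n1, list1), (n2, list2), ...]."""
    # Even positions hold neighbors, odd positions hold their adjacency lists.
    return list(zip(flat[0::2], flat[1::2]))

# Value for key '1', copied from the output above.
flat_for_1 = ['4', ['7', '5', '6', '1', '3'], '3', ['1', '2', '4', '6']]

pairs = pair_up(flat_for_1)

# Keep only the 2-hop neighbor lists, dropping the 1-hop neighbor labels.
two_hop_lists = [neighbors for _, neighbors in pairs]
print(two_hop_lists)
```

On the RDD itself, the same function could presumably be applied per key with `rdd6.mapValues(pair_up)`; I have not run that against this data.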