PySpark - aggregate or reduce by multiple keys?

Question · votes: 0 · answers: 1

I have an RDD with tuples in the following format:

((a, (b,c)), (d, f, g))

I want to group by (a, (b,c)) and take the sum of d only.
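For example (illustrative values, assuming d is numeric and f and g are simply carried over from one of the records), two records that share the same key would collapse into one:

(('a1', ('b1', 'c1')), (4, 'f1', 'g1'))
(('a1', ('b1', 'c1')), (6, 'f1', 'g1'))
# becomes
(('a1', ('b1', 'c1')), (10, 'f1', 'g1'))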

How can I group by multiple keys in PySpark, and which function is the better fit in this case, reduceByKey or aggregateByKey?

apache-spark pyspark
1 Answer
1 vote

In this case I am concatenating the two strings, but with numbers it works exactly the same. I notice you skipped the "e" value.

p=["a","b","c","d", "e","f","g"]
def trasforma(p,num):
     l=list()
     for i in range(0,num):
         l.append([j+str(i) for j in p])
     return l
x=sc.parallelize(trasforma(p,10)+trasforma(p,10)).map(lambda x: ((x[0], (x[1],x[2])), (x[3],x[5],x[6])))
x.reduceByKey(lambda x,y: (x[0]+y[0], x[1], x[2] )).collect()
--------OUTPUT--------

[(('a5', ('b5', 'c5')), ('d5d5', 'f5', 'g5')),
 (('a8', ('b8', 'c8')), ('d8d8', 'f8', 'g8')),
 (('a1', ('b1', 'c1')), ('d1d1', 'f1', 'g1')),
 (('a0', ('b0', 'c0')), ('d0d0', 'f0', 'g0')),
 (('a9', ('b9', 'c9')), ('d9d9', 'f9', 'g9')),
 (('a7', ('b7', 'c7')), ('d7d7', 'f7', 'g7')),
 (('a2', ('b2', 'c2')), ('d2d2', 'f2', 'g2')),
 (('a3', ('b3', 'c3')), ('d3d3', 'f3', 'g3')),
 (('a4', ('b4', 'c4')), ('d4d4', 'f4', 'g4')),
 (('a6', ('b6', 'c6')), ('d6d6', 'f6', 'g6'))]
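As for reduceByKey versus aggregateByKey: when the merged result has the same type as the values, as above, reduceByKey is the simpler choice; aggregateByKey pays off when the accumulator type differs from the value type. A minimal sketch, assuming an RDD named rdd with the same ((a, (b,c)), (d, f, g)) layout but numeric d values:

# same aggregation as the answer above, with numeric d
summed = rdd.reduceByKey(lambda a, b: (a[0] + b[0], a[1], a[2]))

# aggregateByKey equivalent when only the sum of d is wanted;
# the accumulator is a plain number, so f and g are dropped
sum_d_only = rdd.aggregateByKey(
    0,                               # zero value for each key
    lambda acc, v: acc + v[0],       # within a partition: add d from each value tuple
    lambda acc1, acc2: acc1 + acc2,  # across partitions: merge the partial sums
)

Both combine values map-side before shuffling, so the performance difference here is negligible; the choice is mostly about expressiveness.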