RDD lambda function: confusion between rows and columns

Question

I have a Spark RDD (full code below) and I'm a bit confused.

Given the input data:

385 | 1
291 | 2

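If I have the lambda function below, why do we get x[0] + y[0] = 385 + 291 in reduceByKey? Surely x and y relate to different columns of the RDD? Or should I take this to mean that they refer to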

totalsByAge = rdd2.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y:(x[0] + y[0], x[1] + y[1]))

Full code:

import findspark
findspark.init()
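import pyspark

# Assumption: no SparkContext exists yet, so create (or reuse) one as `sc`
sc = pyspark.SparkContext.getOrCreate()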

#UserID | Name | Age | Num_Friends
#r before the filepath converts it to a raw string
lines = sc.textFile(r"c:\Users\kiera\Downloads\fakefriends.csv") 

# For each line in the file, split it at the commas
# fields[2] is the age
# fields[3] is the number of friends
def splitlines(line):
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    return (age, numFriends)

rdd2 = lines.map(splitlines)
totalsByAge = rdd2.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y:(x[0] + y[0], x[1] + y[1]))

rdd2 looks like this:

[(33, 385),
 (26, 2),
 (55, 221),
 (40, 465),
 (68, 21),
 (59, 318),
 (37, 220),
 (54, 307)....

Answer 1

OK, when you perform the first step:

rdd2 = spark.sparkContext.parallelize([
(33, 385), (26, 2), (55, 221), (40, 465), (68, 21), (59, 318), (37, 220), (54, 307)
])

# Simple count example
# Make a key value pair like ((age, numFriends), 1) 
# Now your key is going to be (age, numFriends) and value is going to be 1
# When you say reduceByKey, it will add up all values for the same key
rdd3 = rdd2.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

totalsByAge = rdd2.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y:(x[0] + y[0], x[1] + y[1]))

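In the case above, what you are doing is:

mapValues creates a pair RDD of (age, (numFriends, 1)).
reduceByKey takes x and y and computes (x[0] + y[0], x[1] + y[1]). Here x and y are not two columns: they are the values of two different elements (rows) of the RDD that share the same key (age).
Because the first element of each pair is the key (age), the reduction is grouped by age. Adding x[0] and y[0] sums numFriends for each age, and adding x[1] and y[1] sums the counter of 1 that the mapValues step attached to each row, which gives the number of rows for each age.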
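
To make the steps above concrete, here is a minimal sketch that imitates mapValues and reduceByKey in plain Python, without Spark. The sample rows are hypothetical (two rows were given age 33 so that the reducer has something to combine), and the names pairs, mapped, grouped, and combine are only for illustration. Spark may combine values in a different order across partitions, but for this addition-based reducer the result is the same.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample rows of (age, numFriends); two rows share the key 33
pairs = [(33, 385), (33, 291), (26, 2)]

# Equivalent of mapValues(lambda x: (x, 1)): the key is untouched,
# each value becomes (numFriends, 1)
mapped = [(age, (friends, 1)) for age, friends in pairs]

# reduceByKey only ever combines values that belong to the same key,
# so collect the values per key first
grouped = defaultdict(list)
for age, value in mapped:
    grouped[age].append(value)

# Same reducer as in the question: x and y are two values for one key,
# each of the form (numFriends, count)
combine = lambda x, y: (x[0] + y[0], x[1] + y[1])

totals_by_age = {age: reduce(combine, values) for age, values in grouped.items()}
print(totals_by_age)  # {33: (676, 2), 26: (2, 1)}
```

For age 33 the reducer receives x = (385, 1) and y = (291, 1), so x[0] + y[0] is 385 + 291 = 676 total friends and x[1] + y[1] is 2 rows. Age 26 has a single value, so the reducer is never called for it.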
