I created an RDD and printed the result with the following commands:
finalRDD = replacetimestampRDD.map(lambda x: (x[1], x[0:]))
print("Partitions structure: {}".format(finalRDD.glom().collect()))
Output (example):
Partitions structure: [[('a', ['2020-05-22 15:17:10', 'John', '9535175']),
('b', ['2020-05-22 15:17:10', 'Nick', '7383554']),
('c', ['2020-05-22 15:17:10', 'George', '8915433']),
('a', ['2020-05-22 15:17:10', 'Paul', '9615224'])
]]
I am trying to group the results by key (the keys being 'a', 'b', 'c'). Desired output:
Partitions structure: [[('a', [['2020-05-22 15:17:10', 'John', '9535175'],['2020-05-22 15:17:10', 'Paul', '9615224']]),
('b', ['2020-05-22 15:17:10', 'Nick', '7383554']),
('c', ['2020-05-22 15:17:10', 'George', '8915433'])
]]
I tried results = finalRDD.groupByKey().collect(), but it doesn't seem to work.
Can anyone help me?
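groupByKey() does group the rows, but each grouped value comes back as an iterable object rather than a plain list, so the printed result does not look grouped. You can call mapValues(list) after groupByKey() to turn each key's values into a list: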
rdd.groupByKey().mapValues(list).collect()
Output:
[('a',
[['2020-05-22 15:17:10', 'John', '9535175'],
['2020-05-22 15:17:10', 'Paul', '9615224']]),
('b', [['2020-05-22 15:17:10', 'Nick', '7383554']]),
('c', [['2020-05-22 15:17:10', 'George', '8915433']])]