Python Spark: how to join 2 datasets with more than 2 elements per tuple

Question · 0 votes · 1 answer

I am trying to join the data in these two datasets on the common "stock" key:

stock, sector
GOOG Tech

stock, date, volume
GOOG 2015 5759725

The join method should combine them, but the RDD I get back has the following format:

GOOG, (Tech, 2015)

What I would like to get is:

(Tech, 2015) 5759725

Also, how can I reduce the result by key (e.g. (Tech, 2015)) to get the sum of the volumes for each sector and year?

Many thanks in advance!
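For context, the RDD join only operates on (key, value) pairs, so any extra fields have to be packed into the value tuple before joining, and the joined records then re-keyed by (sector, year) for the reduction. A minimal sketch of that approach, assuming the records are already parsed into tuples (the variable names here are illustrative, not from the question):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# two pair RDDs keyed by stock; every non-key field goes into the value
sectors = sc.parallelize([("GOOG", "Tech")])             # (stock, sector)
volumes = sc.parallelize([("GOOG", ("2015", 5759725))])  # (stock, (date, volume))

# join keeps one value per side: (stock, (sector, (date, volume)))
joined = sectors.join(volumes)

# re-key by (sector, date) and keep the volume as the value
by_sector_year = joined.map(lambda kv: ((kv[1][0], kv[1][1][0]), kv[1][1][1]))

# sum the volumes per (sector, year) key
totals = by_sector_year.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # [(('Tech', '2015'), 5759725)]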

apache-spark pyspark
1 Answer

0 votes

Hope this helps!

from pyspark.sql.functions import struct, col, sum

#sample data
df1 = sc.parallelize([['GOOG', 'Tech'],
                      ['AAPL', 'Tech'],
                      ['XOM', 'Oil']]).toDF(["stock","sector"])
df2 = sc.parallelize([['GOOG', '2015', '5759725'],
                      ['AAPL', '2015', '123'],
                      ['XOM',  '2015', '234'],
                      ['XOM',  '2016', '789']]).toDF(["stock","date","volume"])

#final output
df = df1.join(df2, ['stock'], 'inner').\
    withColumn('sector_year', struct(col('sector'), col('date'))).\
    drop('stock','sector','date')
df.show()

#numerical summation for each sector and year
df.groupBy('sector_year').agg(sum('volume')).show()

The output is:

+-------+-----------+
| volume|sector_year|
+-------+-----------+
|    123|[Tech,2015]|
|    234| [Oil,2015]|
|    789| [Oil,2016]|
|5759725|[Tech,2015]|
+-------+-----------+

+-----------+-----------+
|sector_year|sum(volume)|
+-----------+-----------+
|[Tech,2015]|  5759848.0|
| [Oil,2015]|      234.0|
| [Oil,2016]|      789.0|
+-----------+-----------+
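Two small notes on the above: volume was created as a string, so sum implicitly casts it to double, which is why the totals end in .0; and the struct column is only needed if you want a single composite key. If separate columns are fine, a variant like the following (a sketch against the same df1/df2 as above, with an explicit cast) keeps the sums integral:

from pyspark.sql.functions import col, sum as sum_

df1.join(df2, ['stock'], 'inner') \
   .withColumn('volume', col('volume').cast('long')) \
   .groupBy('sector', 'date') \
   .agg(sum_('volume').alias('total_volume')) \
   .show()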