I have a dataset that looks like this:
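1|goldenrod lavender spring chocolate lace|Manufacturer#1|Brand#13|PROMO BURNISHED COPPER|7|JUMBO PKG|901.00|ly. slyly ironi|
2|blush blue yellow saddle|Manufacturer#1|Brand#13|LARGE BRUSHED BRASS|1|LG CASE|902.00|lar accounts amo|
3|spring green yellow purple cornsilk|Manufacturer#4|Brand#42|STANDARD POLISHED BRASS|21|WRAP CASE|903.00|egular deposits hag|
4|cornflower chocolate smoke green pink|Manufacturer#3|Brand#34|SMALL PLATED BRASS|14|MED DRUM|904.00|p furiously r|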
I want to calculate the total sales price for each brand. For example, for Brand#13: 901.00 + 913.00 = 1814.00.
Here is my code:
from operator import add
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("part").setMaster("local[*]")
sc = SparkContext(conf = conf)
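def Func(lines):
    # Return (manufacturer, brand) from one pipe-delimited row.
    lines = lines.split("|")
    return lines[2], lines[3]

def Funcc(lines):
    # Return (brand, price); the price is converted to float so it can be summed.
    lines = lines.split("|")
    return lines[3], float(lines[7])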
text = sc.textFile("part.tbl")
text1 = text.map(Func)
text2 = text.map(Funcc)
sort1 = text1.distinct().sortBy(lambda x:x[0], ascending=True).sortBy(lambda y:y[1], ascending = True)
sort2 = text2.sortBy(lambda x:x[0], ascending=True)
original_text = sort1.collect()
count_by_key = sort2.countByKey()
summe = sort2.reduceByKey(add).collect()
print("Manufacturer and Brands:")
for line in original_text:
    print(line)
print("Number of Items of each Brand")
print(count_by_key)
print(summe)
I am not allowed to use DataFrames. I tried:
summe = sort2.collect()
summe1 = sum(summe[1])
But this code does not work. Error:
    summe1 = sum(summe[1])
TypeError: unsupported operand type(s) for +: 'int' and 'str'
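For anyone hitting the same traceback, here is a minimal plain-Python sketch (no Spark needed) of why it happens. The sample list is made up; it only mimics the shape of what `sort2.collect()` returns, assuming the prices are still strings. `summe[1]` is the second (brand, price) tuple, not a column of prices, and `sum` starts from the integer 0, so the first thing it tries is `0 + 'Brand#13'`.

```python
# Hypothetical sample of what sort2.collect() returns: a list of
# (brand, price) tuples, with the price still a string from split("|").
summe = [("Brand#13", "901.00"), ("Brand#13", "902.00"), ("Brand#34", "904.00")]

second = summe[1]          # a single tuple, not a list of prices
print(second)

try:
    sum(second)            # evaluates 0 + 'Brand#13' and fails
except TypeError as err:
    error_message = str(err)
    print(error_message)

# Summing prices means taking field 1 of every tuple and converting it first.
total = sum(float(price) for _, price in summe)
print(total)
```

Note that this total runs over all brands; getting one total per brand still needs a grouping step.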
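I have the answer now: you can use a simple reduceByKey function, as long as the prices are numbers rather than strings (otherwise + concatenates them):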
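# mapValues(float) makes sure the prices are numbers; on strings, + would concatenate them
preis = sort2.mapValues(float).reduceByKey(lambda x, y: x + y).collect()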
print("Total sales price of each Brand")
print(preis)
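To see what this computes without a Spark cluster, here is a rough plain-Python sketch. `reduce_by_key` is a hypothetical stand-in written for illustration, not Spark's implementation, and the sample pairs are made up. It folds the values of each key with the given function, and it shows why the value type matters: with string prices the "sum" is a concatenation.

```python
from operator import add

def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: fold the values of each key with func."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    # Sorted only to get a stable, readable output order.
    return sorted(totals.items())

# Hypothetical (brand, price) pairs; the prices are strings, as split("|") yields them.
pairs = [("Brand#13", "901.00"), ("Brand#13", "913.00"), ("Brand#42", "903.00")]

as_strings = reduce_by_key(pairs, add)
as_floats = reduce_by_key([(brand, float(price)) for brand, price in pairs], add)

print(as_strings)  # string values get concatenated per brand
print(as_floats)   # float values get summed per brand
```

In Spark itself the order of the collected result is not guaranteed, so sort it afterwards if the order matters.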