I'm learning PySpark and trying to aggregate some data. Below is the code I have tried.
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.appName("learning").master("local[*]").getOrCreate()
path = "deliveries.csv"
text_df = spark.read.csv(path, sep=",", header=True)
temp_df = text_df.withColumn("runs", text_df["batsman_runs"].cast(IntegerType()))
temp_df.show()
temp_df.cache()
print(temp_df.describe())
print(temp_df.dtypes)
temp_df.groupby('batsman').agg(sum('runs')).show()
Here is the data from the file, with the column "runs" added.
+--------+------+-------------------+--------------------+----+----+------------+------------+-----------+-------------+---------+--------+-----------+-----------+------------+------------+----------+----------+----------------+--------------+-------------+----+
|match_id|inning| batting_team| bowling_team|over|ball| batsman| non_striker| bowler|is_super_over|wide_runs|bye_runs|legbye_runs|noball_runs|penalty_runs|batsman_runs|extra_runs|total_runs|player_dismissed|dismissal_kind| fielder|runs|
+--------+------+-------------------+--------------------+----+----+------------+------------+-----------+-------------+---------+--------+-----------+-----------+------------+------------+----------+----------+----------------+--------------+-------------+----+
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 1| 1| DA Warner| S Dhawan| TS Mills| 0| 0| 0| 0| 0| 0| 0| 0| 0| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 1| 2| DA Warner| S Dhawan| TS Mills| 0| 0| 0| 0| 0| 0| 0| 0| 0| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 1| 3| DA Warner| S Dhawan| TS Mills| 0| 0| 0| 0| 0| 0| 4| 0| 4| null| null| null| 4|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 1| 4| DA Warner| S Dhawan| TS Mills| 0| 0| 0| 0| 0| 0| 0| 0| 0| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 1| 5| DA Warner| S Dhawan| TS Mills| 0| 2| 0| 0| 0| 0| 0| 2| 2| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 1| 6| S Dhawan| DA Warner| TS Mills| 0| 0| 0| 0| 0| 0| 0| 0| 0| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 1| 7| S Dhawan| DA Warner| TS Mills| 0| 0| 0| 1| 0| 0| 0| 1| 1| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 2| 1| S Dhawan| DA Warner|A Choudhary| 0| 0| 0| 0| 0| 0| 1| 0| 1| null| null| null| 1|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 2| 2| DA Warner| S Dhawan|A Choudhary| 0| 0| 0| 0| 0| 0| 4| 0| 4| null| null| null| 4|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 2| 3| DA Warner| S Dhawan|A Choudhary| 0| 0| 0| 0| 1| 0| 0| 1| 1| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 2| 4| DA Warner| S Dhawan|A Choudhary| 0| 0| 0| 0| 0| 0| 6| 0| 6| null| null| null| 6|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 2| 5| DA Warner| S Dhawan|A Choudhary| 0| 0| 0| 0| 0| 0| 0| 0| 0| DA Warner| caught|Mandeep Singh| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 2| 6|MC Henriques| S Dhawan|A Choudhary| 0| 0| 0| 0| 0| 0| 0| 0| 0| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 2| 7|MC Henriques| S Dhawan|A Choudhary| 0| 0| 0| 0| 0| 0| 4| 0| 4| null| null| null| 4|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 3| 1| S Dhawan|MC Henriques| TS Mills| 0| 0| 0| 0| 0| 0| 1| 0| 1| null| null| null| 1|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 3| 2|MC Henriques| S Dhawan| TS Mills| 0| 0| 0| 0| 0| 0| 0| 0| 0| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 3| 3|MC Henriques| S Dhawan| TS Mills| 0| 0| 0| 0| 0| 0| 0| 0| 0| null| null| null| 0|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 3| 4|MC Henriques| S Dhawan| TS Mills| 0| 0| 0| 0| 0| 0| 3| 0| 3| null| null| null| 3|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 3| 5| S Dhawan|MC Henriques| TS Mills| 0| 0| 0| 0| 0| 0| 1| 0| 1| null| null| null| 1|
| 1| 1|Sunrisers Hyderabad|Royal Challengers...| 3| 6|MC Henriques| S Dhawan| TS Mills| 0| 0| 0| 0| 0| 0| 1| 0| 1| null| null| null| 1|
+--------+------+-------------------+--------------------+----+----+------------+------------+-----------+-------------+---------+--------+-----------+-----------+------------+------------+----------+----------+----------------+--------------+-------------+----+
I'm trying to get the sum of runs grouped by batsman. However, I get the following error.
Traceback (most recent call last):
File "ipl.py", line 19, in <module>
temp_df.groupby('batsman').agg(sum('runs')).show()
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Even though the runs column was cast from string to int, as shown above, when I inspected the DataFrame columns with describe and dtypes, they report different data types. Note the last column.
print(temp_df.describe())
DataFrame[summary: string, match_id: string, inning: string, batting_team: string, bowling_team: string, over: string, ball: string, batsman: string, non_striker: string, bowler: string, is_super_over: string, wide_runs: string, bye_runs: string, legbye_runs: string, noball_runs: string, penalty_runs: string, batsman_runs: string, extra_runs: string, total_runs: string, player_dismissed: string, dismissal_kind: string, fielder: string, runs: string]
print(temp_df.dtypes)
[('match_id', 'string'), ('inning', 'string'), ('batting_team', 'string'), ('bowling_team', 'string'), ('over', 'string'), ('ball', 'string'), ('batsman', 'string'), ('non_striker', 'string'), ('bowler', 'string'), ('is_super_over', 'string'), ('wide_runs', 'string'), ('bye_runs', 'string'), ('legbye_runs', 'string'), ('noball_runs', 'string'), ('penalty_runs', 'string'), ('batsman_runs', 'string'), ('extra_runs', 'string'), ('total_runs', 'string'), ('player_dismissed', 'string'), ('dismissal_kind', 'string'), ('fielder', 'string'), ('runs', 'int')]
Why does the data type appear unconverted after the cast? Why do describe and dtypes show different types?
The error occurs because you have not imported Spark's sum function, so `sum('runs')` resolves to Python's builtin `sum`. Import the function from `pyspark.sql.functions` and the error goes away. The cast itself did work, as `dtypes` shows (`runs` is `int`). As for `describe()`: it returns a new DataFrame of summary statistics (count, mean, stddev, min, max) in which every column is a string, so its schema does not reflect the types of your original DataFrame; use `dtypes` or `printSchema()` for that.
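A minimal sketch of the root cause and the fix (the `temp_df` name follows the question; the runnable part below reproduces the error without Spark):

```python
# Without the Spark import, `sum` is Python's builtin, which iterates
# over the string 'runs' and tries 0 + 'r' -- exactly the error above:
try:
    sum('runs')
except TypeError as e:
    print(e)  # unsupported operand type(s) for +: 'int' and 'str'

# The fix in the question's script: import Spark's aggregate function,
# aliased here so it does not shadow the builtin.
# from pyspark.sql.functions import sum as spark_sum
# temp_df.groupby('batsman').agg(spark_sum('runs')).show()
```

Aliasing the import (or using `F.sum` via `import pyspark.sql.functions as F`) is a common convention, since a bare `from pyspark.sql.functions import sum` silently shadows the builtin `sum` for the rest of the module.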