How to count elements in a PySpark DataFrame

Question

I have a PySpark DataFrame containing a movie dataset. One column holds the genres, separated by "|". Each movie can have several genres.

genres = spark.sql("SELECT DISTINCT genres FROM movies ORDER BY genres ASC")
genres.show(5)

I want to count how many movies each genre has, and I also want to show those movies. How should I do that?

Answer 1

Here is one way:

# sample data
d = [('Action',), ('Action|Adventure',), ('Action|Adventure|Drama',)]
df = spark.createDataFrame(d, ['genres',])

# create count
agg_df = (df
          .rdd
          .map(lambda x: x.genres.split('|')) # gives nested list
          .flatMap(lambda x: x) # flatten the list
          .map(lambda x: (x,)) # convert to tuples
          .toDF(['genres'])
          .groupby('genres')
          .count())

agg_df.show()

+---------+-----+
|   genres|count|
+---------+-----+
|Adventure|    2|
|    Drama|    1|
|   Action|    3|
+---------+-----+

Answer 2

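Here is an approach that uses only the DataFrame API. First split the genres string with the split function, then explode the resulting array so that each genre gets its own row, and finally groupBy genres and count:

from pyspark.sql.functions import col, explode, split

data = [["Action"], ["Action|Adventure|Thriller"], ["Action|Adventure|Drama"]]
df = spark.createDataFrame(data, ["genres"])

# split takes a regular expression, so the pipe is wrapped in a character class
df = df.withColumn("genres", explode(split(col("genres"), "[|]"))) \
    .groupBy("genres").count()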

df.show()

This gives:

+---------+-----+
|   genres|count|
+---------+-----+
| Thriller|    1|
|Adventure|    2|
|    Drama|    1|
|   Action|    3|
+---------+-----+

Answer 3

Use:

import pyspark.sql.functions as f

# "Category" stands for the column whose values should be collected and counted per group
df.groupby("genres").agg(f.collect_set("Category"), f.count("Category")).show()

This produces the desired output.
