How do I count the elements in an array column?


I am trying to count the occurrences of each element in the FavouriteCities column of the following DataFrame.

+-----------------+
| FavouriteCities |
+-----------------+
|   [NY, Canada]  |
+-----------------+

The schema is as follows:

scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
|    |-- element: string (containsNull = true)

The expected output should look like this:

+------------+-------------+
|  City      |      Count  |
+------------+-------------+
| NY         |      1      |
| Canada     |      1      |
+------------+-------------+

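I tried agg() with count() as shown below, but it fails to extract the individual elements from the array and instead tries to find the most common set of elements in the column.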

data.agg(count("FavouriteCities").alias("count"))

Can someone guide me?

scala apache-spark apache-spark-sql spark-dataframe
1 Answer

To match the schema you have shown:

scala> val data = Seq(Tuple1(Array("NY", "Canada"))).toDF("FavouriteCities")
data: org.apache.spark.sql.DataFrame = [FavouriteCities: array<string>]
scala> data.printSchema
root
 |-- FavouriteCities: array (nullable = true)
 |    |-- element: string (containsNull = true)

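Explode:

// explode emits one row per array element; the alias must be applied to
// the result of explode, not to its argument, so the column is named "City"
val counts = data
  .select(explode($"FavouriteCities") as "City")
  .groupBy("City")
  .count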

And aggregate to find the most common element:

scala> import spark.implicits._
scala> counts.as[(String, Long)].reduce((a, b) => if (a._2 > b._2) a else b)
res3: (String, Long) = (Canada,1)
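
For use outside the spark-shell, the explode, group, and aggregate steps can be put together roughly as in the sketch below. This is only an illustration under stated assumptions: the object name `FavouriteCityCounts`, the helper `topCity`, the local master setting, and the two sample rows are all hypothetical and not part of the question. It also swaps the `reduce` for an `orderBy(desc("count")).limit(1)`, which is an alternative way to pick the most frequent city, not the approach used in the transcript.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc, explode}

object FavouriteCityCounts {

  // Hypothetical helper: returns the most frequent city and its count.
  def topCity(spark: SparkSession): (String, Long) = {
    import spark.implicits._

    // Hypothetical sample rows (not the data from the question).
    val data = Seq(
      Tuple1(Array("NY", "Canada")),
      Tuple1(Array("NY", "London"))
    ).toDF("FavouriteCities")

    // One output row per array element, then count the rows per city.
    val counts = data
      .select(explode(col("FavouriteCities")).as("City"))
      .groupBy("City")
      .count()

    // Sort by count, highest first, and keep only the top row.
    counts
      .orderBy(desc("count"))
      .limit(1)
      .as[(String, Long)]
      .head()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("favourite-city-counts")
      .getOrCreate()

    println(topCity(spark)) // prints (NY,2) for the sample rows above

    spark.stop()
  }
}
```

When several cities share the highest count, as in the single-row example from the question, neither this sort nor the `reduce` guarantees which of the tied cities is returned.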