How to group a Dataset by one column while keeping the rest of each row

Question (1 vote, 2 answers)

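After reading this post, I would like to know how to group a Dataset by one column when each row has several other columns that must be kept.

For example: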

val test = Seq(("New York", "Jack", "jdhj"),
    ("Los Angeles", "Tom", "ff"),
    ("Chicago", "David", "ff"),
    ("Houston", "John", "dd"),
    ("Detroit", "Michael", "fff"),
    ("Chicago", "Andrew", "ddd"),
    ("Detroit", "Peter", "dd"),
    ("Detroit", "George", "dkdjkd")
  )

I would like to get:

Chicago, [("David", "ff"), ("Andrew", "ddd")]

apache-spark dataset
2 Answers

Answer 1 (1 vote)

Create a case class as follows:

case class TestData (location: String, name: String, value: String)

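Dummy data:

import spark.implicits._ // needed for .toDS(); assumes a SparkSession named spark, as in spark-shell

val test = Seq(("New York", "Jack", "jdhj"),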
    ("Los Angeles", "Tom", "ff"),
    ("Chicago", "David", "ff"),
    ("Houston", "John", "dd"),
    ("Detroit", "Michael", "fff"),
    ("Chicago", "Andrew", "ddd"),
    ("Detroit", "Peter", "dd"),
    ("Detroit", "George", "dkdjkd")
  )
    // convert each tuple to a TestData object
    .map(x => TestData(x._1, x._2, x._3))
    .toDS() // create a Dataset from the data above

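To get the required output, group by location and collect the remaining columns as a list of structs:

import org.apache.spark.sql.functions.{collect_list, struct}

test.groupBy($"location")
    .agg(collect_list(struct("name", "value")).as("data")) // one (name, value) struct per row in the group
    .show(false)

Output: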

+-----------+--------------------------------------------+
|location   |data                                        |
+-----------+--------------------------------------------+
|Los Angeles|[[Tom,ff]]                                  |
|Detroit    |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Chicago    |[[David,ff], [Andrew,ddd]]                  |
|Houston    |[[John,dd]]                                 |
|New York   |[[Jack,jdhj]]                               |
+-----------+--------------------------------------------+
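Since the question asks about Datasets specifically, the same grouping can also be written with the typed API instead of untyped column expressions. The following is only a sketch under a few assumptions: it reuses the TestData case class from this answer, the object name GroupByKeySketch and the helper groupByLocation are made up for illustration, and it runs against a local SparkSession rather than a cluster.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Same shape as the TestData case class defined in the answer above.
case class TestData(location: String, name: String, value: String)

// Hypothetical wrapper object, introduced only for this illustration.
object GroupByKeySketch {

  // Groups rows by location and keeps (name, value) for every row in the group.
  def groupByLocation(ds: Dataset[TestData]): Dataset[(String, Seq[(String, String)])] = {
    val spark = ds.sparkSession
    import spark.implicits._ // encoders for the key and the result tuples

    ds.groupByKey(_.location)
      .mapGroups { (location: String, rows: Iterator[TestData]) =>
        // Materialize the iterator before returning; it is only valid inside this function.
        val pairs: Seq[(String, String)] = rows.map(r => (r.name, r.value)).toList
        (location, pairs)
      }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, assumed for the sketch
      .appName("groupByKey-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(
      TestData("Chicago", "David", "ff"),
      TestData("Houston", "John", "dd"),
      TestData("Chicago", "Andrew", "ddd")
    ).toDS()

    groupByLocation(ds).show(false)
    spark.stop()
  }
}
```

Note that mapGroups pulls every row of a group through a single function call, so for large groups the groupBy/agg version above is likely the cheaper choice. In both versions the order of elements inside each group is not guaranteed after the shuffle.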

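Answer 2 (0 votes)

I already suggested the case class approach in the post you linked in the question. Here is a different approach.

RDD way

You can simply do the following (test here is the Seq of tuples from the question, and sc is the SparkContext, as in spark-shell):

val rdd = sc.parallelize(test)                   // create an RDD from test
val resultRdd = rdd.groupBy(x => x._1)           // group by the first element (the city)
  .mapValues(x => x.map(y => (y._2, y._3)))      // keep the second and third elements of each grouped row

resultRdd.foreach(println) should give you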

(New York,List((Jack,jdhj)))
(Houston,List((John,dd)))
(Chicago,List((David,ff), (Andrew,ddd)))
(Detroit,List((Michael,fff), (Peter,dd), (George,dkdjkd)))
(Los Angeles,List((Tom,ff)))
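To check the grouping logic without a SparkContext, the same "group by the first element, then keep the other two" step can be tried on an ordinary Scala collection. This is only a local sketch of the idea, not Spark code; the names LocalGroupingSketch and groupByCity are made up for illustration.

```scala
// Hypothetical object, introduced only for this illustration.
object LocalGroupingSketch {

  // Same sample data as in the question.
  val test: Seq[(String, String, String)] = Seq(
    ("New York", "Jack", "jdhj"),
    ("Los Angeles", "Tom", "ff"),
    ("Chicago", "David", "ff"),
    ("Houston", "John", "dd"),
    ("Detroit", "Michael", "fff"),
    ("Chicago", "Andrew", "ddd"),
    ("Detroit", "Peter", "dd"),
    ("Detroit", "George", "dkdjkd")
  )

  // Group by the first element and keep (second, third) for every row in the group.
  def groupByCity(rows: Seq[(String, String, String)]): Map[String, Seq[(String, String)]] =
    rows
      .groupBy(_._1)
      .map { case (city, group) => city -> group.map(r => (r._2, r._3)) }

  def main(args: Array[String]): Unit =
    groupByCity(test).foreach(println)
}
```

On a local Seq, groupBy keeps the rows of each group in their original order, which the distributed RDD version does not promise.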

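Converting the RDD to a DataFrame

If you need the output in tabular form, you can call .toDF() after a small transformation (the grouped values are converted to an array first):

val df = resultRdd.map(x => (x._1, x._2.toArray)).toDF()

df.show(false) should give you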

+-----------+--------------------------------------------+
|_1         |_2                                          |
+-----------+--------------------------------------------+
|New York   |[[Jack,jdhj]]                               |
|Houston    |[[John,dd]]                                 |
|Chicago    |[[David,ff], [Andrew,ddd]]                  |
|Detroit    |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Los Angeles|[[Tom,ff]]                                  |
+-----------+--------------------------------------------+
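The columns in the table above get the default names _1 and _2 because the rows are plain tuples. As a possible refinement, toDF also accepts explicit column names. The sketch below assumes a local SparkSession and uses the column names location and data (chosen to match the first answer); the object name NamedColumnsSketch and the helper build are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object, introduced only for this illustration.
object NamedColumnsSketch {

  // Builds the grouped DataFrame with explicit column names instead of _1 and _2.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._ // needed for .toDF on an RDD

    val test = Seq(
      ("Chicago", "David", "ff"),
      ("Houston", "John", "dd"),
      ("Chicago", "Andrew", "ddd")
    )

    val resultRdd = spark.sparkContext.parallelize(test)
      .groupBy(x => x._1)                        // group by the city
      .mapValues(x => x.map(y => (y._2, y._3)))  // keep the other two fields

    resultRdd
      .map(x => (x._1, x._2.toArray))
      .toDF("location", "data")                  // assumed column names
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, assumed for the sketch
      .appName("named-columns-sketch")
      .getOrCreate()

    val df = build(spark)
    df.printSchema()
    df.show(false)
    spark.stop()
  }
}
```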