Create 5 partition files based on the vowels 'A', 'E', 'I', 'O', 'U' in a Scala Spark (Core, using RDDs) WordCount program

Question · 1 vote · 1 answer

In a Scala Spark (Core, using RDDs) WordCount program, create 5 partition files based on the vowels 'A', 'E', 'I', 'O', 'U': that is, 5 output files, one per vowel, each containing the words that begin with that vowel along with their counts.

For example, if the file abc.txt contains the following text:

"Apple America Apple Elephant Egg Engine Image Ink Orange Uniform" the 5-part files should be as below

part-0
Apple, 2
America, 1

part-1
Elephant, 1
Egg, 1
Engine, 1

part-2
Image, 1
Ink, 1

part-3
Orange, 1

part-4
Uniform, 1
scala apache-spark rdd
1 Answer

3 votes

Gulrez, you can implement a custom Partitioner for this. For example:

import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VowelWordCount").master("local[*]").getOrCreate()

val rdd = spark.sparkContext.parallelize(
  Seq("Apple America Apple Elephant Egg Engine Image Ink Orange Uniform"))

// Route each key to one of 5 partitions based on its first letter.
val partitioner: Partitioner = new Partitioner {
  override def numPartitions: Int = 5
  override def getPartition(key: Any): Int = key.toString.head match {
    case 'A' => 0
    case 'E' => 1
    case 'I' => 2
    case 'O' => 3
    case 'U' => 4
  }
}

// Split the line into words, group identical words using the custom
// partitioner, then count the occurrences within each group.
val partitionedAndReducedRDD = rdd.flatMap(_.split(" "))
  .groupBy((s: String) => s, partitioner)
  .mapValues(_.size)

partitionedAndReducedRDD.saveAsTextFile("/tmp/output/")
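
One caveat: getPartition above is a partial match, so any word that does not start with one of the five uppercase vowels would throw a scala.MatchError at runtime. It works here only because every word in the sample input begins with a vowel. A minimal defensive sketch (the upper-case normalization and the fallback to partition 0 are assumptions, not part of the original answer):

import org.apache.spark.Partitioner

// Variant that tolerates arbitrary words.
val safePartitioner: Partitioner = new Partitioner {
  override def numPartitions: Int = 5
  override def getPartition(key: Any): Int =
    key.toString.headOption.map(_.toUpper) match {
      case Some('A') => 0
      case Some('E') => 1
      case Some('I') => 2
      case Some('O') => 3
      case Some('U') => 4
      case _         => 0 // assumed fallback for empty or non-vowel keys
    }
}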

Running the original snippet on this input, we get 5 files (one per partition) with the following contents:

$ cat /tmp/output/part-00000 
(Apple,2)
(America,1)

$ cat /tmp/output/part-00001
(Elephant,1)
(Engine,1)
(Egg,1)

$ cat /tmp/output/part-00002
(Image,1)
(Ink,1)

$ cat /tmp/output/part-00003
(Orange,1)

$ cat /tmp/output/part-00004
(Uniform,1)
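
As a side note, groupBy shuffles every occurrence of every word before counting. For larger inputs, reduceByKey (which also accepts a custom Partitioner) combines counts map-side before the shuffle. A minimal sketch reusing the rdd and partitioner defined above; the output path is illustrative:

// Same per-word counts and the same 5-way vowel partitioning,
// but counts are pre-aggregated on each executor before the shuffle.
val counted = rdd.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(partitioner, _ + _)

counted.saveAsTextFile("/tmp/output-reduced/") // illustrative path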