Write a WordCount program in Scala for Spark (Core, using RDDs) that produces 5 partition files based on the vowels 'A', 'E', 'I', 'O', 'U' — one file per vowel, containing the words that start with that vowel.
For example, if the file abc.txt contains the text
"Apple America Apple Elephant Egg Engine Image Ink Orange Uniform", the 5 part files should look like this:
part-0
Apple, 2
America, 1
part-1
Elephant, 1
Egg, 1
Engine, 1
part-2
Image, 1
Ink, 1
part-3
Orange, 1
part-4
Uniform, 1
Gulrez, you can implement a custom Partitioner for this. For example:
import org.apache.spark.Partitioner

val rdd = spark.sparkContext.parallelize(Seq("Apple America Apple Elephant Egg Engine Image Ink Orange Uniform"))

val partitioner: Partitioner = new Partitioner {
  override def numPartitions: Int = 5

  // Route each word to a partition based on its first letter.
  override def getPartition(key: Any): Int = key.toString.head match {
    case 'A' => 0
    case 'E' => 1
    case 'I' => 2
    case 'O' => 3
    case 'U' => 4
    case _   => 0 // fallback so words not starting with a vowel don't throw a MatchError
  }
}

val partitionedAndReducedRDD = rdd.flatMap(_.split(" "))
  .groupBy({ s: String => s }, partitioner)
  .mapValues(_.size)

partitionedAndReducedRDD.saveAsTextFile("/tmp/output/")
Then we get 5 files (one per partition) with the following contents:
$ cat /tmp/output/part-00000
(Apple,2)
(America,1)
$ cat /tmp/output/part-00001
(Elephant,1)
(Engine,1)
(Egg,1)
$ cat /tmp/output/part-00002
(Image,1)
(Ink,1)
$ cat /tmp/output/part-00003
(Orange,1)
$ cat /tmp/output/part-00004
(Uniform,1)
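As a side note, `groupBy` materializes every occurrence of each word before counting, which can be expensive for large inputs. A `reduceByKey` variant combines counts map-side while still honoring the same custom partitioner; a sketch under the same assumptions (a `SparkSession` named `spark`, and the `partitioner` defined above):

```scala
// Sketch: same vowel-based partitioning, but aggregating with reduceByKey,
// which sums counts per word map-side instead of grouping all occurrences.
val counts = rdd.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(partitioner, _ + _) // overload of reduceByKey that takes a Partitioner
counts.saveAsTextFile("/tmp/output-reduced/")
```

The output files have the same per-vowel layout as before, since the partition assignment is driven by the same `getPartition` logic.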