Number of partitions in a Spark RDD

Question (votes: 1, answers: 1)

I am creating an RDD from a text file (Spark 1.6), specifying the number of partitions, but the number of partitions it gives me differs from the number specified.

Case 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 1)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile at <console>:27

scala> people.getNumPartitions
res36: Int = 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 2)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27

scala> people.getNumPartitions
res37: Int = 2

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 3)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at <console>:27

scala> people.getNumPartitions
res38: Int = 3

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 4)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:27

scala> people.getNumPartitions
res39: Int = 4

Case 2

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at textFile at <console>:27

scala> people.getNumPartitions
res47: Int = 1

Case 3

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 5)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[58] at textFile at <console>:27

scala> people.getNumPartitions
res40: Int = 6

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 6)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at textFile at <console>:27

scala> people.getNumPartitions
res41: Int = 7

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 7)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at textFile at <console>:27

scala> people.getNumPartitions
res42: Int = 8

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 8)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[64] at textFile at <console>:27

scala> people.getNumPartitions
res43: Int = 9

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 10)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at textFile at <console>:27

scala> people.getNumPartitions
res45: Int = 11

Case 4

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 9)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at textFile at <console>:27

scala> people.getNumPartitions
res44: Int = 11

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 11)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at textFile at <console>:27

scala> people.getNumPartitions
res46: Int = 13

The contents of the file /home/pvikash/data/test.txt are:

This is a test file. Will be used for rdd partition

Based on the cases above, I have a few questions.

  1. In Case 2 the explicitly specified number of partitions is 0, yet the actual number of partitions is 1 (and the default minimum is 2 partitions); why is the actual number of partitions 1?
  2. In Case 3, why is the actual number of partitions the specified number plus 1?
  3. In Case 4, why is the actual number of partitions the specified number plus 2?
  4. Why does Spark behave differently in Case 1, Case 2, Case 3, and Case 4?
  5. If the input data is small (it can easily fit into a single partition), why does Spark create empty partitions?

Any explanation would be greatly appreciated.
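
For reference, one way to see how the single line actually lands across those partitions (and which ones come out empty) is mapPartitionsWithIndex; a minimal check against the RDD above:

scala> people.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size))).collect()

This returns (partition index, record count) pairs; any pair with a count of 0 is an empty partition.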

apache-spark rdd partition
1 Answer

0 votes

Not a complete answer, but it may get you closer to one.

The number you are passing in is called minSplits (in Spark 1.x the parameter is named minPartitions). It affects the minimum number of partitions, and that is all. The Spark 1.6 signature is:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

The actual number of splits is governed by the getSplits method of the underlying Hadoop InputFormat (docs).
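
To make the surplus in Cases 3 and 4 concrete: for an uncompressed local file, the old-API FileInputFormat cuts the input into chunks of splitSize = max(minSize, min(totalSize / numSplits, blockSize)) using integer division, allows a chunk to run 10% over, and turns any remaining tail into one extra split. Below is a minimal sketch of that arithmetic, not Hadoop's actual source; it assumes test.txt is 52 bytes (the 51-character line plus a newline):

object SplitEstimate {
  // Mirrors the chunking loop of Hadoop's old-API FileInputFormat.getSplits.
  def estimateSplits(totalSize: Long, numSplits: Int,
                     blockSize: Long = 32L * 1024 * 1024): Int = {
    val goalSize   = totalSize / math.max(numSplits, 1)          // integer division
    val splitSize  = math.max(1L, math.min(goalSize, blockSize)) // minSize defaults to 1
    val SPLIT_SLOP = 1.1                                         // a chunk may run 10% over
    var remaining  = totalSize
    var splits     = 0
    while (remaining.toDouble / splitSize > SPLIT_SLOP) {
      splits    += 1
      remaining -= splitSize
    }
    if (remaining > 0) splits += 1  // leftover bytes become one extra split
    splits
  }

  def main(args: Array[String]): Unit = {
    println(estimateSplits(52, 5))   // 6, matching Case 3
    println(estimateSplits(52, 9))   // 11, matching Case 4
    println(estimateSplits(52, 11))  // 13, matching Case 4
  }
}

With numSplits = 5, goalSize is 52 / 5 = 10, so five 10-byte chunks are cut and the 2-byte tail becomes a sixth split; whether the surplus is +1 or +2 simply depends on how the integer division and the slop factor line up with the file size.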

This SO post should answer question 5.
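
The gist of that post: a text split yields only the lines that start inside it, so once splitSize shrinks below the length of a line, most splits contain no line start and come out empty. A hypothetical illustration of that rule (not Hadoop's actual record reader):

// Each split [i*splitSize, (i+1)*splitSize) gets only the lines whose first byte falls inside it.
val content    = "This is a test file. Will be used for rdd partition\n" // assumed 52-byte file
val splitSize  = 4                                                       // e.g. goalSize for 13 splits
val lineStarts = Seq(0)                                                  // the only line starts at byte 0
val numSplits  = (content.length + splitSize - 1) / splitSize
val counts = (0 until numSplits).map { i =>
  lineStarts.count(s => s >= i * splitSize && s < (i + 1) * splitSize)
}
println(counts.mkString(" "))  // 1 0 0 0 ... only the split containing byte 0 gets the line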
