Spark RDD default number of partitions - scala

Version: Spark 1.6.2, Scala 2.10
I am executing the commands below in the spark-shell.
I am trying to see the number of partitions that Spark creates by default.
val rdd1 = sc.parallelize(1 to 10)
println(rdd1.getNumPartitions) // ==> Result is 4
//Creating rdd for the local file test1.txt. It is not HDFS.
//File content is just one word "Hello"
val rdd2 = sc.textFile("C:/test1.txt")
println(rdd2.getNumPartitions) // ==> Result is 2
As per the Apache Spark documentation, spark.default.parallelism is the number of cores on my laptop (which has a 2-core processor).
My question is: rdd2 seems to give the correct result of 2 partitions, as stated in the documentation. But why does rdd1 give 4 partitions?

The minimum number of partitions is actually a lower bound set by the SparkContext. Since Spark uses Hadoop under the hood, the Hadoop InputFormat still determines the default behaviour.
The first case should reflect defaultParallelism as mentioned here, which may differ depending on settings and hardware (number of cores, etc.).
So unless you provide the number of slices, the first case is defined by the value of sc.defaultParallelism:
scala> sc.defaultParallelism
res0: Int = 6
scala> sc.parallelize(1 to 100).partitions.size
res1: Int = 6
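To see both rules for parallelize side by side outside the shell, here is a minimal sketch. The object and method names are hypothetical, and the local[4] master is an assumption chosen for illustration: with a local[N] master and no spark.default.parallelism setting, sc.defaultParallelism is N, so a dual-core laptop that exposes four logical cores would plausibly report 4.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeDefaults {
  // Returns (defaultParallelism, partitions without numSlices, partitions with numSlices = 3).
  def partitionCounts(): (Int, Int, Int) = {
    // Assumption: a local master with 4 threads and no spark.default.parallelism override.
    val conf = new SparkConf().setMaster("local[4]").setAppName("parallelize-defaults")
    val sc = new SparkContext(conf)
    try {
      val default = sc.defaultParallelism
      val implicitSlices = sc.parallelize(1 to 10).getNumPartitions
      val explicitSlices = sc.parallelize(1 to 10, 3).getNumPartitions
      (default, implicitSlices, explicitSlices)
    } finally {
      sc.stop()
    }
  }
}
```

Passing numSlices explicitly is the only way to make the result independent of the machine the code runs on.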
As for the second case, with sc.textFile, the default number of slices is the minimum number of partitions (defaultMinPartitions, i.e. min(defaultParallelism, 2)).
That is equal to 2, as you can see in this section of code.
Thus, you should consider the following:
sc.parallelize will take numSlices or defaultParallelism.
sc.textFile will take the maximum between minPartitions and the number of splits computed based on the Hadoop input split size divided by the block size.
sc.textFile calls sc.hadoopFile, which creates a HadoopRDD that uses InputFormat.getSplits under the hood [Ref. InputFormat documentation].
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException: Logically split the set of input files for the job.
Each InputSplit is then assigned to an individual Mapper for processing.
Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For e.g. a split could be <input-file-path, start, offset> tuple.
Parameters: job - job configuration. numSplits - the desired number of splits, a hint.
Returns: an array of InputSplits for the job.
Throws: IOException.
Example:
Let's create some dummy text files:
fallocate -l 241m bigfile.txt
fallocate -l 4G hugefile.txt
This will create 2 files, respectively, of size 241MB and 4GB.
We can see what happens when we read each of the files:
scala> val rdd = sc.textFile("bigfile.txt")
// rdd: org.apache.spark.rdd.RDD[String] = bigfile.txt MapPartitionsRDD[1] at textFile at <console>:27
scala> rdd.getNumPartitions
// res0: Int = 8
scala> val rdd2 = sc.textFile("hugefile.txt")
// rdd2: org.apache.spark.rdd.RDD[String] = hugefile.txt MapPartitionsRDD[3] at textFile at <console>:27
scala> rdd2.getNumPartitions
// res1: Int = 128
Both of them are actually HadoopRDDs:
scala> rdd.toDebugString
// res2: String =
// (8) bigfile.txt MapPartitionsRDD[1] at textFile at <console>:27 []
// | bigfile.txt HadoopRDD[0] at textFile at <console>:27 []
scala> rdd2.toDebugString
// res3: String =
// (128) hugefile.txt MapPartitionsRDD[3] at textFile at <console>:27 []
// | hugefile.txt HadoopRDD[2] at textFile at <console>:27 []
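The 8 and 128 above can be reproduced with a little arithmetic. The following is a simplified sketch of the split-size rule used by the old-API FileInputFormat.getSplits for a single splittable file. The object and function names are hypothetical, and the 32 MiB block size is an assumption about the local filesystem default (fs.local.block.size), not something stated above.

```scala
object SplitEstimate {
  // A trailing chunk up to 10% larger than the split size is not split again.
  val SplitSlop = 1.1

  // Simplified model: one splittable file, sizes in bytes.
  def estimateSplits(totalBytes: Long, minPartitions: Int, blockSize: Long, minSize: Long = 1L): Int = {
    // The requested number of splits is only a hint that yields a "goal" size.
    val goalSize = totalBytes / math.max(minPartitions, 1)
    // The split size is capped by the block size and floored by minSize.
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))
    var remaining = totalBytes
    var splits = 0
    while (remaining.toDouble / splitSize > SplitSlop) {
      splits += 1
      remaining -= splitSize
    }
    // Whatever is left becomes one last split.
    if (remaining != 0) splits += 1
    splits
  }
}
```

Under these assumptions, SplitEstimate.estimateSplits(241L * 1024 * 1024, 2, 32L * 1024 * 1024) gives 8 and the same call with 4L * 1024 * 1024 * 1024 bytes gives 128, matching the shell output.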

Related

How to check the number of partitions of a Spark DataFrame without incurring the cost of .rdd

There are a number of questions about how to obtain the number of partitions of an RDD and/or a DataFrame; the answers invariably are:
rdd.getNumPartitions
or
df.rdd.getNumPartitions
Unfortunately that is an expensive operation on a DataFrame because the
df.rdd
requires conversion from the DataFrame to an rdd. This is on the order of the time it takes to run
df.count
I am writing logic that optionally repartitions or coalesces a DataFrame, based on whether the current number of partitions is within a range of acceptable values or instead below or above it.
def repartition(inDf: DataFrame, minPartitions: Option[Int],
    maxPartitions: Option[Int]): DataFrame = {
  val inputPartitions = inDf.rdd.getNumPartitions // EXPENSIVE!
  val outDf = minPartitions.flatMap { minp =>
    if (inputPartitions < minp) {
      info(s"Repartition the input from $inputPartitions to $minp partitions..")
      Option(inDf.repartition(minp))
    } else {
      None
    }
  }.getOrElse(maxPartitions.map { maxp =>
    if (inputPartitions > maxp) {
      info(s"Coalesce the input from $inputPartitions to $maxp partitions..")
      inDf.coalesce(maxp)
    } else inDf
  }.getOrElse(inDf))
  outDf
}
But we cannot afford to incur the cost of rdd.getNumPartitions for every DataFrame in this manner.
Is there any way to obtain this information, e.g. by querying the online/temporary catalog for the registered table?
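For reference, the branching in the function above does not depend on Spark at all, so it can be separated from the partition lookup. The sketch below restates it as a pure function; the names PartitionPlan, Action and decide are hypothetical and only meant for illustration.

```scala
object PartitionPlan {
  sealed trait Action
  final case class Repartition(to: Int) extends Action
  final case class Coalesce(to: Int) extends Action
  case object Keep extends Action

  // Same precedence as the function above: the lower bound is checked first,
  // then the upper bound, otherwise the input is left untouched.
  def decide(current: Int, minPartitions: Option[Int], maxPartitions: Option[Int]): Action =
    (minPartitions, maxPartitions) match {
      case (Some(minp), _) if current < minp => Repartition(minp)
      case (_, Some(maxp)) if current > maxp => Coalesce(maxp)
      case _ => Keep
    }
}
```

With this split, the only Spark-specific part left is obtaining the current partition count to pass as current.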
Update: The Spark GUI showed the DataFrame.rdd operation taking as long as the longest sql in the job. I will re-run the job and attach the screenshot in a bit here.
The following is just a test case: it uses a small fraction of the production data size. The longest sql takes only five minutes, and this one is on its way to spending that amount of time as well (note that the sql is not helped out here: it also has to execute subsequently, thus effectively doubling the cumulative execution time).
We can see that the .rdd operation at DataFrameUtils line 30 (shown in the snippet above) takes 5.1 mins, and yet the save operation still took 5.2 mins later, i.e. we did not save any time by doing the .rdd in terms of the execution time of the subsequent save.
There is no inherent cost of the rdd component in rdd.getNumPartitions, because the returned RDD is never evaluated.
While you can easily determine this empirically, using a debugger (I'll leave this as an exercise for the reader), or by establishing that no jobs are triggered in the base case scenario
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val ds = spark.read.text("README.md")
ds: org.apache.spark.sql.DataFrame = [value: string]
scala> ds.rdd.getNumPartitions
res0: Int = 1
scala> spark.sparkContext.statusTracker.getJobIdsForGroup(null).isEmpty // Check if there are any known jobs
res1: Boolean = true
it might not be enough to convince you. So let's approach this in a more systematic way:
rdd returns a MapPartitionsRDD (ds as defined above):
scala> ds.rdd.getClass
res2: Class[_ <: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]] = class org.apache.spark.rdd.MapPartitionsRDD
RDD.getNumPartitions invokes RDD.partitions.
In the non-checkpointed scenario RDD.partitions invokes getPartitions (feel free to trace the checkpoint path as well).
RDD.getPartitions is abstract.
So the actual implementation used in this case is MapPartitionsRDD.getPartitions, which simply delegates the call to the parent.
There are only MapPartitionsRDDs between rdd and the source:
scala> ds.rdd.toDebugString
res3: String =
(1) MapPartitionsRDD[3] at rdd at <console>:26 []
| MapPartitionsRDD[2] at rdd at <console>:26 []
| MapPartitionsRDD[1] at rdd at <console>:26 []
| FileScanRDD[0] at rdd at <console>:26 []
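The delegation chain described above can be mimicked with a toy model. This is a hypothetical miniature, not Spark's real classes: MiniRDD, SourceRDD and MappedRDD are invented names, and the computeCalls counter exists only to show that asking for the number of partitions never touches the data.

```scala
object LineageSketch {
  abstract class MiniRDD {
    // Subclasses decide how partitions are found.
    protected def getPartitions: Array[Int]
    // Looked up once and cached, loosely like RDD.partitions.
    lazy val partitions: Array[Int] = getPartitions
    def getNumPartitions: Int = partitions.length
    // Stands in for actually reading and transforming rows.
    def compute(): Seq[String]
  }

  class SourceRDD(numPartitions: Int) extends MiniRDD {
    var computeCalls = 0
    protected def getPartitions: Array[Int] = Array.range(0, numPartitions)
    def compute(): Seq[String] = { computeCalls += 1; Seq("row") }
  }

  class MappedRDD(parent: MiniRDD) extends MiniRDD {
    // Delegates to the parent, as MapPartitionsRDD does.
    protected def getPartitions: Array[Int] = parent.partitions
    def compute(): Seq[String] = parent.compute().map(_.toUpperCase)
  }
}
```

Stacking several MappedRDD instances on one SourceRDD and calling getNumPartitions on the top one walks down to the source without ever invoking compute.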
Similarly, if the Dataset contained an exchange, we would follow the parents to the nearest shuffle:
scala> ds.orderBy("value").rdd.toDebugString
res4: String =
(67) MapPartitionsRDD[13] at rdd at <console>:26 []
| MapPartitionsRDD[12] at rdd at <console>:26 []
| MapPartitionsRDD[11] at rdd at <console>:26 []
| ShuffledRowRDD[10] at rdd at <console>:26 []
+-(1) MapPartitionsRDD[9] at rdd at <console>:26 []
| MapPartitionsRDD[5] at rdd at <console>:26 []
| FileScanRDD[4] at rdd at <console>:26 []
Note that this case is particularly interesting, because we actually triggered a job:
scala> spark.sparkContext.statusTracker.getJobIdsForGroup(null).isEmpty
res5: Boolean = false
scala> spark.sparkContext.statusTracker.getJobIdsForGroup(null)
res6: Array[Int] = Array(0)
That's because we've encountered a scenario where the partitions cannot be determined statically (see Number of dataframe partitions after sorting? and Why does sortBy transformation trigger a Spark job?).
In such a scenario getNumPartitions will also trigger a job:
scala> ds.orderBy("value").rdd.getNumPartitions
res7: Int = 67
scala> spark.sparkContext.statusTracker.getJobIdsForGroup(null) // Note new job id
res8: Array[Int] = Array(1, 0)
However, it doesn't mean that the observed cost is somehow related to the .rdd call. Instead it is the intrinsic cost of finding partitions in cases where there is no static formula (some Hadoop input formats, for example, where a full scan of the data is required).
Please note that the points made here shouldn't be extrapolated to other applications of Dataset.rdd. For example ds.rdd.count would indeed be expensive and wasteful.
In my experience df.rdd.getNumPartitions is very fast; I have never seen it take more than a second or so.
Alternatively, you could also try
val numPartitions: Long = df
.select(org.apache.spark.sql.functions.spark_partition_id()).distinct().count()
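which would avoid using .rdd, although it runs a job over the data and counts only the partitions that contain at least one row.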
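The two approaches do not always agree, because spark_partition_id is evaluated per row and empty partitions therefore never show up. A minimal sketch of the difference, assuming a local[2] session and a hypothetical object name; the 3-row, 8-partition range is chosen only to force some empty partitions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

object PartitionCountSketch {
  // Returns (partitions reported by the RDD, distinct partition ids that hold rows).
  def counts(): (Int, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local session for illustration
      .appName("partition-count-sketch")
      .getOrCreate()
    try {
      // 3 rows spread over 8 partitions, so several partitions stay empty.
      val df = spark.range(0L, 3L, 1L, 8).toDF("id")
      val physical = df.rdd.getNumPartitions
      val nonEmpty = df.select(spark_partition_id()).distinct().count()
      (physical, nonEmpty)
    } finally {
      spark.stop()
    }
  }
}
```

Here the first number is the partition count of the plan, while the second can be at most the number of rows, so the distinct-count variant is best read as "partitions that hold data".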

Why does Spark increment the RDD ID by 2 instead of 1 when reading in text files?

I noticed something interesting when working with the spark-shell and I'm curious as to why this is happening. I load a text file into Spark using the basic syntax, and then I just simply repeat this command. The output of the REPL is below:
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[3] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[5] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[7] at textFile at <console>:24
I know that the MapPartitionsRDD[X] portion features X as the RDD identifier. However, based upon this SO post on RDD identifiers, I'd expect that the identifier integer increments by one each time a new RDD is created. So why exactly is it incrementing by 2?
My guess is that loading a text file creates an intermediate RDD? Because clearly creating an RDD from parallelize() only increments the RDD counter by 1 (before it was 7):
scala> val arrayrdd = sc.parallelize(Array(3,4,5))
arrayrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
Note: I don't believe the number has anything to do with partitions. If I call partitions.size, I get that my RDD is split into 9 partitions:
scala> myreviews.partitions.size
res2: Int = 9
Because a single method call can create more than one intermediate RDD. This becomes obvious if you check the debug string:
sc.textFile("README.md").toDebugString
String =
(2) README.md MapPartitionsRDD[1] at textFile at <console>:25 []
| README.md HadoopRDD[0] at textFile at <console>:25 []
As you can see, the lineage consists of two RDDs.
The first one is a HadoopRDD, which corresponds to the data import:
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions)
The second one is a MapPartitionsRDD and corresponds to the subsequent map, which drops the keys (offsets) and converts Text to String:
.map(pair => pair._2.toString).setName(path)
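The numbering seen in the shell follows directly from this. Below is a toy model of the id counter, with hypothetical names (MiniContext, Node) that are not Spark classes: every node created takes the next id from a shared counter, textFile creates two nodes per call, and parallelize creates one.

```scala
import java.util.concurrent.atomic.AtomicInteger

object RddIdSketch {
  final case class Node(id: Int, kind: String, parent: Option[Node])

  class MiniContext {
    // One counter per context; each new node takes the next value.
    private val nextRddId = new AtomicInteger(0)

    private def newNode(kind: String, parent: Option[Node]): Node =
      Node(nextRddId.getAndIncrement(), kind, parent)

    // Two nodes per call: the file reader, then the map on top of it.
    // The path is unused in this toy model.
    def textFile(path: String): Node = {
      val hadoop = newNode("HadoopRDD", None)
      newNode("MapPartitionsRDD", Some(hadoop))
    }

    // One node per call.
    def parallelize(): Node = newNode("ParallelCollectionRDD", None)
  }
}
```

Calling textFile four times on a fresh MiniContext returns nodes with ids 1, 3, 5 and 7, and a following parallelize gets id 8, the same sequence as in the question.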

In Apache Spark cogroup, how to make sure 1 RDD of >2 operands is not moved?

In a cogroup transformation, e.g. RDD1.cogroup(RDD2, ...), I used to assume that Spark only shuffles/moves RDD2 and retains RDD1's partitioning and in-memory storage if:
RDD1 has an explicit partitioner
RDD1 is cached.
In my other projects most of the shuffling behaviour seems to be consistent with this assumption. So yesterday I wrote a short Scala program to prove it once and for all:
// sc is the SparkContext
val rdd1 = sc.parallelize(1 to 10, 4).map(v => v->v)
  .partitionBy(new HashPartitioner(4))
rdd1.persist().count()
val rdd2 = sc.parallelize(1 to 10, 4).map(v => (11-v)->v)
val cogrouped = rdd1.cogroup(rdd2).map {
  v =>
    v._2._1.head -> v._2._2.head
}
val zipped = cogrouped.zipPartitions(rdd1, rdd2) {
  (itr1, itr2, itr3) =>
    itr1.zipAll(itr2.map(_._2), 0->0, 0).zipAll(itr3.map(_._2), (0->0)->0, 0)
      .map {
        v =>
          (v._1._1._1, v._1._1._2, v._1._2, v._2)
      }
}
zipped.collect().foreach(println)
If rdd1 doesn't move, the first column of zipped should have the same value as the third column. So I ran the program, and oops:
(4,7,4,1)
(8,3,8,2)
(1,10,1,3)
(9,2,5,4)
(5,6,9,5)
(6,5,2,6)
(10,1,6,7)
(2,9,10,0)
(3,8,3,8)
(7,4,7,9)
(0,0,0,10)
The assumption is not true. Spark probably did some internal optimisation and decided that regenerating rdd1's partitions is much faster than keeping them in cache.
So the question is: if my programmatic requirement to not move RDD1 (and keep it cached) exists for reasons other than speed (e.g. resource locality), or if on some occasions Spark's internal optimisation is not preferable, is there a way to explicitly instruct the framework to not move an operand in all cogroup-like operations? This also includes join, outer join, and groupWith.
Thanks a lot for your help. So far I'm using broadcast join as a not-so-scalable makeshift solution, and it is not going to last long before crashing my cluster. I'm expecting a solution consistent with distributed computing principles.
If rdd1 doesn't move the first column of zipped should have the same value as the third column
This assumption is just incorrect. Creating a CoGroupedRDD is not only about the shuffle, but also about generating the internal structures required for matching corresponding records. Internally Spark uses its own ExternalAppendOnlyMap, which relies on a custom open hash table implementation (AppendOnlyMap) that doesn't provide any ordering guarantees.
If you check debug string:
zipped.toDebugString
(4) ZippedPartitionsRDD3[8] at zipPartitions at <console>:36 []
| MapPartitionsRDD[7] at map at <console>:31 []
| MapPartitionsRDD[6] at cogroup at <console>:31 []
| CoGroupedRDD[5] at cogroup at <console>:31 []
| ShuffledRDD[2] at partitionBy at <console>:27 []
| CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
+-(4) MapPartitionsRDD[1] at map at <console>:26 []
| ParallelCollectionRDD[0] at parallelize at <console>:26 []
+-(4) MapPartitionsRDD[4] at map at <console>:29 []
| ParallelCollectionRDD[3] at parallelize at <console>:29 []
| ShuffledRDD[2] at partitionBy at <console>:27 []
| CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
+-(4) MapPartitionsRDD[1]...
you'll see that Spark indeed uses CachedPartitions to compute the zipped RDD. If you also skip the map transformation, which removes the partitioner, you'll see that cogroup reuses the partitioner provided by rdd1:
rdd1.cogroup(rdd2).partitioner == rdd1.partitioner
Boolean = true
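The output in the question is consistent with this: within each row the first and third columns differ only in order, never in partition. The sketch below checks that with a re-implementation of hash partitioning (non-negative hashCode modulo the number of partitions). It is an illustrative stand-in for HashPartitioner, with hypothetical names, applied to the first and third columns copied from the printed result (the 0 padding row is left out).

```scala
object HashPartitionSketch {
  // Modulo that never returns a negative index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Stand-in for a hash partitioner: null keys go to partition 0.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ => nonNegativeMod(key.hashCode, numPartitions)
  }

  // Columns 1 and 3 of the zipped output, in printed order.
  val firstColumn = Seq(4, 8, 1, 9, 5, 6, 10, 2, 3, 7)
  val thirdColumn = Seq(4, 8, 1, 5, 9, 2, 6, 10, 3, 7)

  // Row by row, both keys map to the same partition out of 4.
  val samePartitionRowByRow: Boolean =
    firstColumn.zip(thirdColumn).forall { case (a, b) =>
      getPartition(a, 4) == getPartition(b, 4)
    }
}
```

So the keys did stay in their partitions; only the iteration order inside each partition changed, which zipPartitions then exposes.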

How to generate key-value format using Scala in Spark

I am studying Spark on VirtualBox. I use ./bin/spark-shell to open Spark and use Scala. Now I am confused about the key-value format in Scala.
I have a txt file in home/feng/spark/data, which looks like:
panda 0
pink 3
pirate 3
panda 1
pink 4
I use sc.textFile to read this txt file. If I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
then I can use rdd.collect() to show the rdd on the screen:
scala> rdd.collect()
res26: Array[String] = Array(panda 0, pink 3, pirate 3, panda 1, pink 4)
However, if I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7.txt")
(the file name itself has no ".txt" here), then rdd.collect() gives an error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/feng/spark/A.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
......
But I saw other examples, and all of them have ".txt" at the end. Is there something wrong with my code or my system?
Another thing is that when I tried:
scala> val rddd = rdd.map(x => (x.split(" ")(0),x))
rddd: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29
scala> rddd.collect()
res0: Array[(String, String)] = Array((panda,panda 0), (pink,pink 3), (pirate,pirate 3), (panda,panda 1), (pink,pink 4))
I intended to select the first column of the data and use it as the key. But the result of rddd.collect() does not look right, because each word occurs twice. I cannot go on with the remaining operations like mapbykey, reducebykey or others. What did I do wrong?
Just as an example, I create a String with your dataset, split it by line, and use SparkContext's parallelize method to create an RDD. Notice that after creating the RDD I use its map method to split the String stored in each record and convert it to a Row.
import org.apache.spark.sql.Row
val text = "panda 0\npink 3\npirate 3\npanda 1\npink 4"
val rdd = sc.parallelize(text.split("\n")).map(x => Row(x.split(" "):_*))
rdd.take(3)
The output from the take method is:
res4: Array[org.apache.spark.sql.Row] = Array([panda,0], [pink,3], [pirate,3])
About your first question: there is no need for files to have any extension, because in this case files are read as plain text.
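For the key-value part of the question, one option is to split each line once and keep the first field as the key and the second as the value. This is a minimal sketch with a hypothetical helper name (toPair); it assumes every line has exactly two whitespace-separated fields and that the second one is an integer.

```scala
object KeyValueSketch {
  // "panda 0" becomes ("panda", 0); the key is not repeated inside the value.
  def toPair(line: String): (String, Int) = {
    val fields = line.trim.split("\\s+")
    (fields(0), fields(1).toInt)
  }

  // Local stand-in for the file contents shown in the question.
  val lines = Seq("panda 0", "pink 3", "pirate 3", "panda 1", "pink 4")

  // Per-key sums on a plain collection, mirroring what reduceByKey(_ + _) would produce.
  val sums: Map[String, Int] =
    lines.map(toPair).groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
}
```

In the shell the same helper could be used as rdd.map(KeyValueSketch.toPair).reduceByKey(_ + _), since an RDD of 2-tuples gets the pair-RDD operations.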

Spark processing columns in parallel

I've been playing with Spark, and I managed to get it to crunch my data. My data consists of a flat delimited text file with 50 columns and about 20 million rows. I have Scala scripts that process each column.
In terms of parallel processing, I know that RDD operations run on multiple nodes. So every time I process a column, its rows are processed in parallel, but the columns themselves are processed sequentially.
A simple example: if my data is a 5-column delimited text file, each column contains text, and I want to do a word count for each column, I would do:
for (i <- 0 until 5) {
  data.map(_.split("\t", -1)(i)).map((_, 1)).reduceByKey(_ + _).collect()
}
Although each column's operation is run in parallel, the columns themselves are processed sequentially (bad wording, I know. Sorry!). In other words, column 2 is processed after column 1 is done, column 3 is processed after columns 1 and 2 are done, and so on.
My question is: is there any way to process multiple columns at a time? If you know a way, or a tutorial, would you mind sharing it with me?
Thank you!
Suppose the inputs are Seqs. The following can be done to process columns concurrently. The basic idea is to use the pair (column, input) as the key.
scala> val rdd = sc.parallelize((1 to 4).map(x=>Seq("x_0", "x_1", "x_2", "x_3")))
rdd: org.apache.spark.rdd.RDD[Seq[String]] = ParallelCollectionRDD[26] at parallelize at <console>:12
scala> val rdd1 = rdd.flatMap{x=>{(0 to x.size - 1).map(idx=>(idx, x(idx)))}}
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = FlatMappedRDD[27] at flatMap at <console>:14
scala> val rdd2 = rdd1.map(x=>(x, 1))
rdd2: org.apache.spark.rdd.RDD[((Int, String), Int)] = MappedRDD[28] at map at <console>:16
scala> val rdd3 = rdd2.reduceByKey(_+_)
rdd3: org.apache.spark.rdd.RDD[((Int, String), Int)] = ShuffledRDD[29] at reduceByKey at <console>:18
scala> rdd3.take(4)
res22: Array[((Int, String), Int)] = Array(((0,x_0),4), ((3,x_3),4), ((2,x_2),4), ((1,x_1),4))
In the example output, ((0, x_0), 4) means that in the first column the value x_0 occurs 4 times. You can start from here to process further.
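The same idea carries over to the tab-delimited lines from the question. A minimal sketch, assuming each record is one String with tab-separated fields; the helper name columnValuePairs is hypothetical.

```scala
object ColumnKeySketch {
  // Emits ((columnIndex, value), 1) for every field of one line, so a single
  // reduceByKey can count all columns in one pass.
  def columnValuePairs(line: String): Seq[((Int, String), Int)] =
    line.split("\t", -1).toSeq.zipWithIndex.map { case (value, idx) => ((idx, value), 1) }
}
```

With an RDD[String] called data, as in the question, this would be used as data.flatMap(ColumnKeySketch.columnValuePairs).reduceByKey(_ + _). The -1 limit keeps trailing empty fields, so every line yields the same number of columns.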
You can try the following code, which uses Scala's parallel collections feature:
(0 until 5).map(index => (index, data)).par.map(x => {
  x._2.map(_.split("\t", -1)(x._1)).map((_, 1)).reduceByKey(_ + _).collect()
})
data is only a reference, so duplicating it does not cost much. An RDD is read-only, so processing it in parallel works. The par method uses Scala's parallel collections. You can check the parallel jobs in the Spark web UI.
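An alternative to .par, not used in the answer above, is to submit one task per column with scala.concurrent.Future. The sketch below shows only the submission pattern: a local Seq[String] stands in for the RDD, the object and method names are hypothetical, and the global execution context is an assumption.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ConcurrentColumns {
  // Word count for one column of tab-delimited rows.
  def wordCount(rows: Seq[String], column: Int): Map[String, Int] =
    rows.map(_.split("\t", -1)(column))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Starts one Future per column and waits for all of them.
  def countAll(rows: Seq[String], numColumns: Int): Seq[Map[String, Int]] = {
    val futures = (0 until numColumns).map(i => Future(wordCount(rows, i)))
    Await.result(Future.sequence(futures), 1.minute)
  }
}
```

With a real RDD, each Future body would instead end in an action such as reduceByKey(_ + _).collect(); Spark accepts jobs submitted from several threads of the same application, so the per-column jobs can overlap.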