Optimal number of partitions in a grouped PairRDD in Spark - scala

I have two pair RDDs with the structure RDD[String, Int], called rdd1 and rdd2.
Each of these RDDs is groupped by its key, and I want to execute a function over its values (so I will use mapValues method).
Does the method "GroupByKey" creates a new partition for each key or have I to specify this manually using "partitionBy"?
I understand that the partitions of a RDD won't change if I don't perform operations that change the key, so if I perform a mapValues operation on each RDD or if I perform a join operation between the previous two RDDs, the partitions of the resulting RDD won't change. Is it true?
Here we have a code example. Notice that "function" is not defined because it is not important here.
val lvl1rdd=rdd1.groupByKey()
val lvl2rdd=rdd2.groupByKey()
val lvl1_lvl2=lvl1rdd.join(lvl2rdd)
val finalrdd=lvl1_lvl2.mapValues(value => function(value))
If I join the previous RDDs and I execute a function over the values of the resulting RDD (mapValues), all the work is being done in a single worker instead of distributing the different tasks over the different workers nodes of the cluster. I mean, the desired behaviour should be to execute, in parallel, the function passed as a parameter to the mapValues method in so many nodes as the cluster allows us.

1) Avoid groupByKey operations as they act as bottleneck for network I/O and execution performance.
Prefer reduceByKey Operation in this case as the data shuffle is comparatively less than groupByKey and we can witness the difference much better if it is a larger Dataset.
val lvl1rdd = rdd1.reduceByKey(x => function(x))
val lvl1rdd = rdd2.reduceByKey(x => function(x))
//perform the Join Operation on these resultant RDD's
Application of function on RDD's seperately and joining them is far better than joining RDD's and applying a function using groupByKey()
This will also ensure the tasks get distributed among different executors and execute in parallel
Refer this link
2) The underlying partitioning technique is Hash partitioner. If we assume that our data is located in n number of partitions initially then groupByKey Operation will follow Hash mechanism.
partition = key.hashCode() % numPartitions
This will create fixed number of partitions which can be more than intial number when you use the groupByKey Operation.we can also customize the partitions to be made. For example
val result_rdd = rdd1.partitionBy(new HashPartitioner(2))
This will create 2 partitions and in this way we can set the number of partitions.
For deciding the optimal number of partitions refer this answer https://stackoverflow.com/a/40866286/7449292

Related

Memory efficient way to repartition a large dataset by key and applying a function separately for each group batch-by-batch

I have a large spark scala Dataset with a "groupName" column. Data records are spread along different partitions. I want to group records together by "groupName", collect batch-by-batch and apply a function on entire batch.
By "batch" I mean a predefined number of records (let's call it maxBatchCount) of the same group. By "batch-by-batch" I mean I want to use memory efficiently and not collect all partition to memory.
To be more specific, the batch function includes serialization, compression and encryption of the entire batch. This is later transformed into another dataset to be written to hdfs using partitionBy("groupName"). Therefore I can't avoid a full shuffling.
Is there a simple way for doing this? I made some attempt described below but TL/DR it seemed a bit over complicated and it eventually failed on Java memory issues.
Details
I tried to use a combination of repartition("groupName"), mapPartitions and Iterator's grouped(maxBatchCount) method which seemed very fit to the task. However, the repartitioning only makes sure records of the same groupName will be in the same partition, but a single partition might have records from several different groupName (if #groups > #partitions) and they can be scattered around inside the partition. So now I still need to do some grouping inside each partition first. The problem is that from mapPartition I get an Iterator which doesn't seem to have such API and I don't want to collect all data to memory.
Then I tried to enhance the above solution with Iterator's partition method. The idea is to first iterate the complete partition for building a Set of all the present groups and then use Iterator.partition to build a separate iterator for each of the present groups. And then use grouped as before.
It goes something like this - for illustration I used a simple case class of two Ints, and groupName is actually mod3 column, created by applying modulo 3 function for each number in the Range:
case class Mod3(number: Int, mod3: Int)
val maxBatchCount = 5
val df = spark.sparkContext.parallelize(Range(1,21))
.toDF("number").withColumn("mod3", col("number") % 3)
// here I choose #partitions < #groups for illustration
val dff = df.repartition(1, col("mod3"))
val dsArr = dff.as[Mod3].mapPartitions(partitionIt => {
// we'll need 2 iterations
val (it1, it2) = partitionIt.duplicate
// first iterate to create a Set of all present groups
val mod3set = it1.map(_.mod3).toSet
// build partitioned iterators map (one for each group present)
var it: Iterator[Mod3] = it2 // init var
val itMap = mod3set.map(mod3val => {
val (filteredIt, residueIt) = it.partition(_.mod3 == mod3val)
val pair = (mod3val -> filteredIt)
it = residueIt
pair
}).toMap
mod3set.flatMap(mod3val => {
itMap(mod3val).grouped(maxBatchCount).map(grp => {
val batch = grp.toList
batch.map(_.number).toArray[Int] // imagine some other batch function
})
}).toIterator
}).as[Array[Int]]
val dsArrCollect = dsArr.collect
dsArrCollect.map(_.toList).foreach(println)
This seemed to work nicely when testing with small data, but when running with actual data (on an actual spark cluster with 20 executors, 2 cores each) I received java.lang.OutOfMemoryError: GC overhead limit exceeded
Note in my actual data groups sizes are highly skewed and one of the groups is about the size of all the rest of the groups combined (I guess the GC memory issue is related to that group). Because of this I also tried to combine a secondary neutral column in repartition but it didn't help.
Will appreciate any pointers here,
Thanks!
I think you have the right approach with the repartition + map partitions.
The problem is that your map partition function ends up loading the entire partitions in memory.
First solution could be to increase the number of partitions and thus reduce the number of groups/ data in a partitions.
Another solution would be to use partitionIt.flatMap and process 1 record at time , accumulating only at most 1 group data
Use sortWithinPartitions so that records from the same group are consecutive
in the flatMap function, accumulate your data and keep track of group changes.

Why does RDD.foreach fail with "SparkException: This RDD lacks a SparkContext"?

I have a dataset (as an RDD) that I divide into 4 RDDs by using different filter operators.
val RSet = datasetRdd.
flatMap(x => RSetForAttr(x, alLevel, hieDict)).
map(x => (x, 1)).
reduceByKey((x, y) => x + y)
val Rp:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rp"))
val Rc:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rc"))
val RpSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RpSv"))
val RcSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RcSv"))
I sent Rp and RpSV to the following function calculateEntropy:
def calculateEntropy(Rx: RDD[(String, Int)], RxSv: RDD[(String, Int)]): Map[Int, Map[String, Double]] = {
RxSv.foreach{item => {
val string = item._1.split(",")
val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
.
.
}
}
I have two questions:
1- When I loop operation on RxSv as:
RxSv.foreach{item=> { ... }}
it collects all items of the partitions, but I want to only a partition where i am in. If you said that user map function but I don't change anything on RDD.
So when I run the code on a cluster with 4 workers and a driver the dataset is divided into 4 partitions and each worker runs the code. But for example i use foreach loop as i specified in the code. Driver collects all data from workers.
2- I have encountered with a problem on this code
val t = Rx.filter(x => x._1.split(",")(2).equals(abc(2)))
The error :
org.apache.spark.SparkException: This RDD lacks a SparkContext.
It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations;
for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
First of all, I'd highly recommend caching the first RDD using cache operator.
RSet.cache
That will avoid scanning and transforming your dataset every time you filter for the other RDDs: Rp, Rc, RpSv and RcSv.
Quoting the scaladoc of cache:
cache() Persist this RDD with the default storage level (MEMORY_ONLY).
Performance should increase.
Secondly, I'd be very careful using the term "partition" to refer to a filtered RDD since the term has a special meaning in Spark.
Partitions say how many tasks Spark executes for an action. They are hints for Spark so you, a Spark developer, could fine-tune your distributed pipeline.
The pipeline is distributed across cluster nodes with one or many Spark executors per the partitioning scheme. If you decide to have a one partition in a RDD, once you execute an action on that RDD, you'll have one task on one executor.
The filter transformation does not change the number of partitions (in other words, it preserves partitioning). The number of partitions, i.e. the number of tasks, is exactly the number of partitions of RSet.
1- When I loop operation on RxSv it collects all items of the partitions, but I want to only a partition where i am in
You are. Don't worry about it as Spark will execute the task on executors where the data lives. foreach is an action that does not collect items but describes a computation that runs on executors with the data distributed across the cluster (as partitions).
If you want to process all items at once per partition use foreachPartition:
foreachPartition Applies a function f to each partition of this RDD.
2- I have encountered with a problem on this code
In the following lines of the code:
RxSv.foreach{item => {
val string = item._1.split(",")
val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
you are executing foreach action that in turn uses Rx which is RDD[(String, Int)]. This is not allowed (and if it were possible should not have been compiled).
The reason for the behaviour is that an RDD is a data structure that just describes what happens with the dataset when an action is executed and lives on the driver (the orchestrator). The driver uses the data structure to track the data sources, transformations and the number of partitions.
A RDD as an entity is gone (= disappears) when the driver spawns tasks on executors.
And when the tasks run nothing is available to help them to know how to run RDDs that are part of their work. And hence the error. Spark is very cautious about it and checks such anomalies before they could cause issues after tasks are executed.

How to `reduce` only within partitions in Spark Streaming, perhaps using combineByKey?

I have data already sorted by key into my Spark Streaming partitions by virtue of Kafka, i.e. keys found on one node are not found on any other nodes.
I would like to use redis and its incrby (increment by) command as a state engine and to reduce the number of requests sent to redis, I would like to partially reduce my data by doing a word count on each worker node by itself. (The key is tag+timestamp to obtain my functionality from word count).
I would like to avoid shuffling and let redis take care of adding data across worker nodes.
Even when I have checked that data is cleanly split among worker nodes, .reduce(_ + _) (Scala syntax) takes a long time (several seconds vs. sub-second for map tasks), as the HashPartitioner seems to shuffle my data to a random node to add it there.
How can I write a simple word count reduce on each partitioner without triggering the shuffling step in Scala with Spark Streaming?
Note DStream objects lack some RDD methods, which are available only through the transform method.
It seems I might be able to use combineByKey. I would like to skip the mergeCombiners() step and instead leave accumulated tuples where they are.
The book "Learning Spark" enigmatically says:
We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it. For example, groupByKey() disables map-side aggregation as the aggregation function (appending to a list) does not save any space. If we want to disable map-side combines, we need to specify the partitioner; for now you can just use the partitioner on the source RDD by passing rdd.partitioner.
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
The book then continues to supply no syntax for how to do this, nor have I had any luck with google so far.
What is worse, as far as I know, the partitioner is not set for DStream RDDs in Spark Streaming, so I don't know how to supply a partitioner to combineByKey that doesn't end up shuffling data.
Also, what does "map-side" actually mean and what consequences does mapSideCombine = false have, exactly?
The scala implementation for combineByKey can be found at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
Look for combineByKeyWithClassTag.
If the solution involves a custom partitioner, please include also a code sample for how to apply that partitioner to the incoming DStream.
This can be done using mapPartitions, which takes a function that maps an iterator of the input RDD on one partition to an iterator over the output RDD.
To implement a word count, I map to _._2 to remove the Kafka key and then perform a fast iterator word count using foldLeft, initializing a mutable.hashMap, which then gets converted to an Iterator to form the output RDD.
val myDstream = messages
.mapPartitions( it =>
it.map(_._2)
.foldLeft(new mutable.HashMap[String, Int])(
(count, key) => count += (key -> (count.getOrElse(key, 0) + 1))
).toIterator
)

How to partition a RDD

I have a text file consisting of a large number of random floating values separated by spaces.
I am loading this file into a RDD in scala.
How does this RDD get partitioned?
Also, is there any method to generate custom partitions such that all partitions have equal number of elements along with an index for each partition?
val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
keyval=dRDD.map(x =>process(x.trim().split(' ').map(_.toDouble),query_norm,m,r))
Here I am loading multiple text files from HDFS and process is a function I am calling.
Can I have a solution with mapPartitonsWithIndex along with how can I access that index inside the process function? Map shuffles the partitions.
How does an RDD gets partitioned?
By default a partition is created for each HDFS partition, which by default is 64MB. Read more here.
How to balance my data across partitions?
First, take a look at the three ways one can repartition his data:
1) Pass a second parameter, the desired minimum number of partitions
for your RDD, into textFile(), but be careful:
In [14]: lines = sc.textFile("data")
In [15]: lines.getNumPartitions()
Out[15]: 1000
In [16]: lines = sc.textFile("data", 500)
In [17]: lines.getNumPartitions()
Out[17]: 1434
In [18]: lines = sc.textFile("data", 5000)
In [19]: lines.getNumPartitions()
Out[19]: 5926
As you can see, [16] doesn't do what one would expect, since the number of partitions the RDD has, is already greater than the minimum number of partitions we request.
2) Use repartition(), like this:
In [22]: lines = lines.repartition(10)
In [23]: lines.getNumPartitions()
Out[23]: 10
Warning: This will invoke a shuffle and should be used when you want to increase the number of partitions your RDD has.
From the docs:
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
3) Use coalesce(), like this:
In [25]: lines = lines.coalesce(2)
In [26]: lines.getNumPartitions()
Out[26]: 2
Here, Spark knows that you will shrink the RDD and gets advantage of it. Read more about repartition() vs coalesce().
But will all this guarantee that your data will be perfectly balanced across your partitions? Not really, as I experienced in How to balance my data across the partitions?
The loaded rdd is partitioned by default partitioner: hash code. To specify custom partitioner, use can check rdd.partitionBy(), provided with your own partitioner.
I don't think it's ok to use coalesce() here, as by api docs, coalesce() can only be used when we reduce number of partitions, and even we can't specify a custom partitioner with coalesce().
You can generate custom partitions using the coalesce function:
coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]

Does groupByKey in Spark preserve the original order?

In Spark, the groupByKey function transforms a (K,V) pair RDD into a (K,Iterable<V>) pair RDD.
Yet, is this function stable? i.e is the order in the iterable preserved from the original order?
For example, if I originally read a file of the form:
K1;V11
K2;V21
K1;V12
May my iterable for K1 be like (V12, V11) (thus not preserving the original order) or can it only be (V11, V12) (thus preserving the original order)?
No, the order is not preserved. Example in spark-shell:
scala> sc.parallelize(Seq(0->1, 0->2), 2).groupByKey.collect
res0: Array[(Int, Iterable[Int])] = Array((0,ArrayBuffer(2, 1)))
The order is timing dependent, so it can vary between runs. (I got the opposite order on my next run.)
What is happening here? groupByKey works by repartitioning the RDD with a HashPartitioner, so that all values for a key end in up in the same partition. Then it performs the aggregation locally on each partition.
The repartitioning is also called a "shuffle", because the lines of the RDD are redistributed between nodes. The shuffle files are pulled from the other nodes in parallel. The new partition is built from these pieces in the order that they arrive. The data from the slowest source will be at the end of the new partition, and at the end of the list in groupByKey.
(Data pulled from the worker itself is of course fastest. Since there is no network transfer involved here, this data is pulled synchronously, and thus arrives in order. (It seems to, at least.) So to replicate my experiment you need at least 2 Spark workers.)
Source: http://apache-spark-user-list.1001560.n3.nabble.com/Is-shuffle-quot-stable-quot-td7628.html
Spark (and other map reduce frameworks) sort data by partitioning , and then merging. Since a merge sort is a stable operation I would guess that the result is stable. After looking more into the source I found that if spark.shuffle.spill is true it uses an external sort , merge sort in this case, which is stable. I'm not 100% sure what it does if it's allowed to spill to disk.
From source:
private val externalSorting = SparkEnv.get.conf.getBoolean("spark.shuffle.spill", true)
Partitioning is also a stable operation because it does no reordering