Memory efficient way to repartition a large dataset by key and applying a function separately for each group batch-by-batch - scala

I have a large spark scala Dataset with a "groupName" column. Data records are spread along different partitions. I want to group records together by "groupName", collect batch-by-batch and apply a function on entire batch.
By "batch" I mean a predefined number of records (let's call it maxBatchCount) of the same group. By "batch-by-batch" I mean I want to use memory efficiently and not collect all partition to memory.
To be more specific, the batch function includes serialization, compression and encryption of the entire batch. This is later transformed into another dataset to be written to hdfs using partitionBy("groupName"). Therefore I can't avoid a full shuffling.
Is there a simple way for doing this? I made some attempt described below but TL/DR it seemed a bit over complicated and it eventually failed on Java memory issues.
Details
I tried to use a combination of repartition("groupName"), mapPartitions and Iterator's grouped(maxBatchCount) method which seemed very fit to the task. However, the repartitioning only makes sure records of the same groupName will be in the same partition, but a single partition might have records from several different groupName (if #groups > #partitions) and they can be scattered around inside the partition. So now I still need to do some grouping inside each partition first. The problem is that from mapPartition I get an Iterator which doesn't seem to have such API and I don't want to collect all data to memory.
Then I tried to enhance the above solution with Iterator's partition method. The idea is to first iterate the complete partition for building a Set of all the present groups and then use Iterator.partition to build a separate iterator for each of the present groups. And then use grouped as before.
It goes something like this - for illustration I used a simple case class of two Ints, and groupName is actually mod3 column, created by applying modulo 3 function for each number in the Range:
case class Mod3(number: Int, mod3: Int)
val maxBatchCount = 5
val df = spark.sparkContext.parallelize(Range(1,21))
.toDF("number").withColumn("mod3", col("number") % 3)
// here I choose #partitions < #groups for illustration
val dff = df.repartition(1, col("mod3"))
val dsArr = dff.as[Mod3].mapPartitions(partitionIt => {
// we'll need 2 iterations
val (it1, it2) = partitionIt.duplicate
// first iterate to create a Set of all present groups
val mod3set = it1.map(_.mod3).toSet
// build partitioned iterators map (one for each group present)
var it: Iterator[Mod3] = it2 // init var
val itMap = mod3set.map(mod3val => {
val (filteredIt, residueIt) = it.partition(_.mod3 == mod3val)
val pair = (mod3val -> filteredIt)
it = residueIt
pair
}).toMap
mod3set.flatMap(mod3val => {
itMap(mod3val).grouped(maxBatchCount).map(grp => {
val batch = grp.toList
batch.map(_.number).toArray[Int] // imagine some other batch function
})
}).toIterator
}).as[Array[Int]]
val dsArrCollect = dsArr.collect
dsArrCollect.map(_.toList).foreach(println)
This seemed to work nicely when testing with small data, but when running with actual data (on an actual spark cluster with 20 executors, 2 cores each) I received java.lang.OutOfMemoryError: GC overhead limit exceeded
Note in my actual data groups sizes are highly skewed and one of the groups is about the size of all the rest of the groups combined (I guess the GC memory issue is related to that group). Because of this I also tried to combine a secondary neutral column in repartition but it didn't help.
Will appreciate any pointers here,
Thanks!

I think you have the right approach with the repartition + map partitions.
The problem is that your map partition function ends up loading the entire partitions in memory.
First solution could be to increase the number of partitions and thus reduce the number of groups/ data in a partitions.
Another solution would be to use partitionIt.flatMap and process 1 record at time , accumulating only at most 1 group data
Use sortWithinPartitions so that records from the same group are consecutive
in the flatMap function, accumulate your data and keep track of group changes.

Related

Parallelism in Cassandra read using Scala

I am trying to invoke parallel reading from Cassandra table using spark. But I am not able to invoke parallelism as only one reads is happening any given time. What approach should be followed to achieve the same?
I'd recommend you go with below approach source Russell Spitzer's Blog
Manually dividing our partitions using a Union of partial scans :
Pushing the task to the end-user is also a possibility (and the current workaround.) Most end users already understand why they have long partitions and know in general the domain their column values fall in. This makes it possible for them to manually divide up a request so that it chops up large partitions.
For example, assuming the user knows clustering column c spans from 1 to 1000000. They could write code like
val minRange = 0
val maxRange = 1000000
val numSplits = 10
val subSize = (maxRange - minRange) / numSplits
sc.union(
(minRange to maxRange by subSize)
.map(start =>
sc.cassandraTable("ks", "tab")
.where("c > $start and c < ${start + subSize}"))
)
Each RDD would contain a unique set of tasks drawing only portions of full partitions. The union operation joins all those disparate tasks into a single RDD. The maximum number of rows any single Spark Partition would draw from a single Cassandra partition would be limited to maxRange/ numSplits. This approach, while requiring user intervention, would preserve locality and would still minimize the jumps between disk sectors.
Also read-tuning-parameters

Optimal number of partitions in a grouped PairRDD in Spark

I have two pair RDDs with the structure RDD[String, Int], called rdd1 and rdd2.
Each of these RDDs is groupped by its key, and I want to execute a function over its values (so I will use mapValues method).
Does the method "GroupByKey" creates a new partition for each key or have I to specify this manually using "partitionBy"?
I understand that the partitions of a RDD won't change if I don't perform operations that change the key, so if I perform a mapValues operation on each RDD or if I perform a join operation between the previous two RDDs, the partitions of the resulting RDD won't change. Is it true?
Here we have a code example. Notice that "function" is not defined because it is not important here.
val lvl1rdd=rdd1.groupByKey()
val lvl2rdd=rdd2.groupByKey()
val lvl1_lvl2=lvl1rdd.join(lvl2rdd)
val finalrdd=lvl1_lvl2.mapValues(value => function(value))
If I join the previous RDDs and I execute a function over the values of the resulting RDD (mapValues), all the work is being done in a single worker instead of distributing the different tasks over the different workers nodes of the cluster. I mean, the desired behaviour should be to execute, in parallel, the function passed as a parameter to the mapValues method in so many nodes as the cluster allows us.
1) Avoid groupByKey operations as they act as bottleneck for network I/O and execution performance.
Prefer reduceByKey Operation in this case as the data shuffle is comparatively less than groupByKey and we can witness the difference much better if it is a larger Dataset.
val lvl1rdd = rdd1.reduceByKey(x => function(x))
val lvl1rdd = rdd2.reduceByKey(x => function(x))
//perform the Join Operation on these resultant RDD's
Application of function on RDD's seperately and joining them is far better than joining RDD's and applying a function using groupByKey()
This will also ensure the tasks get distributed among different executors and execute in parallel
Refer this link
2) The underlying partitioning technique is Hash partitioner. If we assume that our data is located in n number of partitions initially then groupByKey Operation will follow Hash mechanism.
partition = key.hashCode() % numPartitions
This will create fixed number of partitions which can be more than intial number when you use the groupByKey Operation.we can also customize the partitions to be made. For example
val result_rdd = rdd1.partitionBy(new HashPartitioner(2))
This will create 2 partitions and in this way we can set the number of partitions.
For deciding the optimal number of partitions refer this answer https://stackoverflow.com/a/40866286/7449292

Send object to specific partition with Spark

Suppose I have a RDD with nPartitions partitions, and I'm using the mapPartitionsWithIndex method, while also keeping on the driver an array x of dimension nPartitions.
Now suppose I would like to ship x(i) to partition i so that it may work on it, a naïve way to do so would be to just call x(i) in the closure, as in the following toy example :
val sc = new SparkContext()
val rdd = sc.parallelize(1 to 1000).repartition(10)
val nPartitions = rdd.partitions.length
val myArray = Array.fill(nPartitions)(math.random) //array to be shipped to executors
val result = rdd.mapPartitionsWithIndex((index,data) =>
Seq(data.map(_ * myArray(index)).sum).iterator
)
(Ignore the logic within mapPartitionsWithIndex, only the myArray(index) is what interests us.
However if my understanding is correct, this will ship the entire array myArray to all executors, as the array is in the closure. Now if we suppose the array contains large objects which may take up too much memory / serialization time, this becomes a problem.
Is there a way to avoid this, and to ship only the components of the array corresponding to the partitions within a given executor ?
This is a case of premature optimization. Sending an array as big as the number of partitions is not going to save you much vs sending just the value for the partition, if at all possible.
However, instead of sending the array as a closure, you should send the array as a
broadcast variable: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
The main difference is that the closure is serialized and sent out for each task, while, from the doc page "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks".
Not exactly sending large objects to partitions, but an inverted approach would be to use mapPartition in conjunction with partitioning by columns. Namely, using mapPartition in this fashion would be pulling in the large object on a per partition level vs. on a per row level.

Spark RDD's - how do they work

I have a small Scala program that runs fine on a single-node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how the RDDs work in Spark so this question is based around theory and may not be 100% correct.
Let's say I create an RDD:
val rdd = sc.textFile(file)
Now once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?
Secondly, I want to count the number of objects in the RDD (simple enough), however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
Let's say there are 100 objects in rdd, and say there are 10 nodes, thus a count of 10 objects per node (assuming this is how the RDD concept works), now when I call the method is each node going to perform the calculation with rdd.size as 10 or 100? Because, overall, the RDD is size 100 but locally on each node it is only 10. Am I required to make a broadcast variable prior to doing the calculation? This question is linked to the question below.
Finally, if I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
val rdd = sc.textFile(file)
Does that mean that the file is now partitioned across the nodes?
The file remains wherever it was. The elements of the resulting RDD[String] are the lines of the file. The RDD is partitioned to match the natural partitioning of the underlying file system. The number of partitions does not depend on the number of nodes you have.
It is important to understand that when this line is executed it does not read the file(s). The RDD is a lazy object and will only do something when it must. This is great because it avoids unnecessary memory usage.
For example, if you write val errors = rdd.filter(line => line.startsWith("error")), still nothing happens. If you then write val errorCount = errors.count now your sequence of operations will need to be executed because the result of count is an integer. What each worker core (executor thread) will do in parallel then, is read a file (or piece of file), iterate through its lines, and count the lines starting with "error". Buffering and GC aside, only a single line per core will be in memory at a time. This makes it possible to work with very large data without using a lot of memory.
I want to count the number of objects in the RDD, however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
There is no rdd.size method. There is rdd.count, which counts the number of elements in the RDD. rdd.map(x => x / rdd.count) will not work. The code will try to send the rdd variable to all workers and will fail with a NotSerializableException. What you can do is:
val count = rdd.count
val normalized = rdd.map(x => x / count)
This works, because count is an Int and can be serialized.
If I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
map does not change the number of elements. I don't know what you mean by "size". But yes, you need to perform an action, such as count to get anything out of the RDD. You see, no work at all is performed until you perform an action. (When you perform count, only the per-partition count will be sent back to the driver, of course, not "all the information".)
Usually, the file (or parts of the file, if it's too big) is replicated to N nodes in the cluster (by default N=3 on HDFS). It's not an intention to split every file between all available nodes.
However, for you (i.e. the client) working with file using Spark should be transparent - you should not see any difference in rdd.size, no matter on how many nodes it's split and/or replicated. There are methods (at least, in Hadoop) to find out on which nodes (parts of the) file can be located at the moment. However, in simple cases you most probably won't need to use this functionality.
UPDATE: an article describing RDD internals: https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf

How to partition a RDD

I have a text file consisting of a large number of random floating values separated by spaces.
I am loading this file into a RDD in scala.
How does this RDD get partitioned?
Also, is there any method to generate custom partitions such that all partitions have equal number of elements along with an index for each partition?
val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
keyval=dRDD.map(x =>process(x.trim().split(' ').map(_.toDouble),query_norm,m,r))
Here I am loading multiple text files from HDFS and process is a function I am calling.
Can I have a solution with mapPartitonsWithIndex along with how can I access that index inside the process function? Map shuffles the partitions.
How does an RDD gets partitioned?
By default a partition is created for each HDFS partition, which by default is 64MB. Read more here.
How to balance my data across partitions?
First, take a look at the three ways one can repartition his data:
1) Pass a second parameter, the desired minimum number of partitions
for your RDD, into textFile(), but be careful:
In [14]: lines = sc.textFile("data")
In [15]: lines.getNumPartitions()
Out[15]: 1000
In [16]: lines = sc.textFile("data", 500)
In [17]: lines.getNumPartitions()
Out[17]: 1434
In [18]: lines = sc.textFile("data", 5000)
In [19]: lines.getNumPartitions()
Out[19]: 5926
As you can see, [16] doesn't do what one would expect, since the number of partitions the RDD has, is already greater than the minimum number of partitions we request.
2) Use repartition(), like this:
In [22]: lines = lines.repartition(10)
In [23]: lines.getNumPartitions()
Out[23]: 10
Warning: This will invoke a shuffle and should be used when you want to increase the number of partitions your RDD has.
From the docs:
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
3) Use coalesce(), like this:
In [25]: lines = lines.coalesce(2)
In [26]: lines.getNumPartitions()
Out[26]: 2
Here, Spark knows that you will shrink the RDD and gets advantage of it. Read more about repartition() vs coalesce().
But will all this guarantee that your data will be perfectly balanced across your partitions? Not really, as I experienced in How to balance my data across the partitions?
The loaded rdd is partitioned by default partitioner: hash code. To specify custom partitioner, use can check rdd.partitionBy(), provided with your own partitioner.
I don't think it's ok to use coalesce() here, as by api docs, coalesce() can only be used when we reduce number of partitions, and even we can't specify a custom partitioner with coalesce().
You can generate custom partitions using the coalesce function:
coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]