Active executors on one spark partition - scala

Is it possible for multiple executors on the same node to work on the same partition, for example during a reduceByKey on Spark 1.6.2?
I have results that I don't understand. After the reduceByKey, when I look at the keys, the same key appears multiple times, as many times as the number of executors per node, I suppose. Moreover, when I kill one of the two slaves, I see the same result.
Each key appears 2 times; I presume this is due to the number of executors per node, which is set to 2 by default.
val rdd = sc.parallelize(1 to 1000).map(x=>(x%5,x))
val rrdd = rdd.reduceByKey(_+_)
And I obtain
rrdd.count = 10
Rather than what I expect, which is
rrdd.count = 5
I tried this
val rdd2 = rdd.partitionBy(new HashPartitioner(8))
val rrdd = rdd2.reduceByKey(_+_)
And this one
val rdd3 = rdd.reduceByKey(new HashPartitioner(8), _+_)
Neither gave me what I want.
Of course I can decrease the number of executors to one, but we would lose efficiency with more than 5 cores per executor.
I tried the code above locally in spark-shell and it works like a charm, but when it runs on a cluster it fails...
I'm now wondering whether a partition that is too big gets divided among other nodes, which could be a good strategy depending on the case, not mine obviously ;)
So I humbly ask for your help to solve this little mystery.

Related

Memory efficient way to repartition a large dataset by key and applying a function separately for each group batch-by-batch

I have a large spark scala Dataset with a "groupName" column. Data records are spread along different partitions. I want to group records together by "groupName", collect batch-by-batch and apply a function on entire batch.
By "batch" I mean a predefined number of records (let's call it maxBatchCount) of the same group. By "batch-by-batch" I mean I want to use memory efficiently and not collect all partition to memory.
To be more specific, the batch function includes serialization, compression and encryption of the entire batch. This is later transformed into another dataset to be written to hdfs using partitionBy("groupName"). Therefore I can't avoid a full shuffling.
Is there a simple way of doing this? I made some attempts, described below, but TL;DR: it seemed a bit overcomplicated and it eventually failed on Java memory issues.
Details
I tried to use a combination of repartition("groupName"), mapPartitions and Iterator's grouped(maxBatchCount) method, which seemed very fit for the task. However, the repartitioning only makes sure that records of the same groupName end up in the same partition; a single partition might still hold records from several different groupNames (if #groups > #partitions), and they can be scattered around inside the partition. So I still need to do some grouping inside each partition first. The problem is that mapPartitions gives me an Iterator, which doesn't seem to have such an API, and I don't want to collect all the data into memory.
Then I tried to enhance the above solution with Iterator's partition method. The idea is to first iterate the complete partition for building a Set of all the present groups and then use Iterator.partition to build a separate iterator for each of the present groups. And then use grouped as before.
It goes something like this - for illustration I used a simple case class of two Ints, and groupName is actually mod3 column, created by applying modulo 3 function for each number in the Range:
import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

case class Mod3(number: Int, mod3: Int)
val maxBatchCount = 5
val df = spark.sparkContext.parallelize(Range(1, 21))
  .toDF("number").withColumn("mod3", col("number") % 3)
// here I choose #partitions < #groups for illustration
val dff = df.repartition(1, col("mod3"))
val dsArr = dff.as[Mod3].mapPartitions(partitionIt => {
  // we'll need 2 iterations
  val (it1, it2) = partitionIt.duplicate
  // first iterate to create a Set of all present groups
  val mod3set = it1.map(_.mod3).toSet
  // build partitioned iterators map (one for each group present)
  var it: Iterator[Mod3] = it2 // init var
  val itMap = mod3set.map(mod3val => {
    val (filteredIt, residueIt) = it.partition(_.mod3 == mod3val)
    val pair = (mod3val -> filteredIt)
    it = residueIt
    pair
  }).toMap
  mod3set.flatMap(mod3val => {
    itMap(mod3val).grouped(maxBatchCount).map(grp => {
      val batch = grp.toList
      batch.map(_.number).toArray[Int] // imagine some other batch function
    })
  }).toIterator
}).as[Array[Int]]
val dsArrCollect = dsArr.collect
dsArrCollect.map(_.toList).foreach(println)
This seemed to work nicely when testing with small data, but when running with actual data (on an actual Spark cluster with 20 executors, 2 cores each) I received java.lang.OutOfMemoryError: GC overhead limit exceeded.
Note that in my actual data the group sizes are highly skewed, and one of the groups is about the size of all the other groups combined (I guess the GC memory issue is related to that group). Because of this I also tried to add a secondary neutral column to the repartition, but it didn't help.
Will appreciate any pointers here,
Thanks!
I think you have the right approach with repartition + mapPartitions.
The problem is that your mapPartitions function ends up loading the entire partition into memory.
A first solution could be to increase the number of partitions and thus reduce the number of groups, and the amount of data, in each partition.
Another solution would be to use partitionIt.flatMap and process 1 record at a time, accumulating at most 1 group's data:
Use sortWithinPartitions so that records from the same group are consecutive.
In the flatMap function, accumulate your data and keep track of group changes.
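A minimal sketch of that second idea, written as a plain Scala iterator so it can be tried without a cluster. The helper name batchByGroup and the process function mentioned in the comments are hypothetical, and the sketch assumes the incoming iterator is already sorted (or at least clustered) by group, which is what sortWithinPartitions would provide. It holds at most one batch of maxBatchCount records in memory at a time.

```scala
// Lazily cuts an iterator that is already clustered by key into batches of
// at most maxBatchCount records, never mixing two keys in one batch.
def batchByGroup[A, K](records: Iterator[A], maxBatchCount: Int)(key: A => K): Iterator[(K, Vector[A])] =
  new Iterator[(K, Vector[A])] {
    private val in = records.buffered // lets us peek at the next record

    def hasNext: Boolean = in.hasNext

    def next(): (K, Vector[A]) = {
      val currentKey = key(in.head)
      val batch = Vector.newBuilder[A]
      var n = 0
      // Stop on a group change or when the batch is full.
      while (in.hasNext && n < maxBatchCount && key(in.head) == currentKey) {
        batch += in.next()
        n += 1
      }
      (currentKey, batch.result())
    }
  }

// Inside Spark this might be used roughly as follows (process is hypothetical):
//   ds.repartition(col("groupName"))
//     .sortWithinPartitions("groupName")
//     .mapPartitions(it => batchByGroup(it, maxBatchCount)(_.groupName)
//       .map { case (_, batch) => process(batch) })
```

Because the result is itself an iterator, Spark can pull one batch at a time instead of materializing the whole partition. A heavily skewed group still lands in one partition, but it is consumed batch by batch.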

Spark application uses only 1 executor

I am running an application with the following code. I don't understand why only 1 executor is in use even though I have 3. When I try to increase the range, my job fails because the task manager loses the executor.
In the summary, I see a value for shuffle writes but shuffle reads are 0 (maybe because all the data is on one node and no shuffle read needs to happen to complete the job).
val rdd: RDD[(Int, Int)] = sc.parallelize((1 to 10000000).map(k => (k -> 1)).toSeq)
val rdd2= rdd.sortByKeyWithPartition(partitioner = partitioner)
val sorted = rdd2.map((_._1))
val count_sorted = sorted.collect()
Edit: I increased the executor and driver memory and cores. I also changed the number of executors to 1 from 4. That seems to have helped. I now see shuffle read/writes on each node.
It looks like your code ends up with only one partition for the RDD. You should increase the number of partitions to at least 3 to utilize all 3 executors.
..maybe cause all the data is on one node
That should make you think that your RDD has only one partition, instead of 3, or more, that would eventually utilize all the executors.
So, extending on Hokam's answer, here's what I would do:
rdd.getNumPartitions
Now if that is 1, then repartition your RDD. Since rdd is declared as a val, bind the result to a new name and use it in the rest of your code:
val repartitioned = rdd.repartition(3)
which will give you an RDD with 3 partitions.
Try executing your code again now.
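As a rough illustration of the check-then-repartition advice, here is a small sketch that assumes a local SparkSession and made-up data; the object name PartitionCountSketch is hypothetical. It also shows the numSlices argument of parallelize, which sets the partition count up front and avoids the extra shuffle that repartition performs.

```scala
import org.apache.spark.sql.SparkSession

object PartitionCountSketch {
  // Returns the partition counts of: a single-partition RDD, the same RDD
  // after repartition(3), and an RDD created with numSlices = 3.
  def partitionCounts(): (Int, Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[3]") // assumption: local mode, for illustration only
      .appName("partition-count-sketch")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      val single = sc.parallelize(1 to 1000, numSlices = 1).map(k => (k, 1))
      val repartitioned = single.repartition(3) // shuffles into 3 partitions
      val sliced = sc.parallelize(1 to 1000, numSlices = 3).map(k => (k, 1))
      (single.getNumPartitions, repartitioned.getNumPartitions, sliced.getNumPartitions)
    } finally spark.stop()
  }
}
```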

Spark Streaming: How to change the value of external variables in foreachRDD function?

The code for testing:
object MaxValue extends Serializable {
  var max = 0
}
object Test {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext
    val ssc = new StreamingContext(sc, Seconds(5))
    val seq = Seq("testData")
    val rdd = ssc.sparkContext.parallelize(seq)
    val inputDStream = new ConstantInputDStream(ssc, rdd)
    inputDStream.foreachRDD(rdd => { MaxValue.max = 10 }) // I change MaxValue.max value to 10.
    val map = inputDStream.map(a => MaxValue.max)
    map.print // Why is the result 0? Why not 10?
    ssc.start
    ssc.awaitTermination
  }
}
In this case, how can I change the value of MaxValue.max in foreachRDD()? The result of map.print is 0; why not 10? I want to use RDD.max() in foreachRDD(), so I need to change the value of MaxValue.max there.
Could you help me? Thank you!
This is not possible. Remember, operations inside of an RDD method are run distributed. So, the change to MaxValue.max will only be executed on the worker, not the driver. Maybe if you say what you are trying to do that can help lead to a better solution, using accumulators maybe?
In general it is better to avoid trying to accumulate values this way; there are other ways, like accumulators or updateStateByKey, that would do this properly.
To give a better perspective of what is happening in your code, let's say you have 1 driver and multiple partitions distributed over multiple executors (the most typical scenario):
Runs on driver
inputDStream.foreachRDD(rdd => { MaxValue.max = 10 })
The block of code within foreachRDD runs on driver, so it updates object MaxValue on the driver
Runs on executors
val map = inputDStream.map(a => MaxValue.max)
This will run the lambda on each executor individually, so it reads the value from MaxValue on the executors (which was never updated there). Also note that each executor has its own version of the MaxValue object, as each of them lives in a separate JVM process (most often on separate nodes within the cluster too).
When you change your code to
val map = inputDStream.map(a => {MaxValue.max=10; MaxValue.max})
you are actually updating MaxValue on the executors and then reading it on the executors as well, so it works.
This should work as well:
val map = inputDStream.map(a => {MaxValue.max=10; a}).map(a => MaxValue.max)
However if you do something like:
val map = inputDStream.map(a => {MaxValue.max= new Random().nextInt(10); a}).map(a => MaxValue.max)
you should get a set of records with 4 different integers (each partition will have a different MaxValue)
Unexpected results
local mode
A good reason to avoid this is that you can get even less predictable results depending on the situation. For example, if you run your original code, which returns 0 on a cluster, it will return 10 in local mode, because in that case the driver and all partitions live in a single JVM process and share this object. So you can even write unit tests for such code and feel safe, but start getting problems once you deploy to a cluster.
Jobs scheduling order
For this one I'm not 100% sure (I'm still trying to find it in the source code), but there is another problem that might occur. In your code you will have 2 jobs:
One is based on the output of inputDStream.foreachRDD, the other on the output of map.print. Although they initially use the same stream, Spark will generate two separate DAGs for them and schedule two separate jobs that it can treat completely independently. In fact, it doesn't even have to guarantee the order of execution of the jobs (it obviously does guarantee the order of stages within a job), so in theory it can run the 2nd job before the 1st, making the results even less predictable.
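For the original goal of using RDD.max() inside foreachRDD, one option that avoids mutating a shared object from executor code is to keep the state on the driver only. The sketch below assumes local mode, a 1-second batch interval and made-up integer data; the object name RunningMaxSketch and the timeout parameter are hypothetical. rdd.max() is an action, so its result comes back to the driver, where the foreachRDD body itself runs.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object RunningMaxSketch {
  // Runs the stream for roughly timeoutMs and returns the largest value seen.
  def run(timeoutMs: Long): Int = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("running-max-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val rdd = ssc.sparkContext.parallelize(Seq(3, 7, 5))
    val input = new ConstantInputDStream(ssc, rdd)

    // Driver-side state: only ever touched inside the foreachRDD body,
    // never inside a map/filter closure that runs on executors.
    val runningMax = new AtomicInteger(Int.MinValue)
    input.foreachRDD { batch =>
      if (!batch.isEmpty()) {
        val batchMax = batch.max() // action: computed on executors, returned to the driver
        runningMax.set(math.max(runningMax.get, batchMax))
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(timeoutMs)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    runningMax.get
  }
}
```

If executor-side code later needs the value, it would have to be shipped explicitly, for example by using it inside transform or a later foreachRDD body, rather than read from a singleton.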

Spark RDD's - how do they work

I have a small Scala program that runs fine on a single-node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how the RDDs work in Spark so this question is based around theory and may not be 100% correct.
Let's say I create an RDD:
val rdd = sc.textFile(file)
Now once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?
Secondly, I want to count the number of objects in the RDD (simple enough), however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
Let's say there are 100 objects in rdd, and say there are 10 nodes, thus a count of 10 objects per node (assuming this is how the RDD concept works), now when I call the method is each node going to perform the calculation with rdd.size as 10 or 100? Because, overall, the RDD is size 100 but locally on each node it is only 10. Am I required to make a broadcast variable prior to doing the calculation? This question is linked to the question below.
Finally, if I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
val rdd = sc.textFile(file)
Does that mean that the file is now partitioned across the nodes?
The file remains wherever it was. The elements of the resulting RDD[String] are the lines of the file. The RDD is partitioned to match the natural partitioning of the underlying file system. The number of partitions does not depend on the number of nodes you have.
It is important to understand that when this line is executed it does not read the file(s). The RDD is a lazy object and will only do something when it must. This is great because it avoids unnecessary memory usage.
For example, if you write val errors = rdd.filter(line => line.startsWith("error")), still nothing happens. If you then write val errorCount = errors.count now your sequence of operations will need to be executed because the result of count is an integer. What each worker core (executor thread) will do in parallel then, is read a file (or piece of file), iterate through its lines, and count the lines starting with "error". Buffering and GC aside, only a single line per core will be in memory at a time. This makes it possible to work with very large data without using a lot of memory.
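The laziness described above can be seen in a small sketch. It assumes a local SparkSession and uses parallelize on an in-memory sequence instead of textFile so that it stays self-contained; the object name LazyEvalSketch is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalSketch {
  def countErrors(lines: Seq[String]): Long = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("lazy-eval-sketch")
      .getOrCreate()
    try {
      val rdd = spark.sparkContext.parallelize(lines)
      // Transformation: only records what to do, nothing is computed yet.
      val errors = rdd.filter(line => line.startsWith("error"))
      // Action: triggers the job; each task streams through its partition.
      errors.count()
    } finally spark.stop()
  }
}
```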
I want to count the number of objects in the RDD, however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
There is no rdd.size method. There is rdd.count, which counts the number of elements in the RDD. rdd.map(x => x / rdd.count) will not work: the closure references the rdd itself, and RDD operations can only be invoked from the driver, not from inside another RDD's transformation, so the job fails at runtime. What you can do is:
val count = rdd.count
val normalized = rdd.map(x => x / count)
This works because count is a plain Long value that can be serialized and shipped with the closure.
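A self-contained version of the same pattern, as a sketch under assumptions: a local SparkSession, numeric input rather than text lines, and a hypothetical object name NormalizeSketch. The global count is computed once by an action and then captured by the map closure as an ordinary value, so every task divides by the total and not by a per-node count.

```scala
import org.apache.spark.sql.SparkSession

object NormalizeSketch {
  def normalize(values: Seq[Double]): Array[Double] = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("normalize-sketch")
      .getOrCreate()
    try {
      val rdd = spark.sparkContext.parallelize(values)
      val count = rdd.count() // Long: total over all partitions
      rdd.map(x => x / count).collect()
    } finally spark.stop()
  }
}
```

No broadcast variable is needed for a single number like this; broadcasting tends to pay off for larger read-only data.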
If I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
map does not change the number of elements. I don't know what you mean by "size". But yes, you need to perform an action, such as count to get anything out of the RDD. You see, no work at all is performed until you perform an action. (When you perform count, only the per-partition count will be sent back to the driver, of course, not "all the information".)
Usually, the file (or parts of the file, if it's too big) is replicated to N nodes in the cluster (by default N=3 on HDFS). The intention is not to split every file between all available nodes.
However, for you (i.e. the client) working with file using Spark should be transparent - you should not see any difference in rdd.size, no matter on how many nodes it's split and/or replicated. There are methods (at least, in Hadoop) to find out on which nodes (parts of the) file can be located at the moment. However, in simple cases you most probably won't need to use this functionality.
UPDATE: an article describing RDD internals: https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf

Spark: increase number of partitions without causing a shuffle?

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an additional job stage).
I would like to do the opposite sometimes, but repartition induces a shuffle. I think a few months ago I actually got this working by using CoalescedRDD with balanceSlack = 1.0, so what would happen is that it would split a partition such that the resulting partitions' locations were all on the same node (so little network IO).
This kind of functionality is automatic in Hadoop; one just tweaks the split size. It doesn't seem to work this way in Spark unless one is decreasing the number of partitions. I think the solution might be to write a custom partitioner along with a custom RDD where we define getPreferredLocations ... but I thought that this is such a simple and common thing to do that surely there must be a straightforward way of doing it?
Things tried:
.set("spark.default.parallelism", partitions) on my SparkConf, and when in the context of reading parquet I've tried sqlContext.sql("set spark.sql.shuffle.partitions= ..., which on 1.0.0 causes an error AND is not really what I want: I want the partition number to change across all types of job, not just shuffles.
Watch this space
https://issues.apache.org/jira/browse/SPARK-5997
This kind of really simple obvious feature will eventually be implemented - I guess just after they finish all the unnecessary features in Datasets.
I do not exactly understand what your point is. Do you mean you now have 5 partitions, but after the next operation you want the data distributed over 10? Because having 10 but still using 5 does not make much sense... The process of sending data to new partitions has to happen at some point.
When doing coalesce, you can get rid of unused partitions. For example, if you initially had 100, but after reduceByKey you got 10 (as there were only 10 keys), you can use coalesce.
If you want the process to go the other way, you could just force some kind of partitioning:
[RDD].partitionBy(new HashPartitioner(100))
I'm not sure that's what you're looking for, but hope so.
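To make the difference between these options concrete, here is a minimal sketch, assuming a local SparkSession and a small made-up pair RDD (the object name IncreasePartitionsSketch is hypothetical). It suggests that coalesce without a shuffle does not raise the partition count, while repartition and partitionBy do, at the cost of a shuffle.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object IncreasePartitionsSketch {
  // Partition counts after: coalesce(8), coalesce(2), repartition(8),
  // partitionBy(HashPartitioner(8)), all starting from 4 partitions.
  def counts(): (Int, Int, Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("increase-partitions-sketch")
      .getOrCreate()
    try {
      val pairs = spark.sparkContext.parallelize(1 to 100, numSlices = 4).map(x => (x % 10, x))
      val coalescedUp = pairs.coalesce(8)     // no shuffle: count stays at 4
      val coalescedDown = pairs.coalesce(2)   // no shuffle: merges down to 2
      val repartitioned = pairs.repartition(8) // shuffle: 8 partitions
      val hashed = pairs.partitionBy(new HashPartitioner(8)) // shuffle by key: 8 partitions
      (coalescedUp.getNumPartitions, coalescedDown.getNumPartitions,
        repartitioned.getNumPartitions, hashed.getNumPartitions)
    } finally spark.stop()
  }
}
```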
As you know, PySpark runs lazily: it only does the computation when there is an action to perform (for example a df.count() or a df.show()). So what you can do is set the number of shuffle partitions between those actions.
You can write :
from pyspark.sql import functions as F

sparkSession.sql("set spark.sql.shuffle.partitions=100")
# your Spark code here, with some transformations and at least one action
df = df.withColumn("sum", F.sum(df.A).over(your_window_function))
df.count() # your action
df = df.filter(df.B < 10)
df.count() # another action; do not reassign df to the count
sparkSession.sql("set spark.sql.shuffle.partitions=10")
# you reduce the number of partitions because you know you will have a lot
# less data
df = df.withColumn("max", F.max(df.A).over(your_other_window_function))
df.count() # your action