Apache Spark : When not to use mapPartition and foreachPartition? - scala

I know that when we want to initialize some resource for a whole group of RDD elements (a partition) instead of for each individual element, we should ideally use mapPartitions and foreachPartition, for example when initializing a JDBC connection once per partition of data. But are there scenarios where we should not use either of them and should instead use the plain vanilla map() transformation and foreach() action?

When you write Spark jobs that use either mapPartitions or foreachPartition, you can only modify the partition's data or iterate over it, respectively. The anonymous function you pass as a parameter is executed on the executors, so there is no viable way to run code that involves all the nodes, e.g. df.reduceByKey, from one particular executor; such code can only be executed from the driver node. In other words, only from the driver code can you access DataFrames, Datasets and the SparkSession.
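For instance, the per-partition resource pattern from the question looks roughly like this (a minimal sketch, assuming an RDD of string values called rdd; the JDBC URL, credentials and table are hypothetical):

import java.sql.DriverManager

rdd.foreachPartition { iter =>
  // One connection per partition, created on the executor that processes it.
  val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO events(value) VALUES (?)")
  try {
    iter.foreach { value =>
      stmt.setString(1, value)
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}

If the per-element work needs no shared resource like this, plain map()/foreach() are simpler and do the same job.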
Please see here for a detailed discussion of this issue and possible solutions.

Related

How to use Broadcast variable correctly in Spark GraphX?

I use GraphX for processing a graph. I used GraphLoader to load it, and I created a variable that contains the neighbors of each node using the code below:
val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either).cache()
Because I frequently need the nodes' neighbors, I decided to broadcast them. When I use this code I get an error:
val broadcastVar = sc.broadcast(all_neighbors)
but when I use this code there is no error:
val broadcastVar = sc.broadcast(all_neighbors.collect())
Is it right to use collect() for broadcasting?
And one more question: I want to change this broadcast variable into key/value form. Is the following code right?
val nvalues = broadcastVar.value.toMap
Does the above code (I mean nvalues) broadcast to all the slaves in the cluster? Should I broadcast nvalues too? I am a little bit confused by the broadcast subject. Please help me with this problem.
There are two questions:
Is it right to use collect() for broadcasting?
all_neighbors is of type VertexRDD, which is essentially an RDD. There is nothing in an RDD itself that you could broadcast: an RDD is a data structure that describes a distributed computation over some dataset, i.e. what to compute and how. It's an abstract entity. You can only broadcast a concrete value, but an RDD is just a handle to values that only become available when the executors process their data.
quoting from Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
That's the reason we need to collect() the dataset that the RDD holds, which converts it into a locally-available collection that can then be broadcast.
Note: when you perform a collect operation, the data is accumulated on the driver node and then broadcast, so if the driver node does not have enough memory it will throw errors.
Does the above code (I mean nvalues) broadcast to all slaves in the cluster? Should I broadcast nvalues too?
It totally depends on your use case. If you only want to use broadcastVar, then broadcast only that; if you want to use nvalues, broadcast only nvalues; or you can broadcast both values, though you need to be careful about memory constraints.
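Putting the two parts together, a minimal sketch (assuming the graph and sc from the question):

import org.apache.spark.graphx.{EdgeDirection, VertexId, VertexRDD}

val all_neighbors: VertexRDD[Array[VertexId]] =
  graph.collectNeighborIds(EdgeDirection.Either).cache()

// collect() turns the distributed VertexRDD into a local Array on the driver;
// that concrete value is what can be broadcast.
val local: Array[(VertexId, Array[VertexId])] = all_neighbors.collect()

// Broadcasting the Map directly gives every executor a read-only
// vertex -> neighbors lookup without re-shipping it with each task.
val nvaluesBc = sc.broadcast(local.toMap)

// Inside any task: nvaluesBc.value.get(someVertexId)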
Let me know if it helps!!

How to `reduce` only within partitions in Spark Streaming, perhaps using combineByKey?

I have data already sorted by key into my Spark Streaming partitions by virtue of Kafka, i.e. keys found on one node are not found on any other nodes.
I would like to use Redis and its incrby (increment by) command as a state engine, and to reduce the number of requests sent to Redis I would like to partially reduce my data by doing a word count on each worker node by itself. (The key is tag+timestamp, to obtain my functionality from word count.)
I would like to avoid shuffling and let redis take care of adding data across worker nodes.
Even though I have checked that the data is cleanly split among worker nodes, .reduce(_ + _) (Scala syntax) takes a long time (several seconds vs. sub-second for map tasks), as the HashPartitioner seems to shuffle my data to a random node to add it there.
How can I write a simple word-count reduce on each partition, without triggering the shuffle step, in Scala with Spark Streaming?
Note DStream objects lack some RDD methods, which are available only through the transform method.
It seems I might be able to use combineByKey. I would like to skip the mergeCombiners() step and instead leave accumulated tuples where they are.
The book "Learning Spark" enigmatically says:
We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it. For example, groupByKey() disables map-side aggregation as the aggregation function (appending to a list) does not save any space. If we want to disable map-side combines, we need to specify the partitioner; for now you can just use the partitioner on the source RDD by passing rdd.partitioner.
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
The book then continues to supply no syntax for how to do this, nor have I had any luck with google so far.
What is worse, as far as I know, the partitioner is not set for DStream RDDs in Spark Streaming, so I don't know how to supply a partitioner to combineByKey that doesn't end up shuffling data.
Also, what does "map-side" actually mean and what consequences does mapSideCombine = false have, exactly?
The scala implementation for combineByKey can be found at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
Look for combineByKeyWithClassTag.
If the solution involves a custom partitioner, please include also a code sample for how to apply that partitioner to the incoming DStream.
This can be done using mapPartitions, which takes a function that maps an iterator of the input RDD on one partition to an iterator over the output RDD.
To implement a word count, I map to _._2 to remove the Kafka key and then perform a fast iterator word count using foldLeft, initializing a mutable.HashMap, which then gets converted to an Iterator to form the output RDD.
import scala.collection.mutable

val myDstream = messages
  .mapPartitions { it =>
    it.map(_._2)                                   // drop the Kafka key, keep the value
      .foldLeft(new mutable.HashMap[String, Int]())(
        (count, key) => count += (key -> (count.getOrElse(key, 0) + 1))
      )
      .toIterator                                  // emit this partition's (word, count) pairs
  }
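For reference, the combineByKey overload that the "Learning Spark" excerpt alludes to looks roughly like this (a sketch over a plain RDD of (word, 1) pairs rather than the DStream above, based on the signature in PairRDDFunctions; whether it actually avoids a shuffle depends on the RDD already carrying the partitioner you pass in):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def partitionLocalCount(pairs: RDD[(String, Int)]): RDD[(String, Int)] = {
  // Reuse the source RDD's partitioner if it has one; otherwise this will still shuffle.
  val partitioner = pairs.partitioner.getOrElse(new HashPartitioner(pairs.partitions.length))
  pairs.combineByKey(
    (v: Int) => v,                   // createCombiner
    (acc: Int, v: Int) => acc + v,   // mergeValue
    (a: Int, b: Int) => a + b,       // mergeCombiners (still required by the signature)
    partitioner,
    mapSideCombine = false           // the switch the book mentions
  )
}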

Apache-Spark Internal Job Scheduling

I came across the feature in Spark that allows you to schedule different tasks within a Spark context.
I want to use this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K, V], subsequently make a composite-key RDD [(K1, K2), V] and a filtered RDD containing some specific values.
Further pipeline involves calling some statistical methods from MLlib on both the RDDs and a join operation followed by externalizing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job scheduling documentation but got more confused by the concepts of pools, users and tasks.
What exactly are pools: are they certain 'tasks' which can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads, or is it something like SQL-context queries?
I guess it relates to how tasks are scheduled within a Spark context, but reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
The whole pipelined procedure you described in paragraph 2:
map -> map -> map -> filter
will be handled in a single stage, much like a single map() in MapReduce, if that is familiar to you. This is because there is no need to repartition or shuffle your data: you place no requirements on the correlation between records, so Spark simply chains as many transformations as possible into the same stage before creating a new one, since that is much more lightweight. More information on stage separation can be found in the Resilient Distributed Datasets paper, Section 5.1 (Job Scheduling).
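For illustration, the narrow-transformation chain from the question might look like this (a sketch; parseKey, parseValue, secondaryKey and isInteresting are hypothetical helpers standing in for your parsing and filtering logic):

val input = sc.textFile("path/to/input")                              // RDD[String]
val kv = input.map(line => (parseKey(line), parseValue(line)))        // RDD[(K, V)]
val composite = kv.map { case (k, v) => ((k, secondaryKey(v)), v) }   // RDD[((K1, K2), V)]
val filtered = composite.filter { case (_, v) => isInteresting(v) }
// All of these are narrow transformations, so Spark pipelines them into one stage;
// no shuffle happens until a wide operation such as the later join appears.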
When the stage gets executed, it becomes one task set (the same task running in different threads, one per partition), and its tasks are scheduled simultaneously from Spark's perspective.
The fair scheduler, on the other hand, is meant to schedule unrelated task sets, so it is not applicable here.

Spark streaming merge data

My understanding is that Spark Streaming serialises the closure (e.g. map, filter, etc.) and executes it on worker nodes (as explained here). Is there some way of sending the results back to the driver program and performing further operations on the local machine?
In our specific use case, we are trying to turn the results produced by Spark into an observable stream (using RxScala).
Someone posted a comment but deleted it afterwards. They suggested using collect() on an RDD. A simple test showed that collect gathers the data from the worker nodes and makes it available on the driver node; exactly what I needed.
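A minimal sketch of that approach in a streaming job (pushToObservable is a hypothetical placeholder for whatever driver-side sink, e.g. an RxScala Subject, you feed):

dstream.foreachRDD { rdd =>
  // The body of foreachRDD runs on the driver; collect() pulls this
  // micro-batch's data back from the executors.
  val localBatch = rdd.collect()
  localBatch.foreach(pushToObservable)   // hypothetical driver-side hand-off
}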

Spark broadcast error: exceeds spark.akka.frameSize Consider using broadcast

I have a large dataset called "edges"
org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[(String, Int)]] = MappedRDD[27] at map at <console>:52
When I was working in standalone mode, I was able to collect, count and save this file. Now, on a cluster, I'm getting this error
edges.count
...
Serialized task 28:0 was 12519797 bytes which exceeds spark.akka.frameSize
(10485760 bytes). Consider using broadcast variables for large values.
Same with .saveAsTextFile("edges")
This is from the spark-shell. I have tried using the option
--driver-java-options "-Dspark.akka.frameSize=15"
But when I do that, it just hangs indefinitely. Any help would be appreciated.
** EDIT **
My standalone mode was on Spark 1.1.0 and my cluster is Spark 1.0.1.
Also, the hanging occurs when I go to count, collect or saveAs* the RDD, but defining it or doing filters on it works just fine.
The "Consider using broadcast variables for large values" error message usually indicates that you've captured some large variables in function closures. For example, you might have written something like
val someBigObject = ...
rdd.mapPartitions { x => doSomething(someBigObject, x) }.count()
which causes someBigObject to be captured and serialized with your task. If you're doing something like that, you can use a broadcast variable instead, which will cause only a reference to the object to be stored in the task itself, while the actual object data will be sent separately.
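The broadcast-variable version of that snippet would look roughly like this (same hypothetical someBigObject and doSomething as above):

val bigBc = sc.broadcast(someBigObject)
// Only the small broadcast handle is serialized with each task;
// the object itself is shipped to each executor separately, once.
rdd.mapPartitions { x => doSomething(bigBc.value, x) }.count()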
In Spark 1.1.0+, it isn't strictly necessary to use broadcast variables for this, since tasks will automatically be broadcast (see SPARK-2521 for more details). There are still reasons to use broadcast variables (such as sharing a big object across multiple actions / jobs), but you won't need them just to avoid frame size errors.
Another option is to increase the Akka frame size. In any Spark version, you should be able to set the spark.akka.frameSize setting in SparkConf prior to creating your SparkContext. As you may have noticed, though, this is a little harder in spark-shell, where the context is created for you. In newer versions of Spark (1.1.0 and higher), you can pass --conf spark.akka.frameSize=16 when launching spark-shell. In Spark 1.0.1 or 1.0.2, you should be able to pass --driver-java-options "-Dspark.akka.frameSize=16" instead.
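For completeness, setting the frame size programmatically in your own application would look like this (a sketch; the app name is hypothetical, and the value is interpreted in MB):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("edges-job")                 // hypothetical app name
  .set("spark.akka.frameSize", "16")       // MB; must be set before the context is created
val sc = new SparkContext(conf)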