Get current number of running containers in Spark on YARN - scala

I have a Spark application running on top of YARN.
Given an RDD, I need to execute queries against the database.
The problem is that I have to set the connection options properly, otherwise the database will be overloaded. These options depend on the number of workers that query the DB simultaneously. To solve this, I want to detect the current number of running workers at runtime (from a worker).
Something like this:
val totalDesiredQPS = 1000 // queries per second
val queries: RDD[String] = ???
queries.mapPartitions(it => {
  val dbClientForThisWorker = ...
  // TODO: get this information from YARN somehow
  val numberOfContainers = ???
  dbClientForThisWorker.setQPS(totalDesiredQPS / numberOfContainers)
  it.map(query => dbClientForThisWorker.executeAsync...)
})
I would also appreciate alternative solutions, but I want to avoid a shuffle and get almost full DB utilization no matter how many workers there are.
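This does not answer where numberOfContainers comes from, but the division step in the snippet can be sketched on its own. The helper below is hypothetical (QpsBudgetSketch and perWorkerQps are names made up for illustration) and assumes the worker count is already known:

```scala
object QpsBudgetSketch {
  // Splits a global queries-per-second budget evenly across `workers`.
  // Integer division rounds down, so the shares normally sum to at most the total.
  // Every worker gets at least 1 QPS, so with more workers than budget
  // the combined rate can exceed the total.
  def perWorkerQps(totalQps: Int, workers: Int): Int = {
    require(totalQps > 0 && workers > 0, "both values must be positive")
    math.max(1, totalQps / workers)
  }
}
```

For example, perWorkerQps(1000, 3) gives 333, so three workers together issue 999 queries per second, slightly under the budget.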


Spark - parallel computation for different dataframes

A premise: this question might sound idiotic, but I guess I fell into confusion and/or ignorance.
The question is: does Spark already optimize its physical plan so that computations on unrelated dataframes run in parallel? If not, would it be advisable to try and parallelize such processes? Example below.
Let's assume I have the following scenario:
val df1 = read table into dataframe
val df2 = read another table into dataframe
val aTransformationOnDf1 = df1.filter(condition).doSomething
val aSubSetOfTransformationOnDf1 = aTransformationOnDf1.doSomeOperations
// Push to Kafka
val anotherTransformationOnDf1WithDf2 = df1.filter(anotherCondition).join(df2).doSomethingElse
val yetAnotherTransformationOnDf1WithDf2 = df1.filter(aThirdCondition).join(df2).doAnotherThing
val unionAllTransformation = aTransformationOnDf1.union(anotherTransformationOnDf1WithDf2).union(yetAnotherTransformationOnDf1WithDf2)
Basically I have two initial dataframes. One is an event log with past events and new events to process. As an example:
a subset of these new events must be processed and pushed to Kafka.
a subset of the past events could have updates, so they must be processed alone
another subset of the past events could have another kind of updates, so they must be processed alone
In the end, all processed events are unified in one dataframe to be written back to the events' log table.
Question: does Spark process the different subsets in parallel or sequentially (with only the computation within each individual dataframe being distributed)?
If not, could we enforce parallel computation of each individual subset before the union? I know Scala has Future, though I have never used it.
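Something like:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}
def unionAllDataframes(df1: DataFrame, df2: DataFrame, df3: DataFrame): Future[DataFrame] = {
  Future { df1.union(df2).union(df3) }
}
// At the end
val finalDf = unionAllDataframes(aTransformationOnDf1, anotherTransformationOnDf1WithDf2, yetAnotherTransformationOnDf1WithDf2)
finalDf.onComplete {
  case Success(df) => df.write(etc...)
  case Failure(exception) => handleException(exception)
}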
Sorry for the horrendous design and probably the wrong usage of Future. Once again, I am a bit confused about this scenario and I am trying to micro-optimize this step (if possible).
Thanks a lot in advance!
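On the Future usage: DataFrame transformations such as union are lazy, so wrapping only the union in a Future does not run any work by itself. It is the blocking actions (writes, pushes) that would go inside Futures. A minimal pure-Scala sketch of that pattern follows; processSubset is a hypothetical stand-in for such a blocking action and is not a Spark API:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.Try

object ParallelActionsSketch {
  // Hypothetical stand-in for a blocking action such as a write; not a Spark API.
  def processSubset(name: String): String = s"$name-done"

  // Starts one Future per subset so the blocking calls can run on separate threads,
  // then gathers the results in the original order.
  def runAll(names: Seq[String])(implicit ec: ExecutionContext): Future[Seq[String]] =
    Future.sequence(names.map(n => Future(processSubset(n))))

  // Blocks until everything has finished; a failure in any Future ends up in the Try.
  def runAndWait(names: Seq[String], timeout: Duration = 1.minute): Try[Seq[String]] = {
    import ExecutionContext.Implicits.global
    Try(Await.result(runAll(names), timeout))
  }
}
```

In a Spark job each Future body would hold one action on one subset. Whether that actually speeds things up depends on the cluster having idle resources, so treat it as something to measure rather than a guaranteed win.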

Calling a rest service from Spark

I'm trying to figure out the best approach to call a REST endpoint from Spark.
My current approach (solution [1]) looks something like this:
val df = ... // some dataframe
val repartitionedDf = df.repartition(numberPartitions)
lazy val restEndPoint = new restEndPointCaller() // lazy evaluation of the object which creates the connection to REST. lazy vals are also initialized once per JVM (executor)
val enrichedDf = repartitionedDf
.map(rec => restEndPoint.getResponse(rec)) // calls the rest endpoint for every record
I know I could have used .mapPartitions() instead of .map(), but looking at the DAG, it looks like Spark optimizes the repartition -> map to a mapPartitions anyway.
In this second approach (solution [2]), a connection is created once for every partition and reused for all records within the partition.
val newDs = myDs.mapPartitions(partition => {
  val restEndPoint = new restEndPointCaller /* creates a connection per partition */
  val newPartition = partition.map(record => {
    restEndPoint.getResponse(record)
  }).toList // consumes the iterator, thus calls the REST endpoint for every record
  restEndPoint.close() // close the connection here
  newPartition.iterator // create a new iterator
})
In this third approach (solution [3]), a connection is created once per JVM (executor) and reused across all partitions processed by the executor.
lazy val connection = new DbConnection /* intended: one db connection per JVM (executor) */
val newDs = myDs.mapPartitions(partition => {
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  }).toList // consumes the iterator, thus calls readMatchingFromDB
  newPartition.iterator // create a new iterator
})
connection.close() // close dbconnection here
[a] With Solutions [1] and [3], which are very similar, is my understanding of how lazy val works correct? The intention is to restrict the number of connections to 1 per executor/JVM and reuse the open connections for processing subsequent requests. Will I be creating 1 connection per JVM or 1 connection per partition?
[b] Are there any other ways by which I can control the number of requests (RPS) we make to the REST endpoint?
[c] Please let me know if there are better and more efficient ways to do this.
IMO the second solution with mapPartitions is better. First, you explicitly state what you expect to achieve: the name of the transformation and the implemented logic make it pretty clear. For the first option you need to be aware of how Apache Spark optimizes the processing. It may be obvious to you right now, but you should also think about the people who will work on your code, or simply about yourself in 6 months, 1 year, 2 years and so forth. They will understand mapPartitions more easily than repartition + map.
Moreover, the optimization of repartition followed by map may change internally (I don't believe it will, but you can still consider it a valid point), and at that moment your job will perform worse.
Finally, with the 2nd solution you avoid a lot of problems that you can encounter with serialization. In the code you wrote, the driver will create one instance of the endpoint object, serialize it and send it to the executors. So yes, maybe it'll be a single instance, but only if it's serializable.
Thanks for the clarification. You can achieve what you are looking for in different ways. To have exactly 1 connection per JVM you can use a design pattern called singleton. In Scala it's expressed pretty easily as an object.
And that's pretty good because you don't need to serialize anything. Singletons are loaded directly from the classpath on the executor side. With it you're sure to have exactly one instance of a given object.
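A minimal sketch of the singleton idea, runnable without Spark. RestClient, RestClientHolder and the created counter are hypothetical names used only for illustration; RestClient stands in for whatever connection class is actually used:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a real REST/DB client; not part of any library.
class RestClient {
  RestClientHolder.created.incrementAndGet() // count constructions for the demo
  def getResponse(record: String): String = s"response-for-$record"
  def close(): Unit = ()
}

// A Scala `object` is initialized at most once per JVM (per class loader),
// so `client` is built lazily on first use and then shared by every caller
// running in that JVM.
object RestClientHolder {
  val created = new AtomicInteger(0)
  lazy val client: RestClient = new RestClient
}
```

Inside a job this would be referenced from the closure, for example myDs.mapPartitions(partition => partition.map(record => RestClientHolder.client.getResponse(record))), so nothing but the reference to the object's class is shipped to the executors.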
[a] With Solutions [1] and [3] which are very similar, is my
understanding of how lazy val work correct? The intention is to
restrict the number of connections to 1 per executor/ JVM and reuse
the open connections for processing subsequent requests. Will I be
creating 1 connection per JVM or 1 connection per partition?
It'll create 1 connection per partition. You can execute this small test to see that:
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FlatSpec // ScalaTest 3.0.x and earlier

class SerializationProblemsTest extends FlatSpec {
  val conf = new SparkConf().setAppName("Spark serialization problems test").setMaster("local")
  val sparkContext = SparkContext.getOrCreate(conf)

  "lazy object" should "be created once per partition" in {
    lazy val restEndpoint = new NotSerializableRest()
    sparkContext.parallelize(0 to 120).repartition(12)
      .mapPartitions(numbers => {
        //val restEndpoint = new NotSerializableRest()
        numbers.map(nr => restEndpoint.enrich(nr))
      })
      .collect() // an action is needed to trigger the evaluation
  }
}

class NotSerializableRest() {
  println("Creating REST instance")
  def enrich(id: Int): String = s"${id}"
}
It should print "Creating REST instance" 12 times (the number of partitions).
[b] Are there ways by which I can control the number of requests (RPS)
we make to the rest endpoint ?
To control the number of requests you can use an approach similar to database connection pools: an HTTP connection pool (one quickly found link: HTTP connection pooling using HttpClient).
But maybe another valid approach would be to process smaller subsets of data? So instead of taking 30000 rows to process, you can split them into smaller micro-batches (if it's a streaming job). It should give your web service a little bit more "rest".
Otherwise you can also try to send bulk requests (Elasticsearch does this to index/delete multiple documents at once). But it's up to the web service to allow you to do so.
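The bulk-request idea can be sketched with plain Scala iterators, which is what a mapPartitions body receives. Everything here is an assumption for illustration: sendBulk is a hypothetical bulk call (the real service would have to offer one), and pauseMillis is only a crude throttle, not a proper rate limiter:

```scala
object BulkRequestSketch {
  // Hypothetical bulk call; a real service would need to expose such an endpoint.
  def sendBulk(batch: Seq[String]): Seq[String] = batch.map(r => s"enriched-$r")

  // Groups the records of one partition into batches of `batchSize`
  // and issues one (hypothetical) bulk request per batch.
  // `pauseMillis` sleeps before each batch as a crude throttle; 0 disables it.
  def enrichPartition(records: Iterator[String],
                      batchSize: Int,
                      pauseMillis: Long = 0L): Iterator[String] =
    records.grouped(batchSize).flatMap { batch =>
      if (pauseMillis > 0) Thread.sleep(pauseMillis)
      sendBulk(batch)
    }
}
```

With batchSize = 100, a partition of 30000 rows would result in 300 calls instead of 30000, and the result stays a lazy iterator, so the partition is not materialized as a list.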

Active executors on one spark partition

Is there any possibility that multiple executors on the same node work on the same partition, for example during a reduceByKey, on Spark 1.6.2?
I have results that I don't understand. After the reduceByKey, when I look at the keys, the same key appears multiple times, as many times as the number of executors per node, I suppose. Moreover, when I kill one of the two slaves I observe the same result.
Each key appears 2 times; I presume this is due to the number of executors per node, which is set to 2 by default.
val rdd = sc.parallelize(1 to 1000).map(x=>(x%5,x))
val rrdd = rdd.reduceByKey(_+_)
And I obtain
rrdd.count = 10
Rather than what I expect, which is
rrdd.count = 5
I tried this
val rdd2 = rdd.partitionBy(new HashPartitioner(8))
val rrdd = rdd2.reduceByKey(_+_)
And that one
val rdd3 = rdd.reduceByKey(new HashPartitioner(8), _+_)
Without obtaining what I want.
Of course I can decrease the number of executors to one, but we will lose efficiency with more than 5 cores per executor.
I tried the code above in spark-shell locally and it works like a charm, but when it runs on a cluster it fails...
I'm now wondering whether a partition that is too big gets divided among other nodes, which can be a good strategy depending on the case; not mine, obviously ;)
So I humbly ask for your help to solve this little mystery.

Spark Streaming: How to change the value of external variables in foreachRDD function?

The code for testing:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object MaxValue extends Serializable {
  var max = 0
}

object Test {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext
    val ssc = new StreamingContext(sc, Seconds(5))
    val seq = Seq("testData")
    val rdd = ssc.sparkContext.parallelize(seq)
    val inputDStream = new ConstantInputDStream(ssc, rdd)
    inputDStream.foreachRDD(rdd => { MaxValue.max = 10 }) // I change MaxValue.max to 10.
    val map = inputDStream.map(a => MaxValue.max)
    map.print // Why is the result 0? Why not 10?
    ssc.start()
    ssc.awaitTermination()
  }
}
In this case, how can I change the value of MaxValue.max in foreachRDD()? The result of map.print is 0; why not 10? I want to use RDD.max() in foreachRDD(), so I need to change the value of MaxValue.max in foreachRDD().
Could you help me? Thank you!
This is not possible. Remember, operations inside an RDD method run distributed, so the change to MaxValue.max will only be executed on the worker, not the driver. If you describe what you are trying to do, that may lead to a better solution, perhaps using accumulators.
In general it is better to avoid accumulating values this way; mechanisms such as accumulators or updateStateByKey do this properly.
To give a better perspective of what is happening in your code, let's say you have 1 driver and multiple partitions distributed on multiple executors (the most typical scenario).
Runs on driver
inputDStream.foreachRDD(rdd => { MaxValue.max = 10 })
The block of code within foreachRDD runs on the driver, so it updates the MaxValue object on the driver.
Runs on executors
val map = inputDStream.map(a => MaxValue.max)
This runs the lambda on each executor individually, so it reads the value of MaxValue on the executors (which was never updated there). Also please note that each executor has its own copy of the MaxValue object, as each of them lives in a separate JVM process (most often on separate nodes within the cluster, too).
When you change your code to
val map = inputDStream.map(a => {MaxValue.max=10; MaxValue.max})
you are actually updating MaxValue on the executors and then reading it on the executors as well, so it works.
This should work as well:
val map = inputDStream.map(a => {MaxValue.max=10; a}).map(a => MaxValue.max)
However, if you do something like:
val map = inputDStream.map(a => {MaxValue.max= new Random().nextInt(10); a}).map(a => MaxValue.max)
you should get a set of records with 4 different integers (each partition will have a different MaxValue).
Unexpected results
local mode
A good reason to avoid this is that you can get even less predictable results depending on the situation. For example, if you run your original code, which returns 0 on a cluster, it will return 10 in local mode, because in that case the driver and all partitions live in a single JVM process and share this object. So you could even write unit tests for such code and feel safe, but start getting problems once you deploy to a cluster.
Jobs scheduling order
For this one I'm not 100% sure (I'm still trying to find it in the source code), but there is a possibility of another problem. In your code you will have 2 jobs: one is based on the output of inputDStream.foreachRDD, the other on the output of map.print. Although they initially use the same stream, Spark will generate two separate DAGs for them and schedule two separate jobs that it can treat totally independently. In fact, it doesn't even have to guarantee the order of execution of the jobs (it does, obviously, guarantee the order of execution of stages within a job), so in theory it can run the 2nd job before the 1st, making the results even less predictable.

Cassandra insert performance using spark-cassandra connector

I am a newbie to Spark and Cassandra. I am trying to insert into a Cassandra table using the spark-cassandra connector as below:
import java.util.UUID
import org.apache.spark.{SparkContext, SparkConf}
import org.joda.time.DateTime
import com.datastax.spark.connector._

case class TestEntity(id: UUID, category: String, name: String, value: Double, createDate: DateTime, tag: Long)

object SparkConnectorContext {
  val conf = new SparkConf(true).setMaster("local")
    .set("", "")
  val sc = new SparkContext(conf)
}

object TestRepo {
  def insertList(list: List[TestEntity]) = {
    SparkConnectorContext.sc.parallelize(list).saveToCassandra("testKeySpace", "testColumnFamily")
  }
}

object TestApp extends App {
  val start = System.currentTimeMillis()
  TestRepo.insertList(entities) // entities: the list of 100 TestEntity values (construction not shown)
  val end = System.currentTimeMillis()
  val timeDiff = end - start
  println("Difference (in millis)= " + timeDiff)
}
When I insert using the above method (a list with 100 entities), it takes 300-1100 milliseconds.
I tried inserting the same data using the phantom library. It takes less than 20-40 milliseconds.
Can anyone tell me why the Spark connector takes this much time for an insert? Am I doing anything wrong in my code, or is it not advisable to use the spark-cassandra connector for insert operations?
It looks like you are including the parallelize operation in your timing. Also since you have your spark worker running on a different machine than Cassandra, the saveToCassandra operation will be a write over the network.
Try configuring your system to run the spark workers on the Cassandra nodes. Then create an RDD in a separate step and invoke an action like count() on it to load the data into memory. Also you might want to persist() or cache() the RDD to make sure it stays in memory for the test.
Then time just the saveToCassandra of that cached RDD.
You might also want to look at the repartitionByCassandraReplica method offered by the Cassandra connector. That would partition the data in the RDD based on which Cassandra node the writes need to go to. In that way you exploit data locality and often avoid doing writes and shuffles over the network.
There are some serious problems with your "benchmark":
Your data set is so small that you're mostly measuring the job setup time. Saving 100 entities should take on the order of single milliseconds on a single node, not seconds. Also, saving 100 entities gives the JVM no chance to compile the code you run to optimized machine code.
You included Spark context initialization in your measurement. The JVM loads classes lazily, so the code for Spark initialization is really called after the measurement is started. This is an extremely costly element, typically performed only once per whole Spark application, not even per job.
You're performing the measurement only once per launch. This means you're even measuring the Spark context setup and job setup time incorrectly, because the JVM has to load all the classes for the first time and HotSpot probably has no chance to kick in.
To summarize, you're very likely measuring mostly class loading time, which is dependent on the size and number of classes loaded. Spark is quite a large thing to load and a few hundred milliseconds are not surprising at all.
To measure insert performance correctly:
use larger data set
exclude one-time setup from the measurement
do multiple runs sharing the same spark context and discard a few initial ones, until you reach steady state performance.
BTW, if you enable the debug logging level, the connector logs the insert times for every partition in the executor logs.
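The measurement advice above (exclude one-time setup, do multiple runs, discard the initial ones) can be sketched as a small timing helper. This is only a sketch: BenchmarkSketch, measure and median are hypothetical names, and a dedicated benchmarking tool would be more rigorous:

```scala
object BenchmarkSketch {
  // Runs `body` `warmup + measured` times and returns the durations (in ms)
  // of the measured runs only. Warm-up runs are discarded so that class
  // loading and JIT compilation do not dominate the numbers.
  def measure(warmup: Int, measured: Int)(body: => Unit): Seq[Double] = {
    (1 to warmup).foreach(_ => body)
    (1 to measured).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }
  }

  // Median of a non-empty sequence; less sensitive to outliers than the mean.
  def median(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "median of an empty sequence is undefined")
    val sorted = xs.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }
}
```

Applied to the question, the body would be just the saveToCassandra call on an RDD that was created and cached beforehand, with the same SparkContext shared by all runs, for example BenchmarkSketch.measure(warmup = 5, measured = 20) { rdd.saveToCassandra("testKeySpace", "testColumnFamily") }. The run counts are arbitrary.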