Is there a hook for Executor Startup in Spark? - scala

So, basically I want multiple tasks running on the same node/executor to read data from shared memory. For that I need some initialization function that loads the data into memory before the tasks are started. If Spark provided a hook for executor startup, I could put this initialization code in that callback function, with the tasks only running after this startup has completed.
So, my question is: does Spark provide such a hook? If not, how else can I achieve the same?

Spark's solution for "shared data" is broadcast: you load the data once in the driver application, and Spark serializes it and sends it to each of the executors (once). If a task uses that data, Spark will make sure it's there before the task is executed. For example:
object MySparkTransformation {
  def transform(rdd: RDD[String], sc: SparkContext): RDD[Int] = {
    val mySharedData: Map[String, Int] = loadDataOnce()
    val broadcast = sc.broadcast(mySharedData)
    rdd.map(r => broadcast.value(r))
  }
}
Alternatively, if you want to avoid reading the data into driver memory and sending it over to the executors, you can use lazy values in a Scala object to create a value that gets populated once per JVM, which in Spark's case is once per executor. For example:
// must be an object, otherwise it will be serialized and sent from the driver
object MySharedResource {
  lazy val mySharedData: Map[String, Int] = loadDataOnce()
}

// If you use mySharedData in a Spark transformation,
// the "local" copy in each executor will be used:
object MySparkTransformation {
  def transform(rdd: RDD[String]): RDD[Int] = {
    // Spark won't include MySharedResource.mySharedData in the
    // serialized task sent from the driver, since it's "static"
    rdd.map(r => MySharedResource.mySharedData(r))
  }
}
In practice, you'll have one copy of mySharedData in each executor.

You don't have to run multiple instances of the app to be able to run multiple tasks (i.e. one app instance per Spark task). The same SparkSession object can be used by multiple threads to submit Spark tasks in parallel.
So it may work like this:
The application starts up and runs an initialization function that loads the shared data into memory, say into a SharedData class object.
A SparkSession is created.
A thread pool is created; each thread has access to the (SparkSession, SharedData) objects.
Each thread creates Spark tasks using the shared SparkSession and SharedData objects.
Depending on your use case, the application then does one of the following:
waits for all tasks to complete and then closes the SparkSession, or
waits in a loop for new requests to arrive and creates new Spark tasks as necessary using threads from the thread pool.
SparkContext (sparkSession.sparkContext) is useful when you want to do per-thread things like assigning a description to a job using setJobDescription, or assigning a group to a job using setJobGroup so that related jobs can be cancelled together using cancelJobGroup. You can also tweak the priority of jobs that use the same scheduler pool; see https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application for details.
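A rough sketch of that pattern follows; SharedData and the per-thread work are hypothetical, and only SparkSession, setJobGroup and the thread-pool plumbing are the actual APIs mentioned above:
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.sql.SparkSession

object ParallelJobs {
  case class SharedData(lookup: Map[String, Int]) // hypothetical shared state loaded at startup

  def main(args: Array[String]): Unit = {
    val shared = SharedData(Map("a" -> 1, "b" -> 2)) // initialization runs before any Spark job
    val spark = SparkSession.builder().appName("parallel-jobs").getOrCreate()

    // Thread pool whose threads all share the same SparkSession and SharedData
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    val jobs = (1 to 4).map { i =>
      Future {
        // setJobGroup is per-thread, so each thread tags (and can cancel) its own jobs
        spark.sparkContext.setJobGroup(s"request-$i", s"job for request $i", interruptOnCancel = true)
        spark.sparkContext.parallelize(1 to 1000).map(_ + shared.lookup.size).sum()
      }
    }

    jobs.foreach(Await.ready(_, Duration.Inf)) // wait for all jobs, then shut down
    spark.stop()
    pool.shutdown()
  }
}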

Related


In Spark, how objects and variables are kept in memory and across different executors?
I am using:
Spark 3.0.0
Scala 2.12
I am writing a Spark Structured Streaming job with a custom stream source. Before the execution of the Spark query, I create a bunch of metadata which is used by my Spark streaming job.
I am trying to understand how this metadata is kept in memory across the different executors.
Example Code:
case class JobConfig(fieldName: String, displayName: String, castTo: String)

val jobConfigs: List[JobConfig] = build() // build the job configs

val query = spark
  .readStream
  .format("custom-streaming")
  .load

query
  .writeStream
  .trigger(Trigger.ProcessingTime(2, TimeUnit.MINUTES))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // CustomJobExecutor does data frame transformations and saves the data in PostgreSQL.
    CustomJobExecutor.start(jobConfigs)
  }
  .outputMode(OutputMode.Append())
  .start()
  .awaitTermination()
I need help understanding the following:
In the sample code, how will Spark keep "jobConfigs" in memory across different executors?
Is there any added advantage to broadcasting?
What is an efficient way of keeping variables which can't be deserialized?
Local variables are copied for each task, whereas broadcast variables are copied only once per executor. From the docs:
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This means that if your jobConfigs is large enough and the number of tasks and stages where the variable is used is significantly larger than the number of executors, or if deserialization is time-consuming, then broadcast variables can make a difference. In other cases, they don't.
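For reference, a minimal sketch of what explicit broadcasting would look like here; it reuses spark, jobConfigs, JobConfig and batchDF from the question, and the map over batchDF.rdd is purely illustrative:
import org.apache.spark.broadcast.Broadcast

// Broadcast once from the driver; each executor fetches and caches a single copy
// instead of receiving one copy of the list per task closure.
val jobConfigsBc: Broadcast[List[JobConfig]] = spark.sparkContext.broadcast(jobConfigs)

// Closures that run on the executors read the executor-local copy via .value:
val castTargets = batchDF.rdd.map { row =>
  jobConfigsBc.value.map(_.castTo) // no per-task copy of the full list is shipped
}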

Spark Streaming - Refresh Static Data

I have a Spark Streaming job which, when it starts, queries Hive and creates a Map[Int, String] object that is then used for parts of the calculations the job performs.
The problem I have is that the data in Hive can potentially change every 2 hours. I would like to have the ability to refresh the static data on a schedule, without having to restart the Spark job every time.
The initial load of the Map object takes around 1 minute.
Any help is very welcome.
You can use a listener, which will be triggered every time a job is started for any stream within the Spark context. Since your database is only updated every two hours, there is no harm in refreshing the data every time, AFAIK.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

sc.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // load the data into the map that will be sent to the executors
  }
})
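A rough sketch of how the reloaded map could actually reach the executors; RefData and loadFromHive are hypothetical names, ssc is assumed to be the StreamingContext and stream a DStream[Int] of keys. The idea is that foreachRDD reads the current map into a local val on the driver each micro-batch, so the batch closure ships the latest copy:
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

object RefData {
  // Placeholder for the real Hive query that builds the Map[Int, String]
  def loadFromHive(): Map[Int, String] = Map(1 -> "placeholder")

  @volatile var mapping: Map[Int, String] = loadFromHive() // initial load (~1 minute)
}

// Refresh on the driver whenever a streaming job starts
ssc.sparkContext.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    RefData.mapping = RefData.loadFromHive()
  }
})

stream.foreachRDD { rdd =>
  val current = RefData.mapping // read once per batch, on the driver
  rdd.map(id => current.getOrElse(id, "unknown")).count() // the closure ships `current`
}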

How to implement SparkContext in Play for Scala

I have the following Play for Scala controller that wraps Spark. At the end of the method I close the context to avoid the problem of having more than one context active in the same JVM:
class Test4 extends Controller {
  def test4 = Action.async { request =>
    val conf = new SparkConf().setAppName("AppTest").setMaster("local[2]")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    val rawData = sc.textFile("c:\\spark\\data.csv")
    val data = rawData.map(line => line.split(',').map(_.toDouble))
    val str = "count: " + data.count()
    sc.stop()
    Future { Ok(str) }
  }
}
The problem that I have is that I don't know how to make this code multi-threaded as two users may access the same controller method at the same time.
UPDATE
What I'm thinking is to have N Scala programs receive messages through JMS (using ActiveMQ). Each Scala program would have a Spark session and receive messages from Play. The Scala programs will process requests sequentially as they read the queues. Does this make sense? Are there any other best practices to integrate Play and Spark?
It's better to just move the Spark context to a new object:
object SparkContext {
  val conf = new SparkConf().setAppName("AppTest").setMaster("local[2]")
    .set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)
}
Otherwise, with your design, a new Spark context is created for every request, and a new JVM is started for each new Spark context.
If we talk about best practices, it's really not a good idea to use Spark inside a Play project. A better way is to create a microservice which hosts the Spark application, and have the Play application call this microservice. That type of architecture is more flexible, scalable and robust.
I don't think it is a good idea to execute Spark jobs from a REST API. If you just want to parallelize in your local JVM, it doesn't make sense to use Spark, since it is designed for distributed computing. It is also not designed to be an operational database, and it won't scale well when you execute several concurrent queries in the same cluster.
Anyway, if you still want to execute concurrent Spark queries from the same JVM, you should probably use client mode to run the queries in an external cluster. It is not possible to launch more than one session per JVM, so I would suggest that you share the session in your service and close it only when you are shutting the service down.
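A minimal sketch of sharing one session for the whole Play application; the SparkService name and its countCsv method are hypothetical, and the lifecycle handling is simplified:
import javax.inject.{Inject, Singleton}
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.SparkSession
import play.api.inject.ApplicationLifecycle

@Singleton
class SparkService @Inject()(lifecycle: ApplicationLifecycle)(implicit ec: ExecutionContext) {

  // One session for the whole application, reused by every request
  val spark: SparkSession = SparkSession.builder()
    .appName("AppTest")
    .master("local[2]")
    .config("spark.executor.memory", "1g")
    .getOrCreate()

  // Stop Spark only when the Play application itself shuts down
  lifecycle.addStopHook(() => Future(spark.stop()))

  def countCsv(path: String): Long =
    spark.read.csv(path).count()
}
A controller would then inject SparkService and wrap calls in Future { ... } as in the original action, without stopping the context per request.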

Spark Streaming: How to change the value of external variables in foreachRDD function?

the code for testing:
object MaxValue extends Serializable {
  var max = 0
}

object Test {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext
    val ssc = new StreamingContext(sc, Seconds(5))
    val seq = Seq("testData")
    val rdd = ssc.sparkContext.parallelize(seq)
    val inputDStream = new ConstantInputDStream(ssc, rdd)
    inputDStream.foreachRDD(rdd => { MaxValue.max = 10 }) // I change MaxValue.max value to 10.
    val map = inputDStream.map(a => MaxValue.max)
    map.print() // Why the result is 0? Why not 10?
    ssc.start()
    ssc.awaitTermination()
  }
}
In this case, how can I change the value of MaxValue.max in foreachRDD()? The result of map.print is 0; why not 10? I want to use RDD.max() in foreachRDD(), so I need to change the value of MaxValue.max in foreachRDD().
Could you help me? Thank you!
This is not possible. Remember, operations inside an RDD method are run distributed, so the change to MaxValue.max will only be executed on the workers, not on the driver. Maybe if you explain what you are trying to do, that can lead to a better solution - using accumulators, perhaps?
In general it is better to avoid trying to accumulate values this way; there are mechanisms such as accumulators or updateStateByKey that would do this properly.
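For the specific goal of using RDD.max() in foreachRDD, one hedged alternative sketch (no shared mutable object involved) is to compute the max on the driver inside foreachRDD and let the closure of the next transformation capture it:
inputDStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val currentMax = rdd.map(_.length).max()    // action: computed on executors, result returned to the driver
    val withMax = rdd.map(s => (s, currentMax)) // currentMax is serialized with this closure
    withMax.collect().foreach(println)
  }
}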
To give a better perspective of what is happening in your code, let's say you have 1 driver and multiple partitions distributed over multiple executors (the most typical scenario).
Runs on the driver
inputDStream.foreachRDD(rdd => { MaxValue.max = 10 })
The block of code within foreachRDD runs on the driver, so it updates the MaxValue object on the driver.
Runs on the executors
val map = inputDStream.map(a => MaxValue.max)
This will run the lambda on each executor individually, and therefore will read the value of MaxValue on the executors (which was never updated there). Also note that each executor has its own copy of the MaxValue object, as each of them lives in a separate JVM process (most often on separate nodes within the cluster, too).
When you change your code to
val map = inputDStream.map(a => {MaxValue.max=10; MaxValue.max})
you are actually updating MaxValue on the executors and then reading it on the executors as well - so it works.
This should work as well:
val map = inputDStream.map(a => {MaxValue.max=10; a}).map(a => MaxValue.max)
However if you do something like:
val map = inputDStream.map(a => { MaxValue.max = new Random().nextInt(10); a }).map(a => MaxValue.max)
you may get a set of records with different integers (each partition will have its own MaxValue).
Unexpected results
Local mode
A good reason to avoid this pattern is that you can get even less predictable results depending on the situation. For example, the original code that returns 0 on a cluster will return 10 in local mode, because in that case the driver and all partitions live in a single JVM process and share this object. So you could even write unit tests against such code and feel safe, but then start getting problems when you deploy to a cluster.
Jobs scheduling order
For this one I'm not 100% sure - I tried to find it in the source code - but there is another problem that might occur. In your code you will have 2 jobs:
One is based on your output from inputDStream.foreachRDD, the other is based on the map.print output. Although they initially use the same stream, Spark will generate two separate DAGs for them and schedule two separate jobs, which Spark can treat totally independently. In fact, it doesn't even have to guarantee the order of execution of the jobs (it obviously does guarantee the order of execution of stages within a job), and if this happens, in theory it can run the 2nd job before the 1st, making the results even less predictable.

Cassandra insert performance using spark-cassandra connector

I am a newbie to Spark and Cassandra. I am trying to insert into a Cassandra table using the spark-cassandra connector, as below:
import java.util.UUID
import org.apache.spark.{SparkContext, SparkConf}
import org.joda.time.DateTime
import com.datastax.spark.connector._

case class TestEntity(id: UUID, category: String, name: String, value: Double, createDate: DateTime, tag: Long)

object SparkConnectorContext {
  val conf = new SparkConf(true).setMaster("local")
    .set("spark.cassandra.connection.host", "192.168.xxx.xxx")
  val sc = new SparkContext(conf)
}

object TestRepo {
  def insertList(list: List[TestEntity]) =
    SparkConnectorContext.sc.parallelize(list).saveToCassandra("testKeySpace", "testColumnFamily")
}

object TestApp extends App {
  val start = System.currentTimeMillis()
  TestRepo.insertList(Utility.generateRandomData())
  val end = System.currentTimeMillis()
  val timeDiff = end - start
  println("Difference (in millis)= " + timeDiff)
}
When I insert using the above method (list with 100 entities), it takes 300-1100 milliseconds.
I tried inserting the same data using the phantom library. It only takes 20-40 milliseconds.
Can anyone tell me why the Spark connector is taking this much time for the insert? Am I doing anything wrong in my code, or is it not advisable to use the spark-cassandra connector for insert operations?
It looks like you are including the parallelize operation in your timing. Also, since you have your Spark worker running on a different machine than Cassandra, the saveToCassandra operation will be a write over the network.
Try configuring your system to run the Spark workers on the Cassandra nodes. Then create the RDD in a separate step and invoke an action like count() on it to load the data into memory. You might also want to persist() or cache() the RDD to make sure it stays in memory for the test.
Then time just the saveToCassandra of that cached RDD.
You might also want to look at the repartitionByCassandraReplica method offered by the Cassandra connector. It partitions the data in the RDD based on which Cassandra node the writes need to go to. That way you exploit data locality and often avoid doing writes and shuffles over the network.
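A hedged sketch of that measurement approach, reusing SparkConnectorContext and TestEntity from the question (the inline test data stands in for the asker's Utility.generateRandomData()):
import java.util.UUID
import org.joda.time.DateTime
import com.datastax.spark.connector._

object TimedInsert extends App {
  // Hypothetical stand-in for Utility.generateRandomData()
  val data = List.fill(100)(TestEntity(UUID.randomUUID(), "cat", "name", 1.0, DateTime.now(), 1L))

  // Build and materialize the RDD first, so parallelize() is not part of the timing
  val rdd = SparkConnectorContext.sc.parallelize(data).cache()
  rdd.count() // forces the RDD into memory

  // Time only the write to Cassandra
  val start = System.currentTimeMillis()
  rdd.saveToCassandra("testKeySpace", "testColumnFamily")
  println(s"saveToCassandra took ${System.currentTimeMillis() - start} ms")
}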
There are some serious problems with your "benchmark":
Your data set is so small that you're measuring mostly just the job setup time. Saving 100 entities should take on the order of single milliseconds on a single node, not seconds. Also, saving 100 entities gives the JVM no chance to compile the code you run into optimized machine code.
You included the Spark context initialization in your measurement. The JVM loads classes lazily, so the code for Spark initialization is really called after the measurement has started. This is an extremely costly element, typically performed only once per whole Spark application, not even once per job.
You're performing the measurement only once per launch. This means you're even incorrectly measuring the Spark context setup and job setup time, because the JVM has to load all the classes for the first time and HotSpot has probably had no chance to kick in.
To summarize, you're very likely measuring mostly class-loading time, which depends on the size and number of classes loaded. Spark is quite a large thing to load, and a few hundred milliseconds are not surprising at all.
To measure insert performance correctly:
use a larger data set,
exclude one-time setup from the measurement, and
do multiple runs sharing the same Spark context, discarding a few initial ones until you reach steady-state performance (see the sketch after this list).
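A minimal sketch of that warm-up loop, assuming the same imports and the cached rdd from the TimedInsert sketch above (timedSave is a hypothetical helper):
// Run the same save repeatedly with the same Spark context; discard warm-up runs
def timedSave(rdd: org.apache.spark.rdd.RDD[TestEntity]): Long = {
  val start = System.currentTimeMillis()
  rdd.saveToCassandra("testKeySpace", "testColumnFamily")
  System.currentTimeMillis() - start
}

val timings = (1 to 20).map(_ => timedSave(rdd))
val steadyState = timings.drop(5) // drop the first runs (class loading, JIT warm-up)
println(s"median steady-state time: ${steadyState.sorted.apply(steadyState.size / 2)} ms")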
BTW, if you enable the debug logging level, the connector logs the insert times for every partition in the executor logs.