How to implement SparkContext in Play for Scala - scala

I have the following Play for Scala controller that wraps Spark. At the end of the method I close the context to avoid the problem of having more than one context active in the same JVM:
class Test4 extends Controller {
  def test4 = Action.async { request =>
    val conf = new SparkConf().setAppName("AppTest").setMaster("local[2]")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    val rawData = sc.textFile("c:\\spark\\data.csv")
    val data = rawData.map(line => line.split(',').map(_.toDouble))
    val str = "count: " + data.count()
    sc.stop()
    Future { Ok(str) }
  }
}
The problem I have is that I don't know how to make this code safe for concurrent use, since two users may call the same controller method at the same time.
UPDATE
What I'm thinking is to have N Scala programs receive messages through JMS (using ActiveMQ). Each Scala program would have a Spark session and receive messages from Play. The Scala programs will process requests sequentially as they read the queues. Does this make sense? Are there any other best practices to integrate Play and Spark?

It's better to move the SparkContext into a separate object:
object SparkContext {
  val conf = new SparkConf().setAppName("AppTest").setMaster("local[2]")
    .set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)
}
Otherwise, with your current design a new SparkContext is created for every request, and you run straight into the restriction that only one SparkContext can be active in the same JVM.
If we talk about best practices, it is really not a good idea to use Spark inside a Play project. A better way is to create a microservice that hosts the Spark application and have the Play application call that microservice; this kind of architecture is more flexible, scalable and robust.
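For illustration, here is a minimal sketch of how the controller from the question could use such a shared context. The SparkHolder name is invented for this example, and the implicit execution context import assumes an older Play version (matching the deprecated Controller used in the question):
import scala.concurrent.Future
import play.api.libs.concurrent.Execution.Implicits.defaultContext // Play's default EC
import play.api.mvc._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical holder: a single SparkContext shared by the whole application.
object SparkHolder {
  lazy val sc: SparkContext = new SparkContext(
    new SparkConf()
      .setAppName("AppTest")
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g"))
}

class Test4 extends Controller {
  def test4 = Action.async { _ =>
    // Run the job off the request thread; submitting jobs to one SparkContext
    // from several threads at once is supported.
    Future {
      val data = SparkHolder.sc
        .textFile("c:\\spark\\data.csv")
        .map(_.split(',').map(_.toDouble))
      Ok("count: " + data.count())
    }
  }
}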

I don't think it is a good idea to execute Spark jobs from a REST API. If you just want to parallelize within your local JVM, it doesn't make sense to use Spark, since it is designed for distributed computing. It is also not designed to be an operational database, and it won't scale well when you execute several concurrent queries on the same cluster.
Anyway, if you still want to execute concurrent Spark queries from the same JVM, you should probably use client mode to run them on an external cluster. It is not possible to launch more than one session per JVM, so I would suggest sharing the session in your service and closing it only when you shut the service down.

Related

Calling a rest service from Spark

I'm trying to figure out the best approach to call a REST endpoint from Spark.
My current approach (solution [1]) looks something like this:
val df = ... // some dataframe
val repartitionedDf = df.repartition(numberPartitions)

// lazy evaluation of the object which creates the connection to REST;
// lazy vals are also initialized once per JVM (executor)
lazy val restEndPoint = new restEndPointCaller()

val enrichedDf = repartitionedDf
  .map(rec => restEndPoint.getResponse(rec)) // calls the REST endpoint for every record
  .toDF
I know I could have used .mapPartitions() instead of .map(), but looking at the DAG, it seems Spark optimizes the repartition -> map into a mapPartitions anyway.
In this second approach (solution [2]), a connection is created once for every partition and reused for all records within the partition.
val newDs = myDs.mapPartitions(partition => {
  val restEndPoint = new restEndPointCaller() // creates one REST connection per partition
  val newPartition = partition.map(record => {
    restEndPoint.getResponse(record)
  }).toList             // consumes the iterator, so the REST calls happen here
  restEndPoint.close()  // close the connection here
  newPartition.iterator // create a new iterator over the results
})
In this third approach (solution [3]), a connection is created once per JVM (executor) reused across all partitions processed by the executor.
lazy val connection = new DbConnection // intended: one db connection per JVM (executor)
val newDs = myDs.mapPartitions(partition => {
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  }).toList             // consumes the iterator, thus calls readMatchingFromDB
  newPartition.iterator // create a new iterator
})
connection.close() // close the db connection here
[a] With Solutions [1] and [3] which are very similar, is my understanding of how lazy val work correct? The intention is to restrict the number of connections to 1 per executor/ JVM and reuse the open connections for processing subsequent requests. Will I be creating 1 connection per JVM or 1 connection per partition?
[b] Are there any other ways by which I can control the number of requests (RPS) we make to the rest endpoint ?
[c] Please let me know if there are better and more efficient ways to do this.
Thanks!
IMO the second solution, with mapPartitions, is better. First, you explicitly state what you expect to achieve: the name of the transformation and the implemented logic make it pretty clear. With the first option you need to be aware of how Apache Spark optimizes the processing. That may be obvious to you right now, but you should also think about the people who will work on your code, or simply about yourself in 6 months, 1 year, 2 years and so forth; they will understand mapPartitions more easily than repartition + map.
Moreover, the optimization of repartition followed by map might change internally (I don't believe it will, but you can still consider it a valid point), and at that moment your job would perform worse.
Finally, with the 2nd solution you avoid a lot of problems you can run into with serialization. In the code you wrote, the driver creates one instance of the endpoint object, serializes it and sends it to the executors. So yes, it may end up being a single instance, but only if it is serializable.
[edit]
Thanks for the clarification. You can achieve what you are looking for in different ways. To have exactly 1 connection per JVM you can use the singleton design pattern. In Scala it is expressed very easily as an object (the first link I found on Google: https://alvinalexander.com/scala/how-to-implement-singleton-pattern-in-scala-with-object).
And that is quite convenient because you don't need to serialize anything: singletons are read directly from the classpath on the executor side. With one you are sure to have exactly one instance of a given object per JVM.
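As a minimal sketch of that singleton approach (RestEndPointCaller here is a stub standing in for the question's REST client class, and myDs is assumed to be the question's Dataset):
// Stub standing in for the question's REST client class.
class RestEndPointCaller {
  def getResponse(record: String): String = s"response-for-$record"
}

// Singleton: one client per JVM (i.e. per executor), read from the classpath
// on the executor side and never serialized from the driver.
object RestClientHolder {
  lazy val client = new RestEndPointCaller()
}

// Usage inside a transformation: every partition handled by the same executor
// reuses RestClientHolder.client.
val enriched = myDs.mapPartitions { partition =>
  partition.map(record => RestClientHolder.client.getResponse(record))
}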
[a] With Solutions [1] and [3] which are very similar, is my
understanding of how lazy val work correct? The intention is to
restrict the number of connections to 1 per executor/ JVM and reuse
the open connections for processing subsequent requests. Will I be
creating 1 connection per JVM or 1 connection per partition?
It'll create 1 connection per partition. You can execute this small test to see that:
class SerializationProblemsTest extends FlatSpec {

  val conf = new SparkConf().setAppName("Spark serialization problems test").setMaster("local")
  val sparkContext = SparkContext.getOrCreate(conf)

  "lazy object" should "be created once per partition" in {
    lazy val restEndpoint = new NotSerializableRest()
    sparkContext.parallelize(0 to 120).repartition(12)
      .mapPartitions(numbers => {
        //val restEndpoint = new NotSerializableRest()
        numbers.map(nr => restEndpoint.enrich(nr))
      })
      .collect()
  }
}

class NotSerializableRest() {
  println("Creating REST instance")
  def enrich(id: Int): String = s"${id}"
}
It should print Creating REST instance 12 times (# of partitions)
[b] Are there ways by which I can control the number of requests (RPS)
we make to the rest endpoint ?
To control the number of requests you can use an approach similar to database connection pools: an HTTP connection pool (one quickly found link: HTTP connection pooling using HttpClient).
But maybe another valid approach would be to process smaller subsets of data? So instead of taking 30000 rows to process, you can split them into smaller micro-batches (if it's a streaming job). It should give your web service a little bit more "rest".
Otherwise you can also try to send bulk requests (Elasticsearch does this to index/delete multiple documents at once: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html). But it's up to the web service to decide whether it allows you to do so.
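Purely as an illustration of the bulk idea (the batch size and the postBatch method of the stub client below are assumptions, not an existing API):
// Stub client exposing a hypothetical bulk endpoint.
class BulkRestClient extends Serializable {
  def postBatch(records: Seq[String]): Seq[String] = records.map(r => s"response-for-$r")
}

val batchSize = 100 // assumed value; tune it to what the web service tolerates

val enrichedBulk = myDs.mapPartitions { partition =>
  val client = new BulkRestClient() // one client per partition, as in solution [2]
  partition
    .grouped(batchSize)                        // group the partition's records into batches
    .flatMap(batch => client.postBatch(batch)) // one HTTP call per batch instead of per record
}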

Is there a hook for Executor Startup in Spark?

So, basically I want multiple tasks running on the same node/executor to read data from a shared memory. For that I need some initialization function that would load the data into the memory before the tasks are started. If Spark provides a hook for an Executor startup, I could put this initialization code in that callback function, with the tasks only running after this startup is completed.
So, my question is: does Spark provide such hooks? If not, how else can I achieve the same thing?
Spark's solution for "shared data" is broadcasting: you load the data once in the driver application, and Spark serializes it and sends it to each of the executors (once). If a task uses that data, Spark will make sure it's there before the task is executed. For example:
object MySparkTransformation {
  def transform(rdd: RDD[String], sc: SparkContext): RDD[Int] = {
    val mySharedData: Map[String, Int] = loadDataOnce()
    val broadcast = sc.broadcast(mySharedData)
    rdd.map(r => broadcast.value(r))
  }
}
Alternatively, if you want to avoid reading the data into driver memory and sending it over to the executors, you can use lazy values in a Scala object to create a value that gets populated once per JVM, which in Spark's case is once per executor. For example:
// must be an object, otherwise it will be serialized and sent from the driver
object MySharedResource {
  lazy val mySharedData: Map[String, Int] = loadDataOnce()
}

// If you use mySharedData in a Spark transformation,
// the "local" copy in each executor will be used:
object MySparkTransformation {
  def transform(rdd: RDD[String]): RDD[Int] = {
    // Spark won't include MySharedResource.mySharedData in the
    // serialized task sent from the driver, since it's "static"
    rdd.map(r => MySharedResource.mySharedData(r))
  }
}
In practice, you'll have one copy of mySharedData in each executor.
You don't have to run multiple instances of the application to be able to run multiple jobs (i.e. you don't need one app instance per Spark job). The same SparkSession object can be used by multiple threads to submit Spark jobs in parallel.
So it may work like this:
The application starts up and runs an initialization function to load the shared data into memory, say into a SharedData class object.
A SparkSession is created.
A thread pool is created; each thread has access to the (SparkSession, SharedData) objects.
Each thread creates Spark jobs using the shared SparkSession and SharedData objects.
Depending on your use case, the application then does one of the following:
waits for all jobs to complete and then closes the SparkSession
waits in a loop for new requests to arrive and creates new Spark jobs as necessary, using threads from the thread pool.
SparkContext (sparkSession.sparkContext) is useful when you want to do per-thread things like assigning a job description using setJobDescription, or assigning a group to the jobs using setJobGroup so related jobs can be cancelled together using cancelJobGroup. You can also tweak the priority of jobs that use the same pool; see https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application for details.
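A minimal sketch of that layout (the SharedData contents, the thread-pool size and the word-count job are placeholders invented for this example):
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.sql.SparkSession

object MultiThreadedJobs extends App {

  // Hypothetical shared data, loaded once before any job runs.
  case class SharedData(stopWords: Set[String])
  val shared = SharedData(Set("a", "an", "the"))

  // One SparkSession for the whole application; it is safe to submit jobs
  // to it from several threads.
  val spark = SparkSession.builder()
    .appName("shared-session-example")
    .master("local[4]")
    .getOrCreate()

  // Thread pool from which Spark jobs are submitted.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  val jobs = (1 to 4).map { i =>
    Future {
      // Per-thread bookkeeping so related jobs can be cancelled together.
      spark.sparkContext.setJobGroup(s"request-$i", s"count for request $i")
      spark.sparkContext
        .parallelize(Seq("the", "quick", "brown", "fox"))
        .filter(w => !shared.stopWords.contains(w))
        .count()
    }
  }

  // Wait for all jobs to finish, then close the session.
  println(Await.result(Future.sequence(jobs), 5.minutes))
  spark.stop()
}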

Network Spark Streaming from multiple remote hosts

I wrote a program for Spark Streaming in Scala. In my program, I passed 'remote-host' and 'remote-port' to socketTextStream.
And on the remote machine, I have a Perl script that calls the system command:
echo 'data_str' | nc <remote_host> 9999
That way, my Spark program is able to get data, but it seems a little confusing, as I have multiple remote machines which need to send data to the Spark machine.
I wanted to know the right way of doing it. In fact, how will I deal with data coming from multiple hosts?
For Reference, My current program:
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("HBaseStream")
  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(2))
  val inputStream = ssc.socketTextStream(<remote-host>, 9999)
  -------------------
  -------------------
  ssc.start()
  // Wait for the computation to terminate
  ssc.awaitTermination()
}
Thanks in advance.
You can find more information from "Level of Parallelism in Data Receiving".
Summary:
Receiving multiple data streams can therefore be achieved by creating
multiple input DStreams and configuring them to receive different
partitions of the data stream from the source(s);
These multiple DStreams can be unioned together to create a single
DStream. Then the transformations that were being applied on a single
input DStream can be applied on the unified stream.
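For this case, a minimal sketch of that approach, with one receiver per remote host (the host names and port are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiHostStream extends App {
  // local[4]: enough cores for the three receivers plus the processing.
  val conf = new SparkConf().setAppName("HBaseStream").setMaster("local[4]")
  val ssc = new StreamingContext(conf, Seconds(2))

  // One socketTextStream (and therefore one receiver) per remote host.
  val hosts = Seq("host1", "host2", "host3")
  val streams = hosts.map(h => ssc.socketTextStream(h, 9999))

  // Union the per-host DStreams into a single DStream and process it as one.
  val unified = ssc.union(streams)
  unified.count().print()

  ssc.start()
  ssc.awaitTermination()
}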

How to use an Akka actor from the Spark nodes of a cluster

Having a Spark cluster running a certain application, I would like to use an Akka actor to stream data from within each of the Spark nodes in the cluster. That is: the nodes are processing data in some way, and in parallel the actor is sending some other data within the node to an external process.
Now, these are the possible options:
Just create the ActorRef through a regular ActorSystem: not possible as the ActorSystem instance is not Serializable and it will fail at runtime
Use the Spark internal ActorSystem to create the actor: not a good option since Spark 1.4, as SparkEnv.get.actorSystem is deprecated
So what is the best way for the Spark nodes to instantiate a given actor if the options above are not valid? Is it possible at all?
This question is somewhat related to this one although formulated on a wider scope
Note: I know I could somehow use Spark streaming for this scenario, but at the moment I would like to explore the feasibility of a pure Akka option
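One pattern consistent with the per-executor singleton idea discussed earlier on this page (this is a hedged sketch, not an answer from the original thread): keep the ActorSystem in an object, so each executor JVM builds its own instance lazily instead of trying to serialize one from the driver.
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical reporter actor that forwards data to an external process.
class ReporterActor extends Actor {
  def receive = { case msg => println(s"reporting: $msg") }
}

// One ActorSystem per executor JVM, created lazily on first use and never serialized.
object ExecutorActors {
  lazy val system = ActorSystem("executor-side-system")
  lazy val reporter = system.actorOf(Props[ReporterActor], "reporter")
}

// Usage inside a transformation: the closure only references the object by name,
// so nothing non-serializable is shipped from the driver, e.g.:
// rdd.foreachPartition(_.foreach(record => ExecutorActors.reporter ! record))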

Cassandra insert performance using spark-cassandra connector

I am a newbie to Spark and Cassandra. I am trying to insert into a Cassandra table using the spark-cassandra connector as follows:
import java.util.UUID

import org.apache.spark.{SparkContext, SparkConf}
import org.joda.time.DateTime
import com.datastax.spark.connector._

case class TestEntity(id: UUID, category: String, name: String, value: Double,
                      createDate: DateTime, tag: Long)

object SparkConnectorContext {
  val conf = new SparkConf(true).setMaster("local")
    .set("spark.cassandra.connection.host", "192.168.xxx.xxx")
  val sc = new SparkContext(conf)
}

object TestRepo {
  def insertList(list: List[TestEntity]) =
    SparkConnectorContext.sc.parallelize(list).saveToCassandra("testKeySpace", "testColumnFamily")
}

object TestApp extends App {
  val start = System.currentTimeMillis()
  TestRepo.insertList(Utility.generateRandomData())
  val end = System.currentTimeMillis()
  val timeDiff = end - start
  println("Difference (in millis)= " + timeDiff)
}
When I insert using the above method (a list with 100 entities), it takes 300-1100 milliseconds.
I tried inserting the same data using the phantom library. That takes only 20-40 milliseconds.
Can anyone tell me why the Spark connector is taking this much time for the insert? Am I doing anything wrong in my code, or is it not advisable to use the spark-cassandra connector for insert operations?
It looks like you are including the parallelize operation in your timing. Also since you have your spark worker running on a different machine than Cassandra, the saveToCassandra operation will be a write over the network.
Try configuring your system to run the spark workers on the Cassandra nodes. Then create an RDD in a separate step and invoke an action like count() on it to load the data into memory. Also you might want to persist() or cache() the RDD to make sure it stays in memory for the test.
Then time just the saveToCassandra of that cached RDD.
You might also want to look at the repartitionByCassandraReplica method offered by the Cassandra connector. That would partition the data in the RDD based on which Cassandra node the writes need to go to. In that way you exploit data locality and often avoid doing writes and shuffles over the network.
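A rough sketch of that idea for the question's setup (this assumes a spark-cassandra-connector version that provides repartitionByCassandraReplica for RDDs of objects mapping to the table's partition key; the exact signature may vary between connector versions):
import com.datastax.spark.connector._

// Repartition so that each Spark partition is handled by the executor co-located
// with the Cassandra replica owning its keys, then write locally.
val entities = SparkConnectorContext.sc.parallelize(Utility.generateRandomData())

val localized = entities.repartitionByCassandraReplica("testKeySpace", "testColumnFamily")
localized.saveToCassandra("testKeySpace", "testColumnFamily")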
There are some serious problems with your "benchmark":
Your data set is so small that you're measuring mostly only the job setup time. Saving 100 entities should be on the order of single milliseconds on a single node, not seconds. Also, saving 100 entities gives the JVM no chance to compile the code you run into optimized machine code.
You included Spark context initialization in your measurement. The JVM loads classes lazily, so the code for Spark initialization is really called after the measurement has started. This is an extremely costly element, typically performed only once per whole Spark application, not even once per job.
You're performing the measurement only once per launch. This means you're even incorrectly measuring Spark context and job setup time, because the JVM has to load all the classes for the first time and HotSpot probably has no chance to kick in.
To summarize, you're very likely measuring mostly class-loading time, which depends on the size and number of classes loaded. Spark is quite a large thing to load, and a few hundred milliseconds are not surprising at all.
To measure insert performance correctly (see the sketch below):
use a larger data set
exclude one-time setup from the measurement
do multiple runs sharing the same Spark context, and discard a few initial ones until you reach steady-state performance.
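A hedged sketch of such a measurement, reusing the question's SparkConnectorContext and Utility helper (the run counts below are arbitrary choices for illustration, not recommendations from the original answer):
import com.datastax.spark.connector._

object InsertBenchmark extends App {
  val sc = SparkConnectorContext.sc // one SparkContext shared by all runs

  val warmupRuns = 3
  val measuredRuns = 10
  val batch = Utility.generateRandomData() // ideally far more than 100 entities

  def timeOneRun(): Long = {
    val rdd = sc.parallelize(batch).cache()
    rdd.count() // materialize the RDD so parallelize is excluded from the timing
    val start = System.nanoTime()
    rdd.saveToCassandra("testKeySpace", "testColumnFamily")
    (System.nanoTime() - start) / 1000000 // millis
  }

  (1 to warmupRuns).foreach(_ => timeOneRun()) // discard warm-up runs (class loading, JIT)
  val timings = (1 to measuredRuns).map(_ => timeOneRun())
  println(s"median insert time: ${timings.sorted.apply(measuredRuns / 2)} ms")
}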
BTW If you enable debug logging level, the connector logs the insert times for every partition in the executor logs.