I am currently trying to track requests per minute in a Spark application so I can use that value in another transformation. However, the code below never yields anything other than the initially set value of 0 when the variable is used in the transformation:
var rpm: Long = 0

val requestsPerMinute = stream.countByWindow(Seconds(60), Seconds(5)).foreachRDD(rdd => {
  rdd.foreach(x => {
    rpm = x
  })
})

stream.foreachRDD { rdd =>
  rdd.foreach(x => {
    // do something including the parameter rpm
  })
}
I assume it has something to do with parallelization. I also tried using an RDD or a Broadcast instead of the plain variable, but that resulted in the code not being executed at all.
What is the recommended way to achieve this in Spark Streaming?
EDIT:
The incoming objects are timestamped, if that helps with anything.
In Spark Streaming, there are two levels of execution:
The scheduling of operations, executed on the driver, and
The distributed computation on RDDs, executed on the cluster.
There are two operations that provide access to both levels: transform and foreachRDD. In both of them we have access to the driver's context and to a reference to an RDD, on which we can apply computations.
In the specific case of the question, to update a local variable, the operation must be executed in the driver's context:
val requestsPerMinute = stream.countByWindow(Seconds(60), Seconds(5))
requestsPerMinute.foreachRDD { rdd =>
  val computedRPM = rdd.collect()(0) // this gets the data locally
  rpm = computedRPM
}
In the original case:
rdd.foreach(x => {
  rpm = x
})
the closure f: Long => Unit = x => rpm = x is serialized and executed on the cluster. The side effects are applied in the remote context and are lost after the operation finishes. At the driver level, the value of the variable never changes.
Also, note that it is not a good idea to rely on side-effecting functions in code that executes remotely.
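To close the loop with the original goal of using the rate in another transformation, here is a minimal sketch, assuming the stream and the driver-side rpm variable from above; the empty-batch guard and the tuple output are my additions. transform runs its function on the driver at every batch interval, so the current value of rpm is read there and shipped with that batch's closure.

import org.apache.spark.streaming.Seconds

var rpm: Long = 0L

// Driver-side update, as in the answer above, but guarded against empty batches
stream.countByWindow(Seconds(60), Seconds(5)).foreachRDD { rdd =>
  rpm = rdd.collect().headOption.getOrElse(0L)
}

// transform's function runs on the driver for every batch, so it sees the latest rpm
val enriched = stream.transform { rdd =>
  val currentRpm = rpm            // read once per batch on the driver
  rdd.map(x => (x, currentRpm))   // currentRpm is serialized with this batch's tasks
}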
Related
I am running a Spark Scala program that performs text scanning on an input file. I am trying to achieve parallelism by using rdd.mapPartitions. Inside the mapPartitions block I perform a few checks and then call the map function to process each element of the partition in parallel. Inside the map function I call a custom method where I perform the scanning and send the results back.
Now, the code works fine when I submit it with --master local[*], but not when I submit it with --master yarn-cluster. It runs without any error, but the call never gets inside the mapPartitions block; I verified this by placing a few println statements.
Please help me with your suggestions.
Here is the sample code:
def main(args: Array[String]) {
  val inputRdd = sc.textFile(inputFile, 2)
  val resultRdd = inputRdd.mapPartitions { iter =>
    println("Inside scanning method..")
    var scanEngine = ScanEngine.getInstance();
    ...
    var mapresult = iter.map { y =>
      line = y
      val last = line.lastIndexOf("|");
      message = line.substring(last + 1, line.length());
      getResponse(message)
    }
  }
  val finalRdd = sc.parallelize(resultRdd.map(x => x.trim()))
  finalRdd.coalesce(1, true).saveAsTextFile(hdfsOutpath)
}

def getResponse(input: String): String = {
  var result = "";
  val rList = new ListBuffer[String]();
  try {
    // logic here
  }
  return result;
}
def getResponse(input: String): String = {
var result = "";
val rList = new ListBuffer[String]();
try {
//logic here
}
return result;
}
If your evidence of it working is seeing "Inside scanning method.." printed out, it won't show up when run on the cluster, because that code is executed by the workers, not the driver.
You're going to have to go over the code in forensic detail, with an open mind, and try to find out why the job has no output. Usually when a job works in local mode but not on a cluster, it is because of some subtlety in where the code is executed or where output is recorded.
There's too much clipped code to provide a more specific answer.
Spark achieves parallelism using the map function as well as mapPartitions. The number of partitions determines the amount of parallelism, but each partition will execute independently whether or not you use the mapPartitions function.
There are only a few reasons to use mapPartitions over map, e.g. when a function has a high initialization cost but can then be called many times, such as doing some NLP task on text.
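To illustrate that trade-off, a small sketch; ExpensiveParser is a made-up stand-in for any costly-to-construct resource, and inputRdd is assumed to be an RDD[String] like the one in the question:

class ExpensiveParser {
  // imagine loading a large NLP model or similar here
  def parse(line: String): String = line.toLowerCase
}

val parsed = inputRdd.mapPartitions { iter =>
  val parser = new ExpensiveParser()  // constructed once per partition
  iter.map(parser.parse)              // reused for every line in that partition
}

With plain map, the same parser would have to be constructed (or serialized and shipped) for each element instead of once per partition.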
I have a Weka model stored in S3 which is around 400 MB in size.
Now, I have a set of records on which I want to run the model and perform prediction.
To perform prediction, here is what I have tried:
Download and load the model on the driver as a static object, broadcast it to all executors, and perform a map operation on the prediction RDD.
----> Not working: in Weka the model object needs to be modified during prediction, and a broadcast variable is a read-only copy.
Download and load the model on the driver as a static object and send it to the executors in each map operation.
-----> Working, but not efficient, as each map operation passes the 400 MB object around.
Download the model on the driver, load it on each executor, and cache it there. (I don't know how to do that.)
Does someone have any idea how I can load the model on each executor once and cache it, so that it isn't loaded again for other records?
You have two options:
1. Create a singleton object with a lazy val representing the data:
object WekaModel {
  lazy val data = {
    // initialize data here. This will only happen once per JVM process
  }
}
Then, you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes its own instance of the data. No serialization or broadcast is performed for data.
elementsRDD.map { element =>
  // use WekaModel.data here
}
Advantages
More efficient, as it allows you to initialize your data once per JVM instance. This approach is a good choice when you need to initialize a database connection pool, for example.
Disadvantages
Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.
You can't really free up or release the object if you need to. Usually, that's acceptable, since the OS will free up the resources when the process exits.
2. Use the mapPartitions (or foreachPartition) method on the RDD instead of just map.
This allows you to initialize whatever you need for the entire partition.
elementsRDD.mapPartitions { elements =>
  val model = new WekaModel()
  elements.map { element =>
    // use model and element. There is a single instance of model per partition.
  }
}
Advantages:
Provides more flexibility in the initialization and deinitialization of objects.
Disadvantages
Each partition will create and initialize a new instance of your object. Depending on how many partitions you have per JVM instance, it may or may not be an issue.
Here's what worked for me even better than the lazy initializer. I created an object-level pointer initialized to null and let each executor initialize it. In the initialization block you can have run-once code. Note that each processing batch will reset local variables, but not the object-level ones.
object Thing1 {
  var bigObject: BigObject = null

  def main(args: Array[String]): Unit = {
    val sc = <spark/scala magic here>
    sc.textFile(infile).map(line => {
      if (bigObject == null) {
        // this takes a minute but runs just once
        bigObject = new BigObject(parameters)
      }
      bigObject.transform(line)
    })
  }
}
This approach creates exactly one big object per executor, rather than the one big object per partition of other approaches.
If you put the var bigObject: BigObject = null inside the main function instead, it behaves differently: the bigObject constructor then runs at the beginning of each partition (i.e. each batch). If you have a memory leak, this will eventually kill the executor, and garbage collection also has to do more work.
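For contrast, a sketch of the placement just described, reusing the hypothetical names from the snippet above: with the var inside main, it is captured by the task closure, so each task starts from null and pays the construction cost again.

def main(args: Array[String]): Unit = {
  // declared in the method body: serialized with the closure, one copy per task
  var bigObject: BigObject = null
  sc.textFile(infile).map(line => {
    if (bigObject == null) {
      bigObject = new BigObject(parameters) // runs at the start of each partition, not once per executor
    }
    bigObject.transform(line)
  })
}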
Here is what we usually do:
Define a singleton client that does this kind of work, so that only one client is present in each executor.
Have a getOrCreate method that creates or fetches the client. If you have a common serving platform that serves several different models, you can use something like a ConcurrentMap with computeIfAbsent to stay thread-safe (see the sketch below).
Call the getOrCreate method inside executor-level operations such as transform or foreachPartition, so that initialization happens on the executors.
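A minimal sketch of that pattern; ModelClient and the model-name key are hypothetical placeholders, the point is the ConcurrentHashMap plus computeIfAbsent giving one thread-safe instance per key per executor JVM:

import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Hypothetical heavy client/model; stands in for whatever needs to exist once per executor
class ModelClient(modelName: String) {
  def predict(record: String): String = s"$modelName -> $record" // placeholder logic
}

object ModelClients {
  private val clients = new ConcurrentHashMap[String, ModelClient]()

  // Safe to call from executor-side code (foreachPartition, transform, ...):
  // computeIfAbsent builds the client at most once per key per JVM
  def getOrCreate(modelName: String): ModelClient =
    clients.computeIfAbsent(modelName, new JFunction[String, ModelClient] {
      override def apply(name: String): ModelClient = new ModelClient(name)
    })
}

// usage from an executor-level operation:
// rdd.foreachPartition { records =>
//   val client = ModelClients.getOrCreate("modelA")
//   records.foreach(r => client.predict(r))
// }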
You can achieve this by broadcasting a case object with a lazy val as follows:
case object localSlowTwo {lazy val value: Int = {Thread.sleep(1000); 2}}
val broadcastSlowTwo = sc.broadcast(localSlowTwo)
(1 to 1000).toDS.repartition(100).map(_ * broadcastSlowTwo.value.value).collect
In the event timeline for this on three executors with three threads each, the one-second initialization shows up only during the first run; running the last line again from the same spark-shell session does not trigger any further initialization.
This works for me, and it is thread-safe if you use a singleton and synchronized as shown below:
object singletonObj {
  var data: dataObj = null

  def getDataObj(): dataObj = this.synchronized {
    if (this.data == null) {
      this.data = new dataObj()
    }
    this.data
  }
}

object app {
  def main(args: Array[String]): Unit = {
    lazy val mydata: dataObj = singletonObj.getDataObj()
    df.map(x => { functionA(mydata) })
  }
}
In this case:
val dStream : Stream[_] =
dStream.foreachRDD(a => ... )
dStream.foreachRDD(b => ... )
Do the foreach methods:
run in parallel?
run in sequence, but without a specific order?
run the foreachRDD(a => ...) before the foreachRDD(b => ...)?
I want to know because I want to commit the Kafka offset after a database insert. (And the DB connector only gives a "foreach" insert.)
val dStream : Stream[_] = ...().cache()
dStream.toDb // consume the stream
dStream.foreachRDD(b => ... ) // commit offset: consume the stream, but after the db insert
In the Spark UI it looks like there is an order, but I'm not sure it's reliable.
Edit: if foreachRDD(a => ...) fails, is foreachRDD(b => ...) still executed?
DStream.foreach is deprecated since Spark 0.9.0. You want the equivalent DStream.foreachRDD to begin with.
Stages in the Spark DAG are executed sequentially, as one transformation's output is usually also the input for the next transformation in the graph, but this isn't the case in your example.
What happens is that internally the RDD is divided into partitions. Each partition is run on a different worker which is available to the cluster manager. In your example, DStream.foreach(a => ...) will execute before DStream.foreach(b => ...), but the execution within each foreach runs in parallel across the partitions of the underlying RDD.
I want to know because I want to commit the Kafka offset after a database insert.
DStream.foreachRDD is an output operation, meaning it causes Spark to materialize the graph and begin execution. You can safely assume that the insert into the database will finish before your second foreach executes, but keep in mind that your first foreach updates the database in parallel, for each partition of the RDD.
Multiple DStream.foreachRDD calls are not guaranteed to execute sequentially, at least up to spark-streaming 2.4.0. Look at this code in the JobScheduler class:
class JobScheduler(val ssc: StreamingContext) extends Logging {

  // Use of ConcurrentHashMap.keySet later causes an odd runtime problem due to Java 7/8 diff
  // https://gist.github.com/AlainODea/1375759b8720a3f9f094
  private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]

  private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
  private val jobExecutor =
    ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
The jobExecutor is a thread pool, and if "spark.streaming.concurrentJobs" is set to a number greater than 1, there can be parallel execution if enough Spark executors are available. So make sure your settings are correct to get the behavior you need.
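For completeness, the property is a plain Spark conf setting; a minimal sketch of where it goes (the app name and batch interval are arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-app")
  // 1 (the default) keeps the streaming jobs of a batch strictly sequential;
  // values > 1 allow them to be scheduled concurrently
  .set("spark.streaming.concurrentJobs", "1")

val ssc = new StreamingContext(conf, Seconds(5))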
Using spray 1.3.2 with Akka 2.3.6 (Akka is used only for spray).
I need to read huge files and, for each line, make an HTTP request.
I read the files line by line with an iterator, and for each item I make the request.
It runs successfully for some of the lines, but at some point it starts to fail with:
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://default/user/IO-HTTP#-35162984]] after [60000 ms].
I first thought I was overloading the service, so I set "spray.can.host-connector.max-connections" to 1. It ran much slower, but I got the same errors.
Here is the code:
import spray.http.MediaTypes._

val EdnType = register(
  MediaType.custom(
    mainType = "application",
    subType = "edn",
    compressible = true,
    binary = false,
    fileExtensions = Seq("edn")))

val pipeline = (
  addHeader("Accept", "application/json")
    ~> sendReceive
    ~> unmarshal[PipelineResponse])

def postData(data: String) = {
  val request = Post(pipelineUrl).withEntity(HttpEntity.apply(EdnType, data))
  val responseFuture: Future[PipelineResponse] = pipeline(request)
  responseFuture
}

dataLines.map { d =>
  val f = postData(d)
  f.onFailure { case e => println("Error - " + e) } // This is where the errors are displayed
  f.map { p => someMoreLogic(d, p) }
}

aggrigateResults(dataLines)
I do it this way since I don't need the entire data, just some aggregations.
How can I solve this and keep it entirely async?
Akka ask timeout is implemented via firstCompletedOf, so the timer starts when the ask is initialized.
What you seem to be doing is spawning a Future for each line (during the map), so all your calls execute at nearly the same time. The timeouts start counting when the futures are initialized, but there are no executor threads left for all the spawned actors to do their work. Hence the asks time out.
Instead of processing everything "all at once", I would suggest a more flexible approach, somewhat similar to using iteratees or akka-streams: the Work Pulling Pattern (GitHub).
You provide the iterator that you already have as an Epic and introduce a Worker actor that performs the call and some logic. If you spawn N workers, there will be at most N lines being processed concurrently (and the processing pipeline may involve multiple steps). This way you can ensure that you are not overloading the executors, and the timeouts shouldn't happen.
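If pulling in that pattern is too heavy, a rough stand-in that bounds the number of in-flight requests is to process the line iterator in fixed-size groups. This sketch reuses postData and dataLines from the question; the parallelism value and the blocking Await are my simplifications (the Work Pulling Pattern avoids the blocking wait and stays fully async):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val parallelism = 8 // arbitrary cap on concurrent requests

val results: Iterator[PipelineResponse] =
  dataLines.grouped(parallelism).flatMap { batch =>
    val inFlight = Future.sequence(batch.map(postData)) // at most `parallelism` asks at once
    Await.result(inFlight, 5.minutes)                   // finish this group before starting the next
  }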
I'm not quite sure how Scala and Spark work; maybe I wrote the code the wrong way.
What I want to achieve is, for a given Seq[(String, Int)], to assign a random item from v._2.path to each _._2.
To do that, I implemented a method and call this method on the next line:
def getVerticesWithFeatureSeq(graph: Graph[WikiVertex, WikiEdge.Value]): RDD[(VertexId, WikiVertex)] = {
  graph.vertices.map(v => {
    // For each token in the sequence, assign an article to them based on its path (root to this node)
    println(v._1 + " before " + v._2.featureSequence)
    v._2.featureSequence = v._2.featureSequence.map(f => (f._1, v._2.path.apply(new scala.util.Random().nextInt(v._2.path.size))))
    println(v._1 + " after " + v._2.featureSequence)
    (v._1, v._2)
  })
}
val dt = getVerticesWithFeatureSeq(wikiGraph)
When I execute it, I expect the printlns to print something, but they don't.
If I add another line of code
dt.foreach(println)
the println inside the map prints correctly.
Is there some laziness in Spark's code execution? As in, if nothing accesses a variable, is the computation deferred or even skipped?
Is graph.vertices an RDD? That would explain your issue, since Spark transformations are lazy and only run when an action is executed, foreach in your case:
val dt = getVerticesWithFeatureSeq(wikiGraph) //no result is computed yet, map transformation is 'recorded'
dt.foreach(println) //foreach action requires a result, this triggers the computation
RDDs remember the transformations applied to them and are only computed when an action requires a result to be returned to the driver program.
You can check http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations for further details and a list of available transformations and actions.
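A tiny standalone illustration of the same point (the names are made up for the example):

val numbers = sc.parallelize(1 to 5)

// Transformation only: nothing is printed yet, Spark merely records the lineage
val noisy = numbers.map { n => println(s"mapping $n"); n * 2 }

// The action forces evaluation; the printlns now run on the executors,
// so on a cluster they show up in the executor logs, not the driver console
noisy.count()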