I'm developing an Apache Spark job that I plan to deploy as one stage in an AWS Step Function. Unfortunately the particular way that I wish to deploy it isn't directly supported by Step Functions at this time; however, Step Functions has an API for a generic Task that I can make use of. Essentially, once the task is started, it needs to periodically make a call to sendTaskHeartbeat, and then on completion it needs to call sendTaskSuccess.
My Spark job is written in Scala, and I'm wondering what the best approach is for running something on a timer within the context of an Apache Spark job. I see from other answers that I could make use of java.util.concurrent or perhaps java.util.Timer, but I'm not sure how the threading would work specifically in a Spark context. Since Spark is already doing a lot to distribute my code across each node, I'm not sure if there are some hidden considerations I need to be wary of (i.e. I don't really want more than one instance of my timer, I want to make sure it stops when the sparky parts of my code complete, etc.).
Is it safe to use a regular Timer in a Spark job? If I did something like this:
val timer = new Timer()
val task = new TimerTask {
  override def run(): Unit = { /* sendTaskHeartbeat */ }
}
timer.schedule(task, 1000L, 1000L)
val myRDD = spark.read.parquet(pathToParquetFiles)
val transformedRDD = myRDD.map( /* business logic */ )
transformedRDD.saveAsHadoopDataset(config) andThen task.cancel
Would that be sufficient? Or is there a risk that this code would lose track of the task and timer objects by the time it reaches the andThen, due to the distribution across nodes?
I believe your implementation is sufficient. The timer task will only run on the driver node (as long as you do not reference it inside the RDD transformations).
The only thing you need to be careful about is error handling. Make sure the timer task gets terminated when the transformation throws an error; otherwise your job could get stuck because the timer thread is still alive.
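For example, wrapping the Spark action so the timer is cancelled either way (a minimal sketch; sendTaskHeartbeat, sendTaskSuccess, and sendTaskFailure are hypothetical wrappers around the Step Functions client):
import java.util.{Timer, TimerTask}

val timer = new Timer(/* isDaemon = */ true)
val task = new TimerTask {
  override def run(): Unit = sendTaskHeartbeat() // hypothetical wrapper
}
timer.schedule(task, 1000L, 1000L)
try {
  transformedRDD.saveAsHadoopDataset(config)
  sendTaskSuccess() // hypothetical wrapper
} catch {
  case e: Exception =>
    sendTaskFailure(e) // hypothetical: report the failure instead of hanging
    throw e
} finally {
  timer.cancel() // runs whether the job succeeded or failed
}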
I ended up making use of a combination of a java.util.Timer and a SparkListener. I instantiate the Timer on the onJobStart event (and only once, so if (TIMER == null) { /* instantiate */ }, because the onJobStart event seemingly can fire multiple times). And then I handle the completion activity on the onApplicationEnd event (which does only fire once). The reason I didn't use onApplicationStart was because it seemed like by the time I hooked in my listener to the Spark context, this event had already fired.
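Roughly, the listener looks like this (a sketch of the approach described above; sendTaskHeartbeat and sendTaskSuccess stand in for the actual Step Functions calls):
import java.util.{Timer, TimerTask}
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerJobStart}

class HeartbeatListener extends SparkListener {
  @volatile private var timer: Timer = null

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = synchronized {
    if (timer == null) { // onJobStart can fire more than once
      timer = new Timer(/* isDaemon = */ true)
      timer.schedule(new TimerTask {
        override def run(): Unit = sendTaskHeartbeat() // hypothetical wrapper
      }, 1000L, 1000L)
    }
  }

  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    if (timer != null) timer.cancel()
    sendTaskSuccess() // onApplicationEnd fires only once
  }
}

// registered once, before any jobs run:
// spark.sparkContext.addSparkListener(new HeartbeatListener)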
Related
I'm approaching Lagom + CQRS/Event Sourcing for the first time and I would like to implement a behaviour like:
A service call is executed (for example via a REST API call)
A command is run and triggers an event that mutates the state (for example starts some kind of timer).
After a pre-defined interval, the timer should expire, so a new event should be triggered (without other external commands) to mutate the state in order to invalidate the timer.
The first two steps are straightforward, but once I trigger the TimerStartedEvent and mutate the state, how do I "schedule" an event after a fixed amount of time? How do I implement the third step?
I found a possible implementation (actually in the online-auction-scala sample code itself).
Because Lagom is built on top of Akka, it injects the ActorSystem, so you can use the system.scheduler.schedule call to schedule something in the future.
To answer the CQRS part of the question, the sample code does something like:
system.scheduler.schedule(offset, delay) {
  checkFinishBidding()
}

def checkFinishBidding(): Unit = {
  registry.refFor[Entity](id).ask(SomeCommand)
}
So when the time fires, you can dequeue the entity ref from the registry and run a command just like a normal service call.
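Fleshed out with imports and durations, the wiring could look roughly like this (AuctionEntity, FinishBidding, and the entity id are placeholders for your own types):
import akka.actor.ActorSystem
import com.lightbend.lagom.scaladsl.persistence.PersistentEntityRegistry
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

class BiddingScheduler(system: ActorSystem, registry: PersistentEntityRegistry)
                      (implicit ec: ExecutionContext) {

  // check every 15 seconds, starting 10 seconds from now
  system.scheduler.schedule(10.seconds, 15.seconds) {
    checkFinishBidding("some-entity-id")
  }

  private def checkFinishBidding(id: String): Unit =
    registry.refFor[AuctionEntity](id).ask(FinishBidding)
}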
I need a Scalaz Task (or some wrapper) that is already running and can return a value immediately if it is complete, or after some waiting if it is not. In terms of Future I could do it like this:
val f = myTask.get.started
This way I have a Future running asynchronously, which on f.run returns the result immediately when called after the computation is complete, or blocks for some time and waits for completion if it is not. However, this way I lose error handling.
How can I have a Task, rather than a Future, that is already running asynchronously before run or runAsync is called on it?
The intention of scalaz.Task is clear control over the execution, which makes referential transparency possible. If you want to fork off the Task, use:
val result = Task.fork(myTask)
and the task will run in its own threadpool as soon as you run it with one of the unsafe* methods.
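For example (a minimal sketch, assuming scalaz 7.2; expensiveComputation is a placeholder):
import scalaz.concurrent.Task

val myTask: Task[Int] = Task.delay(expensiveComputation())

// fork shifts the work onto a separate thread pool...
val forked: Task[Int] = Task.fork(myTask)

// ...but nothing executes until one of the unsafe* methods is invoked,
// and unlike the Future version, error handling is preserved:
forked.unsafePerformAsync {
  case scalaz.\/-(result) => println(s"done: $result")
  case scalaz.-\/(error)  => println(s"failed: $error")
}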
New to Akka and Actors - I need to start a few actors that will basically spend their lives reading from Kafka topics and writing to an Ignite cache. I configured my dispatcher like this:
kafka-dispatcher {
  executor = "thread-pool-executor"
  type = PinnedDispatcher
}
My actors are created with .withDispatcher("kafka-dispatcher"), and my assumption is that each actor will be assigned an individual thread.
These actors basically spend their lives like this:
override def receive: Receive = LoggingReceive {
  case InitWorker => {
    initialize()
    pollTopic() // This never returns
  }
}
In other words, they receive an initialization message and then call the pollTopic() method, which never returns - it runs a loop reading (which will block until there is data) and then writing the data.
My questions:
Is this kosher?
Is there a better, i.e. more idiomatic, way to do this? Note that the read call inside pollTopic() blocks.
Answering your point 2, and going by the description of what you're trying to do, you may want to consider using Akka Streams with the reactive-kafka library. Akka Streams uses actors under the hood but manages all of this for you, so you can focus on implementing small, reusable components that do exactly one thing.
You will then be able to write data processing pipelines, using Kafka as a Source for your data flow. I don't know much about the Ignite cache, but chances are you will either write a Sink for it or, if you're talking about a blocking API, mapAsync will be your friend.
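A rough sketch of such a pipeline with akka-stream-kafka (the current incarnation of reactive-kafka); writeToIgnite is a hypothetical wrapper around your cache write:
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.Future

implicit val system = ActorSystem("kafka-to-ignite")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher

val consumerSettings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("ignite-writers")

// hypothetical: wrap the (blocking) Ignite write in a Future so that
// mapAsync can bound the number of in-flight writes
def writeToIgnite(value: String): Future[Unit] =
  Future { /* igniteCache.put(key, value) */ }

Consumer.plainSource(consumerSettings, Subscriptions.topics("my-topic"))
  .mapAsync(parallelism = 4)(record => writeToIgnite(record.value))
  .runWith(Sink.ignore)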
I have a Scala application that relies on an external RESTful webservice for some part of its functionality. We'd like to do some performance tests on the application, so we stub out the webservice with an internal class that fakes the response.
One thing we would like to keep in order to make the performance test as realistic as possible is the network lag and response time from the remote host. This is between 50 and 500 msec (we measured).
Our first attempt was to simply do a Thread.sleep(random.nextInt(450) + 50), however I don't think that's accurate - we use NIO, which is non-blocking, and Thread.sleep is blocking and locks up the whole thread.
Is there a (relatively easy / short) way to stub a method that contacts an external resource, then returns and calls a callback object when ready? The bit of code we would like to replace with a stub implementation is as follows (using Sonatype's AsyncHttpClient), where we wrap its completion handler object in one of our own that does some processing:
def getActualTravelPlan(trip: Trip, completionHandler: AsyncRequestCompletionHandler) {
  val client = clientFactory.getHttpClient
  val handler = new TravelPlanCompletionHandler(completionHandler)

  // non-blocking call here.
  client.prepareGet(buildApiURL(trip)).setRealm(realm).execute(handler)
}
Our current implementation does a Thread.sleep in the method, but that's, like I said, blocking and thus wrong.
Use a ScheduledExecutorService. It will allow you to schedule things to run at some time in the future. Executors has factory methods for creating them fairly simply.
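A minimal sketch of such a stub, mirroring the method above (onCompleted and fakeResponse are hypothetical names for your handler callback and canned response):
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.Random

// one shared scheduler thread is enough to fire the delayed callbacks
val scheduler = Executors.newSingleThreadScheduledExecutor()

def getStubbedTravelPlan(trip: Trip, completionHandler: AsyncRequestCompletionHandler) {
  val delay = Random.nextInt(450) + 50 // the measured 50-500 ms network lag
  scheduler.schedule(new Runnable {
    // hypothetical callback and canned response
    override def run(): Unit = completionHandler.onCompleted(fakeResponse(trip))
  }, delay.toLong, TimeUnit.MILLISECONDS)
}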
Futures are very convenient, but in practice, you may need some guarantees on their execution. For example, consider:
import scala.actors.Futures._

def slowFn(time: Int) = {
  Thread.sleep(time * 1000)
  println("%d second fn done".format(time))
}

val fs = List(future(slowFn(2)), future(slowFn(10)))
awaitAll(5000, fs: _*)
println("5 second expiration. Continuing.")
Thread.sleep(12000) // ie more calculations
println("done with everything")
The idea is to kick off some slow running functions in parallel. But we wouldn't want to hang forever if the functions executed by the futures don't return. So we use awaitAll() to put a timeout on the futures. But if you run the code, you see that the 5 second timer expires, but the 10 second future continues to run and returns later. The timeout doesn't kill the future; it just limits the join wait.
So how do you kill a future after a timeout period? It seems like futures can't be used in practice unless you're certain that they will return in a known amount of time. Otherwise, you run the risk of losing threads in the thread pool to non-terminating futures until there are none left.
So the questions are: How do you kill futures? What are the intended usage patterns for futures given these risks?
Futures are intended to be used in settings where you do need to wait for the computation to complete, no matter what. That's why they are described as being used for slow running functions. You want that function's result, but you have other stuff you can be doing meanwhile. In fact, you might have many futures, all independent of each other that you may want to run in parallel, while you wait until all complete.
The timeout just provides a way to get partial results.
I think the reason Future can't simply be "killed" is exactly the same as why java.lang.Thread.stop() is deprecated.
While a Future is running, a Thread is required. To stop a Future without calling stop() on the executing Thread, application-specific logic is needed: periodically checking an application-specific flag or the interrupted status of the executing Thread is one way to do it.
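For illustration, cooperative cancellation with the modern scala.concurrent.Future could look like this (the flag check is the application-specific logic mentioned above):
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

val cancelled = new AtomicBoolean(false)

val work = Future {
  var i = 0
  // the future stops itself when it notices the flag or an interrupt
  while (i < 1000000 && !cancelled.get && !Thread.currentThread().isInterrupted) {
    // ... one unit of work ...
    i += 1
  }
}

// later, from another thread (e.g. after a timeout expires):
cancelled.set(true)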