How to create ScalaZ Task which is running asynchronously right after creation? - scala

I need Scalaz Task (or some wrapper) which is already running, and can return value immediately if it is completed, or after some waiting if it is not. In terms of Future I could do it like this:
val f = myTask.get.started
This way I have Future running asynchronously, which on f.run returns result immediately when called after the computation is complete, or blocks for some time and waits for completion if it is not. However, this way I loose error handling.
How to have Task and not use Future, but still have it already running asynchronously before run, or runAsync is called on it?

The intention of scalaz.Task is clear control over the execution, which makes referential transparency possible. If you want to fork off the Task, use:
val result = Task.fork(myTask)
and the task will run in its own threadpool as soon as you run it with one of the unsafe* methods.

Related

Asynchronous vs Synchronous Webservice Calls

I would like to clear up what it means to be a asynchronous/synchronous and blocking/non-blocking operations in Scala Play using Scala's Future.
So we can have:
Asynchronous and non-blocking
Asynchronous and blocking
Synchronous and non-blocking
Synchronous and blocking
In the Play documentation it says that all operations are asynchronous. But I would like to know how the following would be classed:
If I call an external web service and I have to wait for the response then is this considered to be a asynchronous and blocking operation? For example:
def externalWebCall(url : String): Future[SomeData]
def storeData(data : String) : Future[Unit]
for {
data <- externalWebCall(url)
_ <- storeData(data)
} yield(data)
In this example, I don't want storeData to execute until the externalWebCall service has completed. Is the above code the right way to achieve this? If so I think this is asynchronous and blocking.
The initial request that comes to Play service is treated in an asynchronous and non-blocking way according to the documentation. My understanding of this is that when the Play server receives the request it will not block the main thread. It will then service the request and then make a call to the callback of this request and then return the result. Is that correct?
When people use the term "asynchronous", they usually mean "non-blocking", in the sense that the current execution thread will not stop and wait for the result. In general, synchronous means blocking, asynchronous means non-blocking.
It seems you are confusing "blocking" with "ordering the execution". Your example with the Futures is completely asynchronous, and it equivalent to
val result: Future[Unit] = externalWebCall(url).flatMap(storeData)
storeData can only execute when externalWebCall is done, but any code that would come after val result = ... could also execute before either externalWebCall or storeData. The execution of these methods is wrapped in a Future and run in a different thread, so you have no way of knowing when it will happen. That is what asynchronous means.
The terms blocking and asynchronous have different meanings, but are often confused or used interchangeably.
blocking means that the current thread waits for another thread to do something before executing further instructions. The current thread is blocked until the other thread completes and then it can continue.
asynchronous means that an operation happens on another thread while the current thread continues to execute. The current thread initiates an operation but then continues to execute. When the asynchronous operation completes it notifies the current thread by raising an event or calling some event handler function.
Blocking functions look like normal functions and there is not necessarily any indication that the code being called is blocking or non blocking. So a println statement may be blocking for stderr but non-blocking for stdout.
Asynchronous functions will always have a mechanism for being notified of completion of the operation and this will be visible in the interface. In Scala this usually means returning Future[T] rather than T.
In your example, both externalWebCall and storeData are asynchronous because each method returns Future.
storeData takes a data argument that is only available when externalWebCall completes, so these two operations must by definition operate sequentially.

How can my Apache Spark code emit periodic heartbeats?

I'm developing an Apache Spark job to run and I plan to deploy it as one stage in an AWS Step Function. Unfortunately the particular way that I wish to deploy it isn't directly supported by Step Functions at this time; however, Step Functions has an API for a generic Task that I can make use of. Essentially, once the task is started, it needs to periodically make a call to sendTaskHeartbeat and then on completion it needs to call sendTaskSuccess.
My Spark job is written in Scala, and I'm wondering what the best approach for running something on a timer is within the context of an Apache Spark job. I see from other answers that I could make use of java.util.concurrent or perhaps java.util.Timer, but I'm not sure how the threading would work specifically in a Spark context. Since Spark is already doing a lot to distribute my code across each node I'm not sure if there are some hidden considerations I need to be weary of (i.e. I don't really want more than one instance of my timer, I want to make sure it stops when the sparky parts of my code complete, etc.
Is it safe to use a regular Timer in a Spark job? If I did something like this:
val timer = new Timer()
val task = new TimerTask {
/* sendTaskHeartbeat */
}
timer.schedule(task, 1000L, 1000L)
val myRDD = spark.read.parquet(pathToParquetFiles)
val transformedRDD = myRDD.map( /* business logic */ )
transformedRDD.saveAsHadoopDataset(config) andThen task.cancel
Would that be sufficient? Or is there a risk that this code would lose track of the task and timer objects by the time it reaches the andThen, due to the distribution across nodes?
I believe your implement is sufficient. The timer task will only run in the driver node. (as long as you do not include them in the RDD transformation)
Only thing need to be careful is error handling. Make sure the timer task getting terminated when the transformation throws an error. otherwise your job could stuck because of timer thread is still alive.
I ended up making use of a combination of a java.util.Timer and a SparkListener. I instantiate the Timer on the onJobStart event (and only once, so if (TIMER == null) { /* instantiate */ }, because the onJobStart event seemingly can fire multiple times). And then I handle the completion activity on the onApplicationEnd event (which does only fire once). The reason I didn't use onApplicationStart was because it seemed like by the time I hooked in my listener to the Spark context, this event had already fired.

Execution context without daemon threads for futures

I am having trouble with the JVM immediately exiting using various new applications I wrote which spawn threads through the Scala 2.10 Futures + Promises framework.
It seems that at least with the default execution context, even if I'm using blocking, e.g.
future { blocking { /* work */ }}
no non-daemon thread is launched, and therefore the JVM thinks it can immediately quit.
A stupid work around is to launch a dummy Thread instance which is just waiting, but then I also need to make sure that this thread stops when the processes are done.
So how to I enforce them to run on non-daemon threads?
In looking at the default ExecutionContext attached to ExecutionContext.global, it's of the fork join variety and the Threadfactory it uses sets the threads to daemon. If you want to work around this, you could use a different ExecutionContext, one you set up yourself. If you still want the FJP variety (and you probably do as it scales the best), you should be able to look at what they are doing in ExecutionContextImpl via this link and create something similar. Or just use a cached thread pool via Executors.newCachedThreadPool as that won't shut down immediately before your futures complete.
spawn processes
If this means processes and not just tasks, then scala.sys.process spawns non-daemon threads to run OS processes.
Otherwise, if you're creating a bunch of tasks, this is what Future.sequence helps with. Then just Await ready (Future sequence List(futures)) on the main thread.

In Scala, does Futures.awaitAll terminate the thread on timeout?

So I'm writing a mini timeout library in scala, it looks very similar to the code here: How do I get hold of exceptions thrown in a Scala Future?
The function I execute is either going to complete successfully, or block forever, so I need to make sure that on a timeout the executing thread is cancelled.
Thus my question is: On a timeout, does awaitAll terminate the underlying actor, or just let it keep running forever?
One alternative that I'm considering is to use the java Future library to do this as there is an explicit cancel() method one can call.
[Disclaimer - I'm new to Scala actors myself]
As I read it, scala.actors.Futures.awaitAll waits until the list of futures are all resolved OR until the timeout. It will not Future.cancel, Thread.interrupt, or otherwise attempt to terminate a Future; you get to come back later and wait some more.
The Future.cancel may be suitable, however be aware that your code may need to participate in effecting the cancel operation - it doesn't necessarily come for free. Future.cancel cancels a task that is scheduled, but not yet started. It interrupts a running thread [setting a flag that can be checked]... which may or may not acknowledge the interrupt. Review Thread.interrupt and Thread.isInterrupted(). Your long-running task would normally check to see if it's being interrupted (your code), and self-terminate. Various methods (i.e. Thread.sleep, Object.wait and others) respond to the interrupt by throwing InterruptedException. You need to review & understand that mechanism to ensure your code will meet your needs within those constraints. See this.

Practical use of futures? Ie, how to kill them?

Futures are very convenient, but in practice, you may need some guarantees on their execution. For example, consider:
import scala.actors.Futures._
def slowFn(time:Int) = {
Thread.sleep(time * 1000)
println("%d second fn done".format(time))
}
val fs = List( future(slowFn(2)), future(slowFn(10)) )
awaitAll(5000, fs:_*)
println("5 second expiration. Continuing.")
Thread.sleep(12000) // ie more calculations
println("done with everything")
The idea is to kick off some slow running functions in parallel. But we wouldn't want to hang forever if the functions executed by the futures don't return. So we use awaitAll() to put a timeout on the futures. But if you run the code, you see that the 5 second timer expires, but the 10 second future continues to run and returns later. The timeout doesn't kill the future; it just limits the join wait.
So how do you kill a future after a timeout period? It seems like futures can't be used in practice unless you're certain that they will return in a known amount of time. Otherwise, you run the risk of losing threads in the thread pool to non-terminating futures until there are none left.
So the questions are: How do you kill futures? What are the intended usage patterns for futures given these risks?
Futures are intended to be used in settings where you do need to wait for the computation to complete, no matter what. That's why they are described as being used for slow running functions. You want that function's result, but you have other stuff you can be doing meanwhile. In fact, you might have many futures, all independent of each other that you may want to run in parallel, while you wait until all complete.
The timer just provides a wait to get partial results.
I think the reason Future can't simply be "killed" is exactly the same as why java.lang.Thread.stop() is deprecated.
While Future is running, a Thread is required. In order to stop a Future without calling stop() on the executing Thread, application specific logic is needed: checking for an application specific flag or the interrupted status of the executing Thread periodically is one way to do it.