Persistent scheduling - scala

I'm currently in need of persistent scheduling for a web app based on play-framework and akka. I know there is actor scheduling in akka, but as far as I know, it provides no mechanism to persist jobs. So, even if pretty much everything fails, jobs have to be loaded, and executed, after a restart. The jobs are generally not going to be periodic.
What kind of system can accomplish those things, and possibly nicely integrate into the existing infrastructure (play, akka)?

There seems to be a project capable of doing "timestamp based persistent scheduling for Akka": https://github.com/odd/akkax-scheduling

We are using Quartz. It is written in Java, but it has a good persistence mechanism that can use either a RAM store or a database (we are using Mongo).

Another alternative is db-scheduler, a persistent cluster-friendly task-scheduler I am the author of. It is easily embeddable in a JVM-app, and requires only a single database-table for persistence. (Note: it is designed for small to medium workloads)

You can try using the scheduling mechanism in Akka.
http://doc.akka.io/docs/akka/2.1.4/scala/scheduler.html
For example:
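import scala.concurrent.duration._
// Use the system's dispatcher as the ExecutionContext for the scheduled task
import system.dispatcher

// Send the current time to testActor once, after 50 ms
system.scheduler.scheduleOnce(50.milliseconds) {
  testActor ! System.currentTimeMillis
}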

Related

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc.) and it works fine: I can intercept the metrics.
The question is about using DataFrames within the listener itself, since that assumes using the same SparkContext, and as of 2.1.x only one context is allowed per JVM.
Suppose I want to write some metrics to disk as JSON. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible or feasible?
I'm trying to measure the performance of jobs/stages/tasks, record it, and then analyze it programmatically. Maybe that is not the best way? The web UI is good, but I need to make things presentable.
I can force the creation of DataFrames on the job-end event, but a few errors are thrown (basically saying that events cannot be propagated to the listener), and in general I would like to avoid unnecessary manipulations. I want a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, since a slow SparkListener blocks others from receiving events. You could use separate threads to free the main event dispatcher thread, but you're still bound by the limitation of a single SparkContext per JVM.
That limitation is, however, easy to overcome, since you can ask for the current SparkContext using SparkContext.getOrCreate.
I would not recommend this architecture, however. It puts too much pressure on the driver's JVM, which should rather focus on the application processing (not on collecting events, which it probably already does for the web UI and/or the Spark History Server).
I'd rather use Kafka, Cassandra, or some other persistent storage to store the events and have a separate processing application consume them (just like the Spark History Server works).
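The "separate thread" hand-off mentioned above could look roughly like the sketch below. It deliberately does not use Spark's API: `JobMetric`, `BufferingRecorder` and `onJobEnd` are hypothetical stand-ins for a listener callback, and `sink` stands in for whatever writes to disk, Kafka, or Cassandra. The only point is that the callback does nothing but enqueue.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

// Hypothetical event type standing in for the metrics a listener receives.
final case class JobMetric(jobId: Int, durationMs: Long)

// Not a real SparkListener: `onJobEnd` only mimics a listener callback.
final class BufferingRecorder(sink: JobMetric => Unit) {
  private val queue  = new LinkedBlockingQueue[JobMetric]()
  private val worker = Executors.newSingleThreadExecutor()
  @volatile private var running = true

  worker.submit(new Runnable {
    def run(): Unit =
      // Keep draining until stop() has been called and the queue is empty.
      while (running || !queue.isEmpty) {
        val metric = queue.poll(100, TimeUnit.MILLISECONDS)
        if (metric != null) sink(metric)
      }
  })

  // Runs on the event dispatcher thread: it only enqueues, so it returns quickly.
  def onJobEnd(metric: JobMetric): Unit = queue.put(metric)

  // Lets the worker flush what is left, then shuts it down.
  def stop(): Unit = {
    running = false
    worker.shutdown()
    worker.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```

The slow part (the `sink`) runs on the worker thread, so a slow write does not hold up event delivery. It does not remove the load from the driver JVM, which is the concern raised above.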

How to tune Play Framework application with proper threadpools?

I am working with Play Framework (Scala) version 2.3. From the docs:
You can’t magically turn synchronous IO into asynchronous by wrapping it in a Future. If you can’t change the application’s architecture to avoid blocking operations, at some point that operation will have to be executed, and that thread is going to block. So in addition to enclosing the operation in a Future, it’s necessary to configure it to run in a separate execution context that has been configured with enough threads to deal with the expected concurrency.
This has me a bit confused on how to tune my webapp. Specifically, since my app has a good amount of blocking calls: a mix of JDBC calls, and calls to 3rd party services using blocking SDKs, what is the strategy for configuring the execution context and determining the number of threads to provide? Do I need a separate execution context? Why can't I simply configure the default pool to have a sufficient amount of threads (and if I do this, why would I still need to wrap the calls in a Future?)?
I know this ultimately will depend on the specifics of my app, but I'm looking for some guidance on the strategy and approach. The play docs preach the use of non-blocking operations everywhere but in reality the typical web-app hitting a sql database has many blocking calls, and I got the impression from reading the docs that this type of app will perform far from optimally with the default configurations.
[...] what is the strategy for configuring the execution context and determining the number of threads to provide
Well, that's the tricky part which depends on your individual requirements.
First of all, you should probably choose a basic profile from the docs (pure asynchronous, highly synchronous, or many specific thread pools).
The second step is to fine-tune your setup by profiling and benchmarking your application.
Do I need a separate execution context?
Not necessarily. But it makes sense to use separate execution contexts if you want to trigger all your blocking IO-calls at once and not in a sequential way (so database call B does not have to wait until database call A is finished).
Why can't I simply configure the default pool to have a sufficient amount of threads (and if I do this, why would I still need to wrap the calls in a Future?)?
You can, check the docs:
play {
  akka {
    akka.loggers = ["akka.event.slf4j.Slf4jLogger"]
    loglevel = WARNING
    actor {
      default-dispatcher = {
        fork-join-executor {
          parallelism-min = 300
          parallelism-max = 300
        }
      }
    }
  }
}
With this approach, you are basically turning Play into a one-thread-per-request model. This is not the idea behind Play, but if you're doing a lot of blocking IO calls, it's the simplest approach. In this case, you don't need to wrap your database calls in a Future.
To put it in a nutshell, you basically have three ways to go:
1. Only use (IO-)technologies whose API calls are non-blocking and asynchronous. This allows you to use a small thread pool / default execution context, which suits the nature of Play.
2. Turn Play into a one-thread-per-request framework by drastically increasing the default execution context. No futures needed; just call your blocking database as always.
3. Create specific execution contexts for your blocking IO calls and gain fine-grained control of what you are doing.
First, before diving in and refactoring your app, you should determine whether this is actually a problem for you. Run some benchmarks (Gatling is superb) and do some profiling with something like JProfiler. If you can live with the current performance, then happy days.
The ideal is to use a reactive driver that returns a future which then gets passed all the way back to your controller. Unfortunately, async support is still an open ticket for Slick. Interacting with REST APIs can be made reactive using the Play WS library, but if you have to go through a library that your third party provides, then you're stuck.
So, assuming that none of these are feasible and that you do need to improve performance, the question is what benefit would Play's suggestion have? I think what they're getting at here is that it's useful to partition your threads into those that block and those that can make use of asynchronous techniques.
If, for instance, only some proportion of your requests are long and blocking then with a single thread pool you risk all threads being used for the blocking operations. Your controller would then not be able to handle any new requests, irrespective of whether that request needs to call a blocking service. If you can allocate enough threads that this never happens then no problem.
If, on the other hand, you are hitting your limit for threads then by using two pools you can keep your fast, non-blocking requests snappy. You would have one pool servicing requests in your controller and calling into services which return futures. Some of these futures would actually be performing work using a separate pool of threads, but only for the blocking operations. If there is any portion of your app which could be made reactive, then your controller could take advantage of this while isolating the controller from the blocking operations.
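A minimal sketch of the two-pool idea, using only the Scala standard library. The pool size of 20, `loadUserBlocking`, and the use of `ExecutionContext.global` as a stand-in for Play's default execution context are all assumptions made for illustration; real numbers would come from benchmarking.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object BlockingPoolSketch {
  // Dedicated pool for blocking calls; the size (20) is an illustrative guess.
  private val blockingPool = Executors.newFixedThreadPool(20)
  val blockingEc: ExecutionContext = ExecutionContext.fromExecutorService(blockingPool)

  // Hypothetical blocking call standing in for JDBC or a blocking SDK.
  def loadUserBlocking(id: Int): String = {
    Thread.sleep(50)
    s"user-$id"
  }

  // The blocking work runs on blockingEc, not on the default pool.
  def loadUser(id: Int): Future[String] =
    Future(loadUserBlocking(id))(blockingEc)

  // Cheap, non-blocking follow-up work stays on the default pool
  // (ExecutionContext.global here, as a stand-in for Play's default context).
  def greeting(id: Int): Future[String] =
    loadUser(id).map(name => s"Hello, $name")(ExecutionContext.global)

  def shutdown(): Unit = blockingPool.shutdown()
}
```

A controller would call `greeting` and return the resulting Future. If all 20 blocking threads are busy, further blocking calls queue up on that pool, while requests that never touch it keep being served by the default pool.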

Event based design - Futures, Promises vs Akka Persistence

I have multiple use cases which require predefined events to be fired based on certain user actions.
E.g., when a NewUser is created in the application, it will have to call CreateUserInWorkflowSystem and FireEmailToTheUser asynchronously. There are many other business cases of this nature where events will be predefined based on a use case. I can use Promises/Futures to model these events as below:
if 'NewUser' then
    call `CreateUserInWorkflowSystem` (which will be Future based API)
    call `FireEmailToTheUser` (which will be Future based API)
if 'FileImport' then
    call `API3` (which will be Future based call)
    call `API4` (which will be Future based call)
All those Future calls will have to log failures somewhere so that failed calls can be retried, etc. Note that the NewUser call won't wait for those Futures (events, per se) to complete.
That was using the plain Futures/Promises APIs. However, I am thinking Akka Persistence would be an appropriate fit here, and blocking calls can still run in Futures. With Akka Persistence, handling failure will be easy, as it provides that out of the box. I understand Akka Persistence is still in the experimental stage, but that doesn't seem to be a big concern, as Typesafe generally keeps new frameworks in an experimental state before promoting them in a future release (the same was true of macros). Given these requirements, do you think Futures/Promises or Akka Persistence is a better fit here?
This is an opinion based question - not the best type to ask on SO. Anyway, trying to answer.
It really depends on what you are more comfortable with and what your requirements are. If you need to scale the system beyond a single JVM later, use Akka. If you want to keep it simpler, use Futures.
If you use Futures you can store all state and actions to execute in a job queue/db. It's quite reasonable.
If you use Akka Persistence, then obviously it will help you with persistence. Akka also makes supervision, recovery, and retries easier. If your CreateUserInWorkflowSystem action fails, the failure is propagated to the supervising actor, which will probably restart the failed actor and make it retry up to N times. If the supervising actor fails, its own supervisor will do the right thing, or eventually the whole app will crash, which is good. With Futures, you would have to implement this mechanism yourself and make sure that the application can crash when needed.
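To give an idea of what "implement this mechanism yourself" means with Futures, here is a minimal retry helper. `RetrySketch` and `retry` are hypothetical names; the sketch retries immediately with no backoff and does not persist anything, so it does not survive a crash.

```scala
import scala.concurrent.{ExecutionContext, Future}

object RetrySketch {
  // Runs `op` up to `attempts` times in total, stopping at the first success.
  // Only failed Futures are retried; an exception thrown while building the
  // Future is not caught here.
  def retry[A](attempts: Int)(op: () => Future[A])(implicit ec: ExecutionContext): Future[A] =
    op().recoverWith {
      case _ if attempts > 1 => retry(attempts - 1)(op)
    }
}
```

For example, `RetrySketch.retry(3)(() => createUserInWorkflowSystem(user))` would try the call at most three times and fail with the last error if every attempt fails, where `createUserInWorkflowSystem` is whatever Future-based API you already have.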
If you have completely independent actions then Futures and Actors sound about the same. If you have to chain actions and compose them, then using Futures will be a somewhat more natural thing to do: for comprehensions, etc. In Akka you would have to wait for a message and based on a type of a message perform next action.
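As a sketch of the chaining case: the two functions below are hypothetical stand-ins for the Future-based APIs named in the question, and the assumption that the email needs the workflow id is mine, made only so there is something to chain.

```scala
import scala.concurrent.{ExecutionContext, Future}

object NewUserFlow {
  // Hypothetical stand-in: pretends the returned number is a workflow id.
  def createUserInWorkflowSystem(user: String)(implicit ec: ExecutionContext): Future[Long] =
    Future(user.length.toLong)

  // Hypothetical stand-in for the email call.
  def fireEmailToTheUser(user: String, workflowId: Long)(implicit ec: ExecutionContext): Future[String] =
    Future(s"mail sent to $user for workflow $workflowId")

  // The second step uses the result of the first, so the steps run in sequence.
  def onNewUser(user: String)(implicit ec: ExecutionContext): Future[String] =
    for {
      workflowId <- createUserInWorkflowSystem(user)
      receipt    <- fireEmailToTheUser(user, workflowId)
    } yield receipt
}
```

If the first step fails, the second one is never started and the failure ends up in the Future returned by `onNewUser`, which is where logging for a later retry could be attached.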
Try to mock a simple implementation using both and compare what you like/dislike given your particular application requirements. Overall, both choices are good, but I'm slightly leaning towards actors in this case.

Akka: Adding a delay to a durable mailbox

I am wondering if there is some way to delay the processing of an Akka message.
My use case: for every request, I have a small amount of work that I need to do, and then I need to do additional work two hours later.
Is there an easy way to delay the processing of a message in Akka? I know I could probably set up an external distributed queue such as ActiveMQ or RabbitMQ, which probably has this feature, but I'd rather not.
I know I would need to make the mailbox durable so it can survive restarts or crashes. We already have Mongo set up, so I will probably be using the MongoBasedMailbox for durability.
Temporal Workflow is capable of supporting your use case with minimal effort. You can think of it as a durable actor platform, in which actor state, including threads and local variables, is preserved across process restarts.
Temporal offers a lot of other features for task processing:
- Built-in exponential retries with an unlimited expiration interval.
- Failure handling. For example, it allows executing a task that notifies another service if both updates couldn't succeed during a configured interval.
- Support for long-running heartbeating operations.
- Ability to implement complex task dependencies, for example chaining of calls or compensation logic in case of unrecoverable failures (SAGA).
- Complete visibility into the current state of the update. For example, when using queues, all you know is whether there are some messages in a queue, and you need an additional DB to track the overall progress. With Temporal, every event is recorded.
- Ability to cancel an update in flight.
- Throttling of requests.
See the presentation that goes over the Temporal programming model. It talks about Cadence which is the predecessor of Temporal.
It's not ideal, but the Akka Camel Quartz scheduler would do the trick. It is more heavyweight than the built-in ActorSystem scheduler, and be aware that Quartz has its own issues.
You could still use the normal Akka scheduler; you would just have to keep state in a persistent actor to avoid losing the job if the server restarts.
I have recently used PersistentFsmActor, which keeps the state of the actor persisted.
I'm not sure you need an FSM (finite state machine) in your case, so you could simply use a PersistentActor to save the time the job was inserted and start a scheduler based on that time. This way, even if you restart the server, the actor will start up and create a new scheduled job, using the persisted data to calculate the time left before running it.
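The "calculate the time left" step can be sketched without Akka. The example below assumes the persisted value is the due time in epoch milliseconds (insert time plus the two-hour delay) and uses a plain `ScheduledExecutorService` as a stand-in for the Akka scheduler; `RescheduleSketch` and its method names are hypothetical.

```scala
import java.util.concurrent.{ScheduledExecutorService, ScheduledFuture, TimeUnit}

object RescheduleSketch {
  // Time left until the persisted due time. Never negative, so a job that
  // became due while the server was down runs immediately after the restart.
  def remainingDelayMs(runAtEpochMs: Long, nowEpochMs: Long): Long =
    math.max(0L, runAtEpochMs - nowEpochMs)

  // Would be called on startup for every job restored from the store.
  def reschedule(scheduler: ScheduledExecutorService, runAtEpochMs: Long)(job: () => Unit): ScheduledFuture[_] =
    scheduler.schedule(
      new Runnable { def run(): Unit = job() },
      remainingDelayMs(runAtEpochMs, System.currentTimeMillis()),
      TimeUnit.MILLISECONDS
    )
}
```

In the persistent-actor version, the same calculation would run when the actor finishes recovery, and the scheduled action would be a message sent back to the actor rather than a `Runnable`.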

Process work in parallel with non-threadsafe function in scala

I have a lot of work (thousands of jobs) for a Scala application to process. Each piece of work is the file name of a 100 MB file. To process each file, I need to use an extractor object that is not thread safe (I can have multiple copies, but copies are expensive, and I should not make one per job). What is the best way to complete this work in parallel in Scala?
You can wrap your extractor in an Actor and send each file name to the actor as a message. Since an instance of an actor will process only one message at a time, thread safety won't be an issue. If you want to use multiple extractors, just start multiple instances of the actor and balance between them (you could write another actor to act as a load balancer).
The extractor actor(s) can then send extracted files to other actors to do the rest of the processing in parallel.
Don't make 1000 jobs, but make 4x250 jobs (targeting 4 threads) and give one extractor to each batch. Inside each batch, work sequentially. This might not be optimal parallel-wise, since one batch might finish earlier but it is very easy to implement.
Probably the correct (but more complicated) solution would be to make a pool of extractors, where jobs take extractors from and put them back after finishing.
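A minimal sketch of such an extractor pool, using a blocking queue to borrow and return instances. `Extractor` is a hypothetical stand-in for the real, expensive class (its `extract` just returns the length of the file name), and the pool size would have to be chosen by measurement.

```scala
import java.util.concurrent.{ArrayBlockingQueue, Executors}
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stand-in for the expensive, non-thread-safe extractor.
final class Extractor {
  def extract(fileName: String): Int = fileName.length
}

object ExtractorPoolSketch {
  def processAll(fileNames: Seq[String], poolSize: Int): Future[Seq[Int]] = {
    // Only `poolSize` extractors are ever created, however many jobs there are.
    val extractors = new ArrayBlockingQueue[Extractor](poolSize)
    (1 to poolSize).foreach(_ => extractors.put(new Extractor))

    val threads = Executors.newFixedThreadPool(poolSize)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(threads)

    val results = Future.traverse(fileNames) { name =>
      Future {
        val extractor = extractors.take() // borrow; blocks if none is free
        try extractor.extract(name)
        finally extractors.put(extractor) // always hand it back
      }
    }
    // Release the threads once every job has finished.
    results.onComplete(_ => threads.shutdown())
    results
  }
}
```

Each extractor is used by at most one thread at a time because a job holds it from `take()` until `put()`. Unlike fixed batches, a thread that finishes early simply picks up the next job.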
I would make a thread pool, where each thread has an instance of the extractor class, and instantiate just as many of these threads as it takes to saturate the system (based on CPU usage, IO bandwidth, memory bandwidth, network bandwidth, contention for other shared resources, etc.). Then use a thread-safe work queue that these threads can pull tasks from, process them, and iterate until the container is empty.
Mind you, there should be one or several libraries in just about any modern language that implement exactly this. In C++, it would be Intel's Threading Building Blocks. In Objective-C, it would be Grand Central Dispatch.
It depends: what's the relative amount of CPU consumed by the extractor for each job?
If it is very small, you have a classic single-producer/multiple-consumer problem, for which you can find lots of solutions in different languages. For Scala, if you are reluctant to start using actors, you can still use the Java API (Runnable, Executors and BlockingQueue are quite good).
If it is a substantial amount (more than 10%), your app will never scale with a multithreaded model (see Amdahl's law). You may prefer to run several processes (several JVMs) to obtain thread safety, and thus eliminate the non-sequential part.
First question: how quickly does the work need to be completed?
Second question: would this work be isolated to a single physical box, or what are your upper bounds on computational resources?
Third question: does the work that needs doing to each individual "job" require blocking and is it serialised or could be partitioned into parallel packets of work?
Maybe think about a distributed model where you scale by designing, from the first instance, with a mind to pushing work out across multiple nodes: actors, remote refs, all that crap first. Try to keep your logic simple and easy, so serialised. Don't just think in terms of a single box.
Most answers here seem to dwell on the intricacies of spawning thread pools and executors and all that stuff - which is fine, but be sure you have a handle on the real problem first, before you start complicating your life with lots of thinking around how you manage the synchronisation logic.
If a problem can be decomposed, then decompose it. Don't overcomplicate it for the sake of doing so: keeping it simple leads to better engineered code and fewer sleepless nights.