Spring Boot - running scheduled jobs as a separate process - MongoDB

I have a Spring Boot application which also has a few scheduled jobs. I don't see any functional issue with the implementation; one of the jobs runs almost every second for real-time updates, and there are other jobs as well.
I suspect there is a performance issue, especially when a long-running API request hits the controller.
// Heavy job
@Scheduled(fixedRate = 10000)
public void processAlerts() {
}

@Scheduled(fixedDelayString = "${process.events.interval}")
public void triggerTaskReadiness() throws IOException {
    log.info("Trigger event processing job");
}

// Heavy job to process data from different tables
@Scheduled(fixedDelayString = "${app.status.interval}")
public void triggerUpdateAppHealth() throws IOException {
    log.info("Trigger application health");
}
Is it possible to run the jobs as a separate process? What are the best practices for a Spring Boot application with heavy jobs?

The question is way too general, IMO. It all depends on your resources and what exactly the job does.
Spring Boot provides a general-purpose scheduling mechanism but doesn't make any assumptions about the nature of the job.
All in all, it's true that when you run a heavy job, CPU, network, I/O and whatever other resources are consumed (again, depending on the actual code of your job).
If you run it externally, another process will basically consume the same resources, assuming it's run on the same server.
From the spring boot standpoint I can say the following:
It looks like the jobs deal with a database. In this case Spring Boot supports integration with DataSources, connection pooling, transaction management, and higher-level APIs like JPA or even Spring Data; you can also plug in frameworks like jOOQ. Bottom line, it makes the actual work with the database much easier.
You've stated MongoDB in the question tag; Spring also has MongoDB integration via Spring Data.
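To give a feel for how thin that layer can be, here is a minimal sketch of a Spring Data MongoDB repository (the Alert document class, repository name and query method are made up for the example):

import org.springframework.data.mongodb.repository.MongoRepository;
import java.util.List;

public interface AlertRepository extends MongoRepository<Alert, String> {
    // query derived from the method name, no implementation needed
    List<Alert> findByStatus(String status);
}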
Bottom line: if you're running the job in an external process you're kind of on your own (which doesn't mean it can't be done, it just means you lose all the goodies Spring has up its sleeve).
AppHealth: Spring Boot already provides the Actuator feature, which has a database health endpoint; it also provides a way to create your own endpoints to check the health of any concrete resource (you implement it in code, so you have the freedom to check however you want). Make sure you're using the right tool for the right job.
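For illustration, a custom health contribution is just a bean implementing HealthIndicator (a minimal sketch; the indicator name and the resource it probes are made up):

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class EventQueueHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        // hypothetical probe of whatever resource you care about
        boolean reachable = checkEventQueue();
        return reachable
                ? Health.up().build()
                : Health.down().withDetail("reason", "event queue unreachable").build();
    }

    private boolean checkEventQueue() {
        return true; // replace with a real check
    }
}

It then shows up under the Actuator health endpoint alongside the built-in database check.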
Regarding the controller API: if you're running with traditional Spring MVC, Tomcat has a thread pool to serve the API, so from the thread-management point of view the job threads won't be competing with the controller threads; however, they'll likely share the same DB connection pool, so that can become a bottleneck.
Regarding the implementation of @Scheduled: by default there will be one thread to serve all the @Scheduled jobs, which might be insufficient.
You can alter this behavior by creating your own TaskScheduler:
@Bean(destroyMethod = "shutdown")
public Executor taskScheduler() {
    return Executors.newScheduledThreadPool(10); // allocate 10 threads to run @Scheduled jobs
}
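If you prefer to stay with Spring's own abstraction, a ThreadPoolTaskScheduler bean achieves the same (a sketch; the pool size and thread-name prefix are illustrative) and is picked up by the @Scheduled infrastructure automatically:

@Bean
public ThreadPoolTaskScheduler taskScheduler() {
    ThreadPoolTaskScheduler scheduler = new ThreadPoolTaskScheduler();
    scheduler.setPoolSize(10);                  // 10 threads for @Scheduled jobs
    scheduler.setThreadNamePrefix("scheduled-");
    return scheduler;
}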
You might be interested in reading this discussion.
Spring's @Scheduled always works "within the boundaries" of one Spring-managed application context, so if you decide to scale out your instances, each and every instance will run the "scheduled" code and execute the heavy jobs.
It's possible to use Quartz, with which Spring can be integrated. In clustered mode you can configure it to pick one node each time and execute the job there, but since you're planning to run every second, I doubt Quartz will work well enough.
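For reference, Quartz clustering is driven by its JDBC job-store configuration, roughly like this (a sketch of quartz.properties; the instance name and check-in interval are illustrative, and a shared database with the Quartz tables is assumed):

org.quartz.scheduler.instanceName = ClusteredScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000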
A general observation: running a set of "heavy" jobs, as you put it, doesn't really go together with "running every second". It just doesn't sound reasonable, since heavy jobs tend to last much longer than one second, so doing this will eventually occupy all the resources and you won't be able to run more jobs.

Related

Can Spring Batch be used as a job framework for non-batch jobs (regular jobs)?

Is it possible to use Spring Batch as a regular job framework?
I want to create a device service (microservice) that has the responsibility to get events and trigger jobs on devices. The devices are remote, so it will take time for a job to complete, but it is not a batch job (not periodically running and not partitioning a large data set).
I am wondering whether Spring Batch can still be used as a job framework, or if it is only for batch processing. If the answer is no, which job frameworks (besides writing your own) are well known?
Job Description:
I need to execute, against a specific device, a job that will contain several steps. Each step will communicate with the device and wait for it to confirm that it executed the previous command given to it.
I need retry, recovery and scheduling features (I thought of combining Spring Batch with Quartz).
Regarding read-process-write: I basically get a command request regarding a device, do a few DB reads and then start long waiting periods that all need to pass in order for the job/task to be successful.
Also, I can choose (and justify) the relevant IMDG/DB. Concurrency is outside the scope (it will be handled outside the job mechanism). An alternative that came to mind was Akka actors (the job for a device would create child actors as steps).
As far as I know, periodic execution and partitioning of large data sets are not prerequisites for using Spring Batch.
Spring Batch is basically a read-process-write framework where reading and processing happen item by item and writing happens in chunks (for chunk-oriented processing).
So you can use Spring Batch if your job logic fits into the read-process-write paradigm; the rest of the concerns seem secondary to me.
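For instance, a job whose steps fit read-process-write can be declared roughly like this (a sketch in the Spring Batch 4 builder style; the DeviceCommand/DeviceResult types, bean names, chunk size and retry settings are all illustrative):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class DeviceJobConfig {

    @Bean
    public Step deviceStep(StepBuilderFactory steps,
                           ItemReader<DeviceCommand> reader,
                           ItemProcessor<DeviceCommand, DeviceResult> processor,
                           ItemWriter<DeviceResult> writer) {
        return steps.get("deviceStep")
                .<DeviceCommand, DeviceResult>chunk(10) // read/process item by item, write in chunks of 10
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .faultTolerant()
                .retry(Exception.class)                 // built-in retry support
                .retryLimit(3)
                .build();
    }

    @Bean
    public Job deviceJob(JobBuilderFactory jobs, Step deviceStep) {
        return jobs.get("deviceJob").start(deviceStep).build();
    }
}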
Also, with Spring Batch you should evaluate the Job Repository part: Spring Batch needs a database (either in-memory or on disk) to store job metadata, and that is not optional.
I think you should explain more about why you need a job framework and what kind of logic you are running that you call a job, and I will revise my answer accordingly.

How do I configure the FAIR scheduler with Spark-Jobserver?

When I post simultaneous jobserver requests, they always seem to be processed in FIFO mode. This is despite my best efforts to enable the FAIR scheduler. How can I ensure that my requests are always processed in parallel?
Background: On my cluster there is one SparkContext to which users can post requests to process data. Each request may act on a different chunk of data but the operations are always the same. A small one-minute job should not have to wait for a large one-hour job to finish.
Intuitively I would expect the following to happen (see my configuration below):
The context runs within a FAIR pool. Every time a user sends a request to process some data, Spark should split up the fair pool and give a fraction of the cluster resources to process that new request. Each request is then run in FIFO mode parallel to any other concurrent requests.
Here's what actually happens when I run simultaneous jobs:
The interface says "1 Fair Scheduler Pools" and it lists one active (FIFO) pool named "default." It seems that everything is executing within the same FIFO pool, which itself is running alone within the FAIR pool. I can see that my fair pool details are loaded correctly on Spark's Environment page, but my requests are all processed in FIFO fashion.
How do I configure my environment/application so that every request actually runs in parallel to others? Do I need to create a separate context for each request? Do I create an arbitrary number of identical FIFO pools within my FAIR pool and then somehow pick an empty pool every time a request is made? Considering the objectives of Jobserver, it seems like this should all be automatic and not very complicated to set up. Below are some details from my configuration in case I've made a simple mistake.
From local.conf:
contexts {
  mycontext {
    spark.scheduler.mode = FAIR
    spark.scheduler.allocation file = /home/spark/job-server-1.6.0/scheduler.xml
    spark.scheduler.pool = fair_pool
  }
}
From scheduler.xml:
<?xml version="1.0"?>
<allocations>
  <pool name="fair_pool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
  </pool>
</allocations>
Thanks for any ideas or pointers. Sorry for any confusion with terminology - the word "job" has two meanings in jobserver.
I was looking at my configuration and found that spark.scheduler.allocation file should be spark.scheduler.allocation.file, and that all the values should be quoted, like:
contexts {
  mycontext {
    spark.scheduler.mode = "FAIR"
    spark.scheduler.allocation.file = "/home/spark/job-server-1.6.0/scheduler.xml"
    spark.scheduler.pool = "fair_pool"
  }
}
Also ensure that mycontext is created and that you are passing mycontext when submitting a job.
You can also verify whether mycontext is using the FAIR scheduler via the Spark Master UI.
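For what it's worth, outside of Jobserver plain Spark routes work into a pool via a per-thread local property on the context, so you can sanity-check the pool assignment yourself (a sketch; sc stands for whatever SparkContext or JavaSparkContext is in use):

// jobs submitted from the current thread go to the pool defined in scheduler.xml
sc.setLocalProperty("spark.scheduler.pool", "fair_pool");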

Service with background jobs, how to ensure jobs only run periodically ONCE per cluster

I have a play framework based service that is stateless and intended to be deployed across many machines for horizontal scaling.
This service is handling HTTP JSON requests and responses, and is using CouchDB as its data store again for maximum scalability.
We have a small number of background jobs that need to run every X seconds across the whole cluster. It is vital that the jobs do not execute concurrently on every machine; each job should run only once in the cluster per interval.
To execute the jobs we're using Actors and the Akka Scheduler (since we're using Scala):
Akka.system().scheduler.schedule(
  Duration.create(0, TimeUnit.MILLISECONDS),
  Duration.create(10, TimeUnit.SECONDS),
  Akka.system().actorOf(LoggingJob.props),
  "tick")

(etc)

object LoggingJob {
  def props = Props[LoggingJob]
}

class LoggingJob extends UntypedActor {
  override def onReceive(message: Any) {
    Logger.info("Job executed! " + message.toString())
  }
}
Is there:
any built-in trickery in Akka/Actors/Play that I've missed that will do this for me?
OR a recognised algorithm that I can put on top of Couchbase (distributed mutex? not quite?) to do this?
I do not want to make any of the instances 'special' as it needs to be very simple to deploy and manage.
Check out Akka's Cluster Singleton Pattern.
For some use cases it is convenient and sometimes also mandatory to ensure that you have exactly one actor of a certain type running somewhere in the cluster.
Some examples:
single point of responsibility for certain cluster-wide consistent decisions, or coordination of actions across the cluster system
single entry point to an external system
single master, many workers
centralized naming service, or routing logic
Using a singleton should not be the first design choice. It has several drawbacks, such as single-point of bottleneck. Single-point of failure is also a relevant concern, but for some cases this feature takes care of that by making sure that another singleton instance will eventually be started.
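Wiring it up means wrapping the actor in a ClusterSingletonManager on every node; a minimal sketch using Akka's Java API (assuming the akka-cluster-tools module and a configured cluster; the singleton name is illustrative):

ActorSystem system = Akka.system();
system.actorOf(
    ClusterSingletonManager.props(
        Props.create(LoggingJob.class),             // the actor that should exist once per cluster
        PoisonPill.getInstance(),                   // termination message used during hand-over
        ClusterSingletonManagerSettings.create(system)),
    "loggingJobSingleton");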

Persistent scheduling

I'm currently in need of persistent scheduling for a web app based on Play Framework and Akka. I know there is actor scheduling in Akka, but as far as I know, it provides no mechanism to persist jobs. So, even if pretty much everything fails, jobs have to be reloaded and executed after a restart. The jobs are generally not going to be periodic.
What kind of system can accomplish those things, and possibly nicely integrate into the existing infrastructure (play, akka)?
There seems to be a project capable of doing "timestamp based persistent scheduling for Akka": https://github.com/odd/akkax-scheduling
We are using Quartz. It's written in Java, but it has a good persistence mechanism which can use either a RAM store or a database (we are using Mongo).
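For completeness, scheduling a one-off Quartz job that survives restarts looks roughly like this (a sketch; the job class, group and identity names are illustrative, and durability comes from pointing the JobStore at a database in quartz.properties):

Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
scheduler.start();

JobDetail job = JobBuilder.newJob(SendReminderJob.class)
        .withIdentity("sendReminder", "jobs")
        .build();

Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity("sendReminderTrigger", "jobs")
        .startAt(DateBuilder.futureDate(30, DateBuilder.IntervalUnit.MINUTE)) // one-shot, not periodic
        .build();

scheduler.scheduleJob(job, trigger);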
Another alternative is db-scheduler, a persistent cluster-friendly task-scheduler I am the author of. It is easily embeddable in a JVM-app, and requires only a single database-table for persistence. (Note: it is designed for small to medium workloads)
You can try using the scheduling mechanism in Akka.
http://doc.akka.io/docs/akka/2.1.4/scala/scheduler.html
For example:
//Schedules a function to be executed (send the current time) to the testActor after 50ms
system.scheduler.scheduleOnce(50 milliseconds) {
testActor ! System.currentTimeMillis
}

High Throughput and Windows Workflow Foundation

Can WWF handle high throughput scenarios where several dozen records are 'actively' being processed in parallel at any one time?
We want to build a workflow process which handles a few thousand records per hour. Each record takes up to a minute to process, because it makes external web service calls.
We are testing Windows Workflow Foundation to do this. But our demo programs show processing of each record appear to be running in sequence not in parallel, when we use parallel activities to process several records at once within one workflow instance.
Should we use multiple workflow instances or parallel activities?
Are there any known patterns for high performance WWF processing?
You should definitely use a new workflow per record. Each workflow only gets one thread to run in, so even with a ParallelActivity they'll still be handled sequentially.
I'm not sure about the performance of Windows Workflow, but from what I heard about .NET 4 at Tech-Ed, its Workflow components will be dramatically faster than the ones from .NET 3.0 and 3.5. So if you really need a lot of performance, maybe you should consider waiting for .NET 4.0.
Another option could be to consider BizTalk. But it's pretty expensive.
I think the common pattern is to use one workflow instance per record. The workflow runtime runs multiple instances in parallel.
One workflow instance runs one thread at a time. The parallel activity calls the Execute method of each activity sequentially on this single thread. You may still get a performance improvement from a parallel activity, however, if the activities are asynchronous and spend most of their time waiting for an external process to finish its work. E.g. if an activity calls an external web method and then waits for the reply, it returns from Execute and does not occupy the thread while waiting, so another activity in the Parallel group can start its work (e.g. also a call to a web service) at the same time.