Can Spring Batch be used as a job framework for non-batch jobs (regular jobs)? - spring-batch

Is it possible to use Spring Batch as a regular job framework?
I want to create a device service (microservice) that has the responsibility
to receive events and trigger jobs on devices. The devices are remote, so it will take time for a job to complete, but it is not a batch job (it does not run periodically or partition a large data set).
I am wondering whether Spring Batch can still be used as a job framework, or whether it is only for batch processing. If the answer is no, which job frameworks (besides writing your own) are well known?
Job Description:
I need to execute, against a specific device, a job that will contain several steps. Each step will communicate with the device and wait for it to confirm that it executed the previous command.
I need retry, recovery and scheduling features (I thought of combining Spring Batch with Quartz).
Regarding read-process-write: I basically receive a command request for a device, do a few DB reads, and then enter long waiting periods that all need to pass for the job/task to be successful.
Also, I can choose (and justify) the relevant IMDG/DB. Concurrency is outside the scope (it will be handled outside the job mechanism). An alternative that came to mind was Akka actors (the job for a device would create child actors as steps).

As far as I know, periodic execution and partitioning of large data sets are not prerequisites for using Spring Batch.
Spring Batch is basically a read-process-write framework where reading and processing happen item by item and writing happens in chunks (for chunk-oriented processing).
So you can use Spring Batch if your job logic fits the read-process-write paradigm; the rest seems secondary to me.
With Spring Batch you should also evaluate the job repository: Spring Batch needs a database (either in memory or on disk) to store job metadata, and it is not optional.
I think you should explain in more detail why you need a job framework and what kind of logic you are running that you call a job, and I will revise my answer accordingly.
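For illustration, here is a minimal sketch of what a job for your use case could look like with Java config. It assumes Spring Batch 4 with @EnableBatchProcessing and an embedded database (e.g. H2) backing the job repository; the device interaction inside the tasklet is hypothetical:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class DeviceJobConfig {

    @Bean
    public Job deviceJob(JobBuilderFactory jobs, StepBuilderFactory steps) {
        // One tasklet step per command sent to the device
        Step sendCommand = steps.get("sendCommand")
                .tasklet((contribution, chunkContext) -> {
                    // send the command and block until the device confirms (assumed logic)
                    return RepeatStatus.FINISHED;
                })
                .build();

        return jobs.get("deviceJob")
                .start(sendCommand)
                .build();
    }
}

Each additional command would become another step chained with next(...); a non-chunk job like this is perfectly legal in Spring Batch, it simply does not use the reader/processor/writer machinery.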

Related

Scheduler Processing using Spring batch

We have a requirement to process millions of records using Spring Batch. We plan to do this by reading the DB using JdbcPagingItemReaderBuilder, processing in chunks, and writing to a Kafka queue. The active consumers of the queue will process the chunks of data and update the DB.
The consumer's task is to iterate over every item in the chunk and invoke the external APIs.
In case the external system is down or does not respond with a success response, there should be at least 3 retries; considering that each task in the chunk has to do this, what would be the ideal approach?
Another use case to consider: what happens when the system goes down while the job is processing, say after the job has already processed 10,000 records and the remaining records are yet to be processed? After the restart, how do we make sure the execution does not restart the entire process from the beginning but resumes from the point of failure?
Spring Batch creates the following tables. You can use them to check the status of your job and customize your scheduler to behave in a way you see fit.
I'd use the step execution ID in BATCH_STEP_EXECUTION to check the status that's set and then retry based on that status, or something along those lines (see the query sketch after the table list).
BATCH_JOB_EXECUTION
BATCH_JOB_EXECUTION_CONTEXT
BATCH_JOB_EXECUTION_PARAMS
BATCH_JOB_INSTANCE
BATCH_STEP_EXECUTION
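As a minimal sketch, you could read a step's status straight from the metadata table with Spring's JdbcTemplate (the table and column names below are the standard Spring Batch schema; the JobExplorer API is the higher-level alternative):

import org.springframework.jdbc.core.JdbcTemplate;

public class StepStatusChecker {

    private final JdbcTemplate jdbcTemplate;

    public StepStatusChecker(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Returns the status (e.g. COMPLETED, FAILED) recorded for a step execution
    public String statusOf(long stepExecutionId) {
        return jdbcTemplate.queryForObject(
                "SELECT STATUS FROM BATCH_STEP_EXECUTION WHERE STEP_EXECUTION_ID = ?",
                String.class, stepExecutionId);
    }
}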

Spring Boot - running scheduled jobs as separate process

I have a Spring Boot application which also has a few scheduled jobs. I don't see any functional issue with the implementation; one of the jobs runs almost every second for real-time updates. There are other jobs as well.
I suspect there is a performance issue, especially when a long-running API call hits the controller.
// Heavy Job
@Scheduled(fixedRate = 10000)
public void processAlerts(){
}

@Scheduled(fixedDelayString = "${process.events.interval}")
public void triggerTaskReadiness() throws IOException {
    log.info("Trigger event processing job");
}

// Heavy Job to process data from different tables.
@Scheduled(fixedDelayString = "${app.status.interval}")
public void triggerUpdateAppHealth() throws IOException {
    log.info("Trigger application health");
}
Is it possible to have the jobs run as a separate process? What are the best practices for a Spring Boot application with heavy jobs?
The question is way too general, IMO. It all depends on your resources and what exactly the job does.
Spring Boot provides a general-purpose scheduling mechanism but doesn't make any assumptions about the nature of the job.
All in all, it's true that when you run a heavy job, CPU, network, I/O and other resources are consumed (again, depending on the actual code of your job).
If you run it externally, another process will basically consume the same resources, assuming it is run on the same server.
From the Spring Boot standpoint I can say the following:
It looks like the job deals with a database. In this case Spring Boot supports integration with data sources, connection pooling, transaction management, and higher-level APIs like JPA or even Spring Data; you can also plug in frameworks like jOOQ. Bottom line, it makes the actual work with the database much easier.
You've stated MongoDB in the question tag - Spring also has MongoDB integration in Spring Data.
Bottom line: if you're running the job in an external process you're kind of on your own (which doesn't mean it can't be done, it just means you lose all the goodies Spring has up its sleeve).
App health - Spring Boot already provides the Actuator feature, which has an endpoint for DB health; it also provides a way to create your own endpoints to check the health of any concrete resource (you implement it in code, so you are free to check it however you want). Make sure you're using the right tool for the right job.
Regarding the controller API: if you're running with traditional Spring MVC, Tomcat has a thread pool to serve the API, so from the thread-management point of view the job threads won't be competing with the controller threads; however, they'll likely share the same DB connection pool, so that can become a bottleneck.
Regarding the implementation of @Scheduled: by default there is one thread to serve all the @Scheduled jobs, which might be insufficient.
You can alter this behavior by creating your own taskScheduler:
@Bean(destroyMethod = "shutdown")
public ScheduledExecutorService taskScheduler() {
    return Executors.newScheduledThreadPool(10); // allocate 10 threads to run @Scheduled jobs
}
You might be interested in reading this discussion.
Spring's @Scheduled always works "within the boundaries" of one Spring-managed application context, so if you decide to scale out to multiple instances, each and every instance will run the "scheduled" code and execute the heavy jobs.
It's possible to use Quartz (with which Spring can be integrated); thanks to its clustered mode you can configure it so that only one node picks up and executes the job each time, but since you're planning to run every second, I doubt Quartz will work well enough.
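If you do go the Quartz route, a minimal sketch of a clustered scheduler in Spring could look like this (assuming a shared DataSource with the Quartz tables installed; the property names are the standard Quartz job-store settings):

import java.util.Properties;
import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.quartz.SchedulerFactoryBean;

@Configuration
public class QuartzClusterConfig {

    @Bean
    public SchedulerFactoryBean scheduler(DataSource dataSource) {
        SchedulerFactoryBean factory = new SchedulerFactoryBean();
        factory.setDataSource(dataSource); // clustering requires a shared JDBC job store
        Properties props = new Properties();
        props.setProperty("org.quartz.jobStore.isClustered", "true");
        props.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        factory.setQuartzProperties(props);
        return factory;
    }
}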
A general observation: running a set of "heavy" jobs, as you say, doesn't really go together with "running every second". It just doesn't sound reasonable, since heavy jobs tend to last much longer than one second, so doing this will eventually occupy all the resources and you won't be able to run more jobs.

Spring Batch and Executors Framework

Are these two frameworks used for the same purpose? If not, why, and how are they used in real-world applications? Are there any tutorials for learning them?
Spring Batch is meant for batch processing of files by executing them as a series of jobs. Batch processing could be reading from CSV, XML, or any flat file and writing to a DB. Spring Batch provides many ready-made classes to read/write CSV, XML, and databases.
http://www.mkyong.com/tutorials/spring-batch-tutorial/
The Java ExecutorService, on the other hand, is all about spawning multiple threads in a thread pool and running tasks on them for any purpose, be it batch work or anything else; here you have lower-level control over the threading itself. It has been available since Java 5. There are also several submission methods, depending on whether or not you want to obtain a result from a Future object.
http://tutorials.jenkov.com/java-util-concurrent/executorservice.html
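For a quick feel of the ExecutorService API, here is a minimal sketch (the tasks themselves are just placeholders):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorDemo {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // submit(Callable) returns a Future you can block on for the result
        Future<Integer> result = pool.submit(() -> 2 + 2);

        // submit(Runnable) is fire-and-forget; there is no meaningful result
        pool.submit(() -> System.out.println("side effect only"));

        System.out.println("computed: " + result.get()); // blocks until the task completes
        pool.shutdown();
    }
}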

Retry failed writing operations without delaying other steps in Spring Batch application

I am maintaining a legacy application written with Spring Batch and need to tweak it so that it never loses data.
I have to read from various web services (one for each step) and then write to a remote database. Things go bad when the connection with the DB drops, because all items read from the web service are discarded (I can't read the same item twice) and the data is lost because it cannot be written.
I need to set up Spring Batch to keep the already-read data in a step so the write operation can be retried the next time the step runs. The same step must not read more data until the write operation has completed successfully.
When it cannot write, the step should keep the read data and pass execution on to the next step; after a while, when it is time for the failed step to run again, it should not read another item but retry the failed write operation instead.
The batch application should run in an infinite loop and each step should gather data from a different source. Failed write operations should be momentarily skipped (keeping the read data) so as not to delay other steps, but should resume from the write operation the next time they are called.
I have been researching various web sources aside from the official docs, but Spring Batch does not have the most intuitive docs I have come across.
Can this be achieved? If yes, how?
You can write the data you need to persist, in case the job fails, to the batch step's ExecutionContext. You can then restart the job with this data:
Step executions are represented by objects of the StepExecution class. Each execution contains a reference to its corresponding step and JobExecution, and transaction related data such as commit and rollback count and start and end times. Additionally, each step execution will contain an ExecutionContext, which contains any data a developer needs persisted across batch runs, such as statistics or state information needed to restart.
More from: http://static.springsource.org/spring-batch/reference/html/domain.html#domainStepExecution
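As a minimal sketch (the key name and the "last written id" state are hypothetical), a StepExecutionListener can save and restore such state through the step's ExecutionContext:

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;

public class SaveStateListener implements StepExecutionListener {

    private long lastWrittenId; // kept up to date by the step's writer in this sketch

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // on a restart, pick up the value persisted by the previous (failed) execution
        if (stepExecution.getExecutionContext().containsKey("last.written.id")) {
            lastWrittenId = stepExecution.getExecutionContext().getLong("last.written.id");
        }
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // persist the value so it is available after a restart
        stepExecution.getExecutionContext().putLong("last.written.id", lastWrittenId);
        return stepExecution.getExitStatus();
    }
}

The listener would be registered on the step (e.g. via the step builder's listener(..) method), and the job has to be launched again with the same job parameters for Spring Batch to treat it as a restart.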
I do not know if this will be ok with you, but here are my thoughts on your configuration.
Since you have two remote sources that are open to failure, let us partition the overall system into two jobs (not two steps).
JOB A
Step 1: Tasklet
Check a shared folder for files. If files exist, do not proceed to the next step. This will make more sense in the description of JOB B.
Step 2: Webservice to files
Read from your web service and write the results to flat files in the shared folder. Since you would be using flat files for the output, you solve the "all items read from the webservice are discarded and the data is lost because it cannot be written" problem.
Use Quartz or equivalent for the scheduling of this job.
JOB B
Poll the shared folder for generated files and launch the job with a JobLauncher, passing the file (file.getWhere) as a job parameter. The Spring Integration project may help with this polling.
Step 1:
Read from the file, write the items to the remote DB, and move/delete the file if the write to the DB succeeds.
No scheduling is needed since job launching originates from the polled files.
Sample Execution
Time 0: No file in the shared folder
Time 1: Read from web service and write to shared folder
Time 2: Job B file polling occurs, tries to write to db.
If successful, the system continues to execute.
If not, when Job A tries to execute at its scheduled time, it will skip reading from the web service since files still exist in the shared folder. It will keep skipping until Job B consumes the files.
I did not want to go into implementation specifics but Spring Batch can handle all of these situations. Hope that this helps.
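As an illustration of Job A's first step, here is a minimal sketch of the guard tasklet (the folder path and the FILES_WAITING exit code are hypothetical; a conditional transition on that exit code would end the job before the web service step runs):

import java.io.File;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class SharedFolderGuardTasklet implements Tasklet {

    private final File sharedFolder = new File("/data/shared"); // assumed location

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        File[] pending = sharedFolder.listFiles();
        if (pending != null && pending.length > 0) {
            // files from a previous run are still waiting for Job B:
            // signal the flow so it can end without calling the web service again
            contribution.setExitStatus(new ExitStatus("FILES_WAITING"));
        }
        return RepeatStatus.FINISHED;
    }
}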

High Throughput and Windows Workflow Foundation

Can WWF handle high throughput scenarios where several dozen records are 'actively' being processed in parallel at any one time?
We want to build a workflow process which handles a few thousand records per hour. Each record takes up to a minute to process, because it makes external web service calls.
We are testing Windows Workflow Foundation to do this, but our demo programs show that the records appear to be processed in sequence rather than in parallel, even when we use parallel activities to process several records at once within one workflow instance.
Should we use multiple workflow instances or parallel activities?
Are there any known patterns for high performance WWF processing?
You should definitely use a new workflow per record. Each workflow only gets one thread to run in, so even with a ParallelActivity they'll still be handled sequentially.
I'm not sure about the performance of Windows Workflow, but from what I heard about .NET 4 at Tech-Ed, its Workflow components will be dramatically faster than the ones in .NET 3.0 and 3.5. So if you really need a lot of performance, maybe you should consider waiting for .NET 4.0.
Another option could be to consider BizTalk. But it's pretty expensive.
I think the common pattern is to use one workflow instance per record. The workflow runtime runs multiple instances in parallel.
One workflow instance runs on one thread at a time. The parallel activity calls the Execute method of each activity sequentially on this single thread. You may still get a performance improvement from a parallel activity, however, if the activities are asynchronous and spend most of their time waiting for an external process to finish its work. E.g. if an activity calls an external web method and then waits for a reply, it returns from Execute and does not occupy the thread while waiting, so another activity in the Parallel group can start its own work (e.g. also a call to a web service) at the same time.