Better job scheduler - spring-batch

We are trying to implement a few batch jobs using Spring Batch. Our application is batch-heavy, and the jobs currently live in shell scripts that we want to migrate to Spring Batch. We are looking for a scheduler with monitoring.
We are evaluating various schedulers such as Spring Cloud Data Flow, Airflow, and Argo.
We are checking the feasibility of running these jobs on both Kubernetes and OpenShift.
We are not sure which one fits best. Can someone suggest which one we should go for?
Things we expect:
List batch jobs and the stage each one is in, along with logs
Monitor jobs (Dynatrace/Prometheus)
Complex cron job scheduling (flexibility similar to that of Unix cron jobs)

Related

Is it possible in Airflow to run a single task on multiple worker nodes, i.e. to run a task in a distributed way?

I am using Spring Batch to create a workflow of batch jobs. A single batch job takes 2 hours to complete (data to be processed: ~1 million), so I decided to run it in a distributed way, where one task is spread across multiple worker nodes so that it finishes sooner. The other jobs in the workflow are also multi-node distributed jobs (master/slave architecture), and they need to run sequentially, one after another.
I am now considering deploying the workflow on Airflow. While exploring it, I could not find any way to run a single task that is distributed across multiple machines. Is this possible in Airflow?
Yes, you can create the task using the Spark framework. Spark lets you process the data on multiple nodes in a distributed fashion.
You can then use SparkSubmitOperator to run that task from your DAG.

What is the current recommended approach to manage/stop a spring-batch job?

We have some Spring Batch jobs that are triggered by AutoSys through shell scripts as short-lived processes.
Right now there is no way to see what is going on inside the Spring Batch process, so I am exploring ways to view the status of the jobs and manage (stop) them.
Spring Cloud Data Flow is one of the options I explored, but it seems it may not work when jobs are scheduled with AutoSys.
What other options can I explore, and what is the currently recommended approach to manage Spring Batch jobs?
To stop a job, you first need the ID of the job execution to stop. You can get it through the JobExplorer API, which lets you explore the meta-data that Spring Batch keeps in the job repository. Once you have the job execution ID, you can stop the execution by calling the JobOperator#stop method. Please refer to the Stopping a job section of the reference documentation.
This is independent of any method you used to launch the job (either manually, or via a scheduler or a graphical tool) and allows you to gracefully stop a job and leave the repository in a consistent state (ready for a restart if needed).
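The lookup-then-stop flow described above can be sketched as follows. To keep the sketch compilable without Spring on the classpath, the two collaborators are plain functional interfaces; `JobStopper`, `runningExecutionIds` and `stopRequest` are hypothetical names, not Spring Batch types. In a real application the first could be backed by JobExplorer#findRunningJobExecutions (mapping each execution to its ID) and the second by JobOperator#stop, with its checked exceptions handled inside the adapter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.LongPredicate;

// Illustrative sketch only: no Spring types, so it compiles on its own.
final class JobStopper {
    // Hypothetical adapter: job name -> IDs of its running executions
    // (could wrap JobExplorer#findRunningJobExecutions).
    private final Function<String, Set<Long>> runningExecutionIds;
    // Hypothetical adapter: returns true if the stop request was accepted
    // (could wrap JobOperator#stop).
    private final LongPredicate stopRequest;

    JobStopper(Function<String, Set<Long>> runningExecutionIds, LongPredicate stopRequest) {
        this.runningExecutionIds = runningExecutionIds;
        this.stopRequest = stopRequest;
    }

    // Requests a stop for every running execution of jobName and returns
    // the execution IDs for which the request was accepted.
    List<Long> stopAll(String jobName) {
        List<Long> accepted = new ArrayList<>();
        for (long executionId : runningExecutionIds.apply(jobName)) {
            if (stopRequest.test(executionId)) {
                accepted.add(executionId);
            }
        }
        return accepted;
    }
}
```

Because the stop is only a request recorded against the execution, the caller should not assume the job has already ended when `stopAll` returns.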

Batch Processing on Kubernetes

Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea? How do we prevent batch jobs from processing the same data if we use the Kubernetes auto-scaling feature? Thank you.
Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea?
For Spring Batch, we (the Spring Batch team) do have some experience on the matter which we share in the following talks:
Cloud Native Batch Processing on Kubernetes, by Michael Minella
Spring Batch on Kubernetes, by me.
Running batch jobs on Kubernetes can be tricky:
pods may be re-scheduled by k8s on different nodes in the middle of processing
cron jobs might be triggered twice
etc
This requires additional non-trivial work on the developer's side to make sure the batch application is fault-tolerant (resilient to node failure, pod re-scheduling, etc) and safe against duplicate job execution in a clustered environment.
Spring Batch takes care of this additional work for you and can be a good choice to run batch workloads on k8s for several reasons:
Cost efficiency: Spring Batch jobs maintain their state in an external database, which makes it possible to restart them from the last save point in case of job/node failure or pod re-scheduling
Robustness: Safe against duplicate job executions thanks to a centralized job repository
Fault-tolerance: Retry/Skip failed items in case of transient errors like a call to a web service that might be temporarily down or being re-scheduled in a cloud environment
I wrote a blog post in which I explain all these aspects in detail with code examples. You can find it here: Spring Batch on Kubernetes: Efficient batch processing at scale
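The restart-from-last-save-point idea can be illustrated with a small plain-Java sketch. This is not how Spring Batch is implemented: the `checkpointStore` map is a hypothetical stand-in for the external job repository, and `CheckpointedJob` and its methods are made up for illustration. It only shows why state kept outside the process lets a second run pick up where a failed run stopped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a Map stands in for the external job repository.
final class CheckpointedJob {
    // Hypothetical store: job name -> index of the next item to process.
    private final Map<String, Integer> checkpointStore;
    private final List<String> processed = new ArrayList<>();

    CheckpointedJob(Map<String, Integer> checkpointStore) {
        this.checkpointStore = checkpointStore;
    }

    // Processes items starting at the saved checkpoint. If failAt >= 0, the run
    // aborts before processing that index, simulating a pod being killed.
    void run(String jobName, List<String> items, int failAt) {
        int start = checkpointStore.getOrDefault(jobName, 0);
        for (int i = start; i < items.size(); i++) {
            if (i == failAt) {
                throw new IllegalStateException("simulated failure at item " + i);
            }
            processed.add(items.get(i));
            // Save progress after each item so that a restart resumes from here.
            checkpointStore.put(jobName, i + 1);
        }
    }

    // Items handled by this particular run (not by earlier runs).
    List<String> processed() {
        return processed;
    }
}
```

A first run that fails at index 2 leaves the checkpoint at 2, so a second `CheckpointedJob` sharing the same store processes only the remaining items.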
How do we prevent batch jobs from processing the same data if we use the Kubernetes auto-scaling feature?
Making each job process a different data set is the way to go (a job per file, for example). There are also other patterns that might interest you; see Job Patterns in the Kubernetes docs.
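One way to give each job instance its own data set is to assign input files by worker index. A minimal sketch, assuming every worker sees the same ordered file list and knows its own zero-based index and the total worker count (Kubernetes Indexed Jobs, for instance, expose the index through the JOB_COMPLETION_INDEX environment variable). `FileAssignment` and `filesFor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: static assignment of files to workers by index.
final class FileAssignment {
    private FileAssignment() {}

    // Returns the files owned by the worker at workerIndex (0-based).
    // File i goes to worker i % workerCount, so no file has two owners.
    static List<String> filesFor(List<String> sortedFiles, int workerIndex, int workerCount) {
        if (workerCount <= 0 || workerIndex < 0 || workerIndex >= workerCount) {
            throw new IllegalArgumentException("invalid worker index or count");
        }
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < sortedFiles.size(); i++) {
            if (i % workerCount == workerIndex) {
                mine.add(sortedFiles.get(i));
            }
        }
        return mine;
    }
}
```

This static split only holds while the worker count stays fixed for the run. If the number of workers changes mid-run, a shared work queue is probably a better fit than index arithmetic.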

Use Spring Batch for a non-scheduled process?

I have been working with Spring Batch for a few months, but a few days ago a doubt came up. I have to process a file and then update a database from it, but this is not a scheduled batch process because it has to be executed just once.
Is Spring Batch recommended for non-scheduled processes like this one? Or does the fact that it is not scheduled have nothing to do with whether to use Spring Batch?
Thanks
Is Spring Batch recommended for non-scheduled processes like this one? Or does the fact that it is not scheduled have nothing to do with whether to use Spring Batch?
Yes, the fact that your job has to be executed only once has nothing to do with using Spring Batch or not. There is a difference between developing the job (using Spring Batch or not) and scheduling the job (using cron, quartz, etc).
For your use case (process a file and then update a database), I would recommend using Spring Batch to develop your job. Then, you can choose to run it:
only once or on demand (Spring Batch provides APIs to run the job)
or schedule it to run repeatedly using your favourite scheduler
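The separation between developing a job and scheduling it can be shown with plain JDK classes. This sketch does not use Spring Batch: `FileImportJob` and `JobRunner` are hypothetical names, and `ScheduledExecutorService` merely stands in for whichever scheduler you prefer. The point is that the same job object is run once on demand and then handed, unchanged, to a scheduler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical job: the logic knows nothing about when or how often it runs.
final class FileImportJob implements Runnable {
    private final AtomicInteger runs = new AtomicInteger();

    @Override
    public void run() {
        // Real work (read the file, update the database) would go here.
        runs.incrementAndGet();
    }

    int runCount() {
        return runs.get();
    }
}

final class JobRunner {
    private JobRunner() {}

    // On-demand execution: just call the job.
    static void runOnce(Runnable job) {
        job.run();
    }

    // Repeated execution: the same job handed to a scheduler. Blocks until the
    // job has run `times` times or the timeout elapses; returns true on success.
    static boolean runRepeatedly(Runnable job, int times, long periodMillis, long timeoutMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(times);
        try {
            scheduler.scheduleAtFixedRate(() -> {
                // Single scheduler thread, so this check-then-run cannot race.
                if (remaining.getCount() > 0) {
                    job.run();
                    remaining.countDown();
                }
            }, 0, periodMillis, TimeUnit.MILLISECONDS);
            return remaining.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

Calling `JobRunner.runOnce(job)` covers the one-off case; `JobRunner.runRepeatedly(job, 3, 10, 5000)` runs the same object three times on a 10 ms period.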

Spring Batch job scheduling best practice

We have a bunch of Spring Batch jobs and we need to invoke them in a specific order. Is there any best practice we should follow? I was thinking of using AutoSys or a cron scheduler to check the status of each job and decide whether to invoke the next one, but I am open to other suggestions.
The approach sounds right, though it is harder to build something like this in cron. Scheduler tools like AutoSys or Control-M usually provide this orchestration feature out of the box.
I have used cron to schedule Spring Batch jobs. I had to schedule around 3 main jobs, about 6 jobs in all.
I had the same scenario, where the next job depends on the first.
In that case, you can query the Spring Batch meta-data tables to check whether the previous job has completed.
You will find the batch tables details here - http://docs.spring.io/spring-batch/reference/html/metaDataSchema.html
The tables are -
BATCH_JOB_INSTANCE
BATCH_JOB_EXECUTION
BATCH_JOB_EXECUTION_CONTEXT
BATCH_STEP_EXECUTION
BATCH_STEP_EXECUTION_CONTEXT
With these tables, it is easy to chain the jobs from cron. Still, managing jobs in cron is quite a pain.
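The status check against the meta-data tables could look roughly like this with plain JDBC. It is a sketch under assumptions: the tables use the default BATCH_ prefix, the highest JOB_EXECUTION_ID is taken as the most recent execution, and `PreviousJobCheck`, `latestStatus` and `mayStartNext` are hypothetical names. A wrapper script started by cron could call it and launch the next job only when it returns true.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Optional;

// Illustrative sketch only: reads job status straight from the meta-data tables.
final class PreviousJobCheck {
    private PreviousJobCheck() {}

    // Job names live in BATCH_JOB_INSTANCE, statuses in BATCH_JOB_EXECUTION.
    // Assumption: a higher JOB_EXECUTION_ID means a more recent execution.
    static final String LATEST_STATUS_SQL =
            "SELECT e.STATUS FROM BATCH_JOB_EXECUTION e "
          + "JOIN BATCH_JOB_INSTANCE i ON e.JOB_INSTANCE_ID = i.JOB_INSTANCE_ID "
          + "WHERE i.JOB_NAME = ? ORDER BY e.JOB_EXECUTION_ID DESC";

    // Reads the STATUS of the most recent execution of jobName, if there is one.
    static Optional<String> latestStatus(Connection connection, String jobName) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(LATEST_STATUS_SQL)) {
            statement.setString(1, jobName);
            statement.setMaxRows(1); // only the newest row is needed
            try (ResultSet rows = statement.executeQuery()) {
                return rows.next() ? Optional.ofNullable(rows.getString(1)) : Optional.empty();
            }
        }
    }

    // The next job may start only if the previous one finished successfully.
    static boolean mayStartNext(Optional<String> previousStatus) {
        return previousStatus.map("COMPLETED"::equals).orElse(false);
    }
}
```

Treating a missing execution as "do not start" is a deliberate choice in this sketch: if the previous job never ran, the next one stays blocked rather than running on incomplete input.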
To use a scheduler tool you need to configure it first, which takes a good amount of time. But once the scheduler tool is up, it is easy to schedule and manage jobs.
In most cases, scheduling is a one-time activity, so I guess it is better not to waste time on a scheduler tool and to go with cron instead.