ETL with Spring Batch and Spring Cloud Data Flow (SCDF)

We have a use case where data can be sourced from different sources (DB, file, etc.), transformed, and stored to various sinks (Cassandra, DB, or file). We would want the ability to split the jobs and do parallel loads - it looks like Spring Batch Remote Chunking provides that ability.
I am new to SCDF and Spring Batch and am wondering about the best way to use them.
Is there a way to provide configuration for these jobs (source connection details, table, and query), and can this be done through a UI (the SCDF Server UI)? Is it possible to compose the flow?
This will run on Kubernetes, and our applications are deployed through a Jenkins pipeline.

We would want the ability to split the jobs and do parallel loads - it looks like Spring Batch Remote Chunking provides that ability.
I don't think you need remote chunking; you can instead run parallel jobs, where each job handles one ETL process (for a particular file or DB table, for instance).
Is there a way to provide configuration for these jobs (source connection details, table, and query)?
Yes, those can be configured like any regular Spring Batch job is configured.
and can this be done through a UI (the SCDF Server UI)?
If you make them configurable through properties of your job, you can specify them through the UI when you run the task.
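For illustration, here is a minimal sketch of a job component whose source query is read from a property, so it can be overridden from the SCDF UI at launch time. This assumes a JDBC source; the etl.source.query key and the bean names are illustrative, not fixed names:

```java
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
public class SourceConfig {

    // The SQL is injected from a property, so it can be set per launch in the
    // SCDF UI (e.g. as app.<app-name>.etl.source.query=SELECT ...).
    @Bean
    public JdbcCursorItemReader<Map<String, Object>> sourceReader(
            DataSource dataSource,
            @Value("${etl.source.query}") String query) {
        return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                .name("sourceReader")
                .dataSource(dataSource)
                .sql(query)
                .rowMapper(new ColumnMapRowMapper())
                .build();
    }
}
```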
Is it possible to compose the flow?
Yes, this is possible with a Composed Task; for example, a composed-task definition like task1 && task2 runs the two tasks in sequence.

Related

Is it feasible to use one Spring Cloud Dataflow UI for multiple batch applications?

In the world of microservices, we would have multiple applications, each with its own independent datasource and a corresponding batch application in its bounded context. That said, since SCDF requires a datasource to be configured to bring it up to monitor batch jobs, is it possible to configure a single, central SCDF server and UI to monitor all the batch jobs of the different microservices (each with its own DB), while keeping the Spring Batch metadata tables alongside each application's business tables? Asking this because it might look very clumsy, untidy, and unmaintainable to keep so many SCDF servers running in the environment (please correct me if this concern is misplaced).
Please bring me some clarity on this query. Thanks in advance.
Yes.
Set up and install a single SCDF server with an associated DB. For each of your task/batch apps,
override the TaskConfigurer and BatchConfigurer to accept a second datasource that refers to the SCDF DB, as shown here: https://github.com/spring-cloud/spring-cloud-task/tree/main/spring-cloud-task-samples/multiple-datasources.
Thus the batch and task apps will report their state to the SCDF DB while still using their own DB for the work desired.
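A minimal sketch of that override, assuming Spring Batch 4.x and Spring Cloud Task 2.x; the scdf.datasource property prefix and the bean names are illustrative, not taken from the linked sample:

```java
import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.BatchConfigurer;
import org.springframework.batch.core.configuration.annotation.DefaultBatchConfigurer;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.boot.jdbc.DataSourceBuilder;
import org.springframework.cloud.task.configuration.DefaultTaskConfigurer;
import org.springframework.cloud.task.configuration.TaskConfigurer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ScdfMetadataConfig {

    // Second datasource pointing at the shared SCDF metadata DB; the app's own
    // (primary) datasource for business data is configured as usual elsewhere.
    @Bean
    @ConfigurationProperties("scdf.datasource")
    public DataSource scdfDataSource() {
        return DataSourceBuilder.create().build();
    }

    // Spring Cloud Task writes its execution records to the SCDF DB...
    @Bean
    public TaskConfigurer taskConfigurer(@Qualifier("scdfDataSource") DataSource scdfDataSource) {
        return new DefaultTaskConfigurer(scdfDataSource);
    }

    // ...and Spring Batch keeps its metadata tables there too, so the single
    // SCDF server can monitor every app.
    @Bean
    public BatchConfigurer batchConfigurer(@Qualifier("scdfDataSource") DataSource scdfDataSource) {
        return new DefaultBatchConfigurer(scdfDataSource);
    }
}
```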

How to see the details of Spring Batch jobs in a UI?

I am using Spring Batch, and my org is not willing to use Spring Cloud Data Flow. Is there any way we can create a UI that shows the details of batch jobs and also somehow restarts a batch job?

Is it possible to register jobs from existing batch metadata tables

We're trying to create a UI screen which will be able to trigger Spring Batch jobs and use our existing database of job execution records.
I was able to get all existing job names via jobExplorer, but now I get an error on jobRegistry.getJob(jobName). It seems the jobs are not registered in the jobRegistry.
The actual configuration of the jobs lives in another application. I am trying to trigger the jobs from a separate application (used solely for batch-related functions - runJob, stopJob, view executions, etc.).
EDIT:
Is it possible to register the jobs in the JobRegistry from the existing batch metadata tables? What I mean by this is that the job and step beans would be recreated from the existing metadata tables.
The workaround we went with is that execution records can be retrieved using the metadata tables, but the runJob and stopJob functions have to be redirected to endpoints exposed by the batch job applications themselves; a sketch of the read side follows.
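A rough sketch of the read-only half of that workaround: JobExplorer can read execution records straight from the shared metadata tables, even though the job beans themselves live in another application. Class and method usage here are standard Spring Batch APIs; the service name is illustrative:

```java
import java.util.List;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.explore.JobExplorer;

public class ExecutionHistoryService {

    private final JobExplorer jobExplorer;

    public ExecutionHistoryService(JobExplorer jobExplorer) {
        this.jobExplorer = jobExplorer;
    }

    // List recent executions for a job name found via jobExplorer.getJobNames()
    public void printRecentExecutions(String jobName) {
        List<JobInstance> instances = jobExplorer.getJobInstances(jobName, 0, 10);
        for (JobInstance instance : instances) {
            for (JobExecution execution : jobExplorer.getJobExecutions(instance)) {
                System.out.println(jobName + " #" + execution.getId()
                        + " -> " + execution.getStatus());
            }
        }
    }
}
```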

Batch Processing on Kubernetes

Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea? How do you prevent batch processing from processing the same data if you use the Kubernetes auto-scaling feature? Thank you.
Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea?
For Spring Batch, we (the Spring Batch team) do have some experience on the matter which we share in the following talks:
Cloud Native Batch Processing on Kubernetes, by Michael Minella
Spring Batch on Kubernetes, by me.
Running batch jobs on Kubernetes can be tricky:
pods may be re-scheduled by k8s onto different nodes in the middle of processing
cron jobs might be triggered twice
etc.
This requires additional non-trivial work on the developer's side to make sure the batch application is fault-tolerant (resilient to node failure, pod re-scheduling, etc) and safe against duplicate job execution in a clustered environment.
Spring Batch takes care of this additional work for you and can be a good choice to run batch workloads on k8s for several reasons:
Cost efficiency: Spring Batch jobs maintain their state in an external database, which makes it possible to restart them from the last save point in case of job/node failure or pod re-scheduling
Robustness: Safe against duplicate job executions thanks to a centralized job repository
Fault-tolerance: Retry/Skip failed items in case of transient errors like a call to a web service that might be temporarily down or being re-scheduled in a cloud environment
I wrote a blog post in which I explain all these aspects in detail with code examples. You can find it here: Spring Batch on Kubernetes: Efficient batch processing at scale
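To make the retry/skip point above concrete, here is a minimal sketch of a fault-tolerant chunk-oriented step in the Spring Batch 4.x builder style; the reader/writer beans and the choice of exception type are assumptions for illustration:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.ResourceAccessException;

@Configuration
public class FaultTolerantStepConfig {

    @Bean
    public Step faultTolerantStep(StepBuilderFactory steps,
                                  ItemReader<String> reader,
                                  ItemWriter<String> writer) {
        return steps.get("faultTolerantStep")
                .<String, String>chunk(100)
                .reader(reader)
                .writer(writer)
                .faultTolerant()
                .retry(ResourceAccessException.class)  // retry transient web-service failures
                .retryLimit(3)
                .skip(ResourceAccessException.class)   // skip items that keep failing
                .skipLimit(10)
                .build();
    }
}
```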
How do you prevent batch processing from processing the same data if you use the Kubernetes auto-scaling feature?
Making each job process a different data set is the way to go (a job per file, for example; see the sketch below). But there are other patterns that you might be interested in; see Job Patterns in the k8s docs.
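A rough sketch of the job-per-file pattern, using the input file as an identifying job parameter so that each file maps to exactly one job instance (the class and parameter names are illustrative):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class PerFileJobRunner {

    private final JobLauncher jobLauncher;
    private final Job etlJob;

    public PerFileJobRunner(JobLauncher jobLauncher, Job etlJob) {
        this.jobLauncher = jobLauncher;
        this.etlJob = etlJob;
    }

    public void runForFile(String inputFile) throws Exception {
        // "input.file" is an identifying parameter: each file gets its own job
        // instance, and re-launching with the same file restarts that instance
        // instead of duplicating the work.
        JobParameters params = new JobParametersBuilder()
                .addString("input.file", inputFile)
                .toJobParameters();
        jobLauncher.run(etlJob, params);
    }
}
```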

How to use Spring Cloud Dataflow to get Spring Batch Status

I have been using Spring Batch, and my metadata is in DB2. I have been using the Spring Batch Admin API (jars) to look at the current status of various jobs and to get details about a job, like the number of items read, the commit count, etc. Now that Spring Batch Admin has been moved to Spring Cloud Data Flow, how do I look at this information? Is there a good API set I could use?
Basically, in Spring Cloud Data Flow, you first need to create a Spring Cloud Task that wraps your batch application (see the batch-job example in the spring-cloud-task-samples repository).
With the help of Spring Cloud Task's @EnableTaskLauncher you can get the current status of a job, run the job, stop the job, etc.
You need to send a TaskLaunchRequest for it.
See the TaskLauncher APIs.
Edit:
To get the Spring Batch status, you first need the task execution id of the Spring Cloud Task; then use the Set<Long> getJobExecutionIdsByTaskExecutionId(long taskExecutionId) method of TaskExplorer.
See TaskExplorer for all the APIs. With it, use JobExplorer to get the status of the jobs, as in the sketch below.
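A rough sketch of that combination; the wiring and class name are illustrative, while getJobExecutionIdsByTaskExecutionId is the TaskExplorer method quoted above:

```java
import java.util.Set;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.cloud.task.repository.TaskExplorer;

public class JobStatusService {

    private final TaskExplorer taskExplorer;
    private final JobExplorer jobExplorer;

    public JobStatusService(TaskExplorer taskExplorer, JobExplorer jobExplorer) {
        this.taskExplorer = taskExplorer;
        this.jobExplorer = jobExplorer;
    }

    // Resolve the batch job executions linked to a task execution and report their status
    public void printJobStatuses(long taskExecutionId) {
        Set<Long> jobExecutionIds =
                taskExplorer.getJobExecutionIdsByTaskExecutionId(taskExecutionId);
        for (Long jobExecutionId : jobExecutionIds) {
            JobExecution jobExecution = jobExplorer.getJobExecution(jobExecutionId);
            if (jobExecution != null) {
                System.out.println(jobExecution.getJobInstance().getJobName()
                        + " -> " + jobExecution.getStatus());
            }
        }
    }
}
```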