Is it feasible to use one Spring Cloud Dataflow UI for multiple batch applications? - spring-batch

In the world of microservices, we have multiple applications, each with its own independent datasource and corresponding batch applications within its bounded context. That said, since SCDF requires a datasource to be configured before it can start and monitor batch jobs, is it possible to configure a single, central SCDF server and UI to monitor all the batch jobs of the different microservices (each with its own DB), while keeping the Spring Batch metadata tables alongside each application's business tables? I am asking because it seems clumsy, untidy, and unmaintainable to keep so many SCDF servers running in the environment (please correct me if this concern is unfounded).
Please bring me some clarity on this query. Thanks in advance.

Yes
Set up and install a single SCDF server with an associated DB. For each of your task/batch apps, override the TaskConfigurer and BatchConfigurer to accept a second datasource that refers to the SCDF DB, as shown here: https://github.com/spring-cloud/spring-cloud-task/tree/main/spring-cloud-task-samples/multiple-datasources.
Thus the batch and task apps will report their state to the SCDF DB while still using their own DB for their actual work.

Related

Spring Batch and multi-instance Kubernetes Application with ONE Database

I do not fully understand whether Spring Batch works fine in a multi-instance Kubernetes application. I read Batch Processing on Kubernetes, so I understand that in general it works fine, but the answer does not mention whether it works in a multi-instance installation using ONE database.
Our setup looks like this: we have multiple instances of the application running in Kubernetes, and they share ONE database. Some jobs are triggered by user interaction (in whichever of the many pods answers the request from the UI), and some are triggered by a Kubernetes cronjob (e.g. data reorg) (in whichever of the many pods answers the REST request from the cronjob). All pods contain the identical application.
Does this setup work fine with Spring Batch?
Thanks for your help :-)
As far as Spring Batch is concerned, all these things are deployment details. It is up to you to design your jobs and job instances with that in mind. This is what I explained in detail in the "Choosing the right Spring Batch job parameters" and "Choosing the Right Kubernetes Job Deployment Pattern" sections. Note that this blog post is linked in the answer you shared.
What Spring Batch guarantees, thanks to the centralized job repository design (which is what you are referring to as "ONE database"), is preventing duplicate and concurrent job executions of the same job instance.
So the answer to your question is yes, as long as you choose the right deployment pattern for your Spring Batch jobs and Kubernetes jobs.
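The guarantee above depends on how a job instance is identified: in Spring Batch, a job instance is the job name plus its identifying job parameters. Below is a toy, in-memory model of that idea in plain Java. It is not Spring Batch's implementation (which relies on the shared job repository database), and it only covers the "already running" case. The names JobInstanceGuard, tryStart, and finish are hypothetical and used only for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class JobInstanceGuard {
    // Keys of job instances that currently have a running execution.
    private final Set<String> running = ConcurrentHashMap.newKeySet();

    // Build a stable key from the job name and its identifying parameters (sorted by name).
    static String instanceKey(String jobName, Map<String, String> params) {
        return jobName + new TreeMap<>(params);
    }

    // Returns true if the caller may start the execution, false if one is already running.
    public boolean tryStart(String jobName, Map<String, String> params) {
        return running.add(instanceKey(jobName, params));
    }

    // Marks the execution of this job instance as no longer running.
    public void finish(String jobName, Map<String, String> params) {
        running.remove(instanceKey(jobName, params));
    }

    public static void main(String[] args) {
        JobInstanceGuard guard = new JobInstanceGuard();
        Map<String, String> monday = Map.of("date", "2024-01-01");
        Map<String, String> tuesday = Map.of("date", "2024-01-02");
        System.out.println(guard.tryStart("reorg", monday));  // true: first pod wins
        System.out.println(guard.tryStart("reorg", monday));  // false: same instance is already running
        System.out.println(guard.tryStart("reorg", tuesday)); // true: different parameters, different instance
    }
}
```

The point of the sketch is that two pods launching the same job with the same identifying parameters compete for one instance, while different parameters (for example a different date) produce a separate instance.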

Batch Processing on Kubernetes

Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea? How can we prevent batch jobs from processing the same data if we use the Kubernetes auto-scaling feature? Thank you.
Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea?
For Spring Batch, we (the Spring Batch team) do have some experience on the matter which we share in the following talks:
Cloud Native Batch Processing on Kubernetes, by Michael Minella
Spring Batch on Kubernetes, by me.
Running batch jobs on Kubernetes can be tricky:
pods may be re-scheduled by k8s on different nodes in the middle of processing
cron jobs might be triggered twice
etc
This requires additional non-trivial work on the developer's side to make sure the batch application is fault-tolerant (resilient to node failure, pod re-scheduling, etc) and safe against duplicate job execution in a clustered environment.
Spring Batch takes care of this additional work for you and can be a good choice to run batch workloads on k8s for several reasons:
Cost efficiency: Spring Batch jobs maintain their state in an external database, which makes it possible to restart them from the last save point in case of job/node failure or pod re-scheduling
Robustness: Safe against duplicate job executions thanks to a centralized job repository
Fault-tolerance: Retry/Skip failed items in case of transient errors like a call to a web service that might be temporarily down or being re-scheduled in a cloud environment
I wrote a blog post in which I explain all these aspects in detail with code examples. You can find it here: Spring Batch on Kubernetes: Efficient batch processing at scale
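To make the retry/skip idea from the list above concrete, here is a minimal plain-Java sketch. In Spring Batch this behaviour is configured on a fault-tolerant step rather than hand-written, so the class RetrySkipSketch, the process method, and the flaky service below are hypothetical stand-ins. The sketch also assumes every RuntimeException is transient.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class RetrySkipSketch {
    // Outcome of processing a list of items: successful outputs and skipped inputs.
    static final class Result<I, O> {
        final List<O> processed = new ArrayList<>();
        final List<I> skipped = new ArrayList<>();
    }

    // Try each item up to maxAttempts times; skip it if every attempt fails.
    static <I, O> Result<I, O> process(List<I> items, Function<I, O> processor, int maxAttempts) {
        Result<I, O> result = new Result<>();
        for (I item : items) {
            boolean done = false;
            for (int attempt = 1; attempt <= maxAttempts && !done; attempt++) {
                try {
                    result.processed.add(processor.apply(item));
                    done = true;
                } catch (RuntimeException e) {
                    // Assumed transient: loop around and retry.
                }
            }
            if (!done) {
                result.skipped.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] callsForB = {0};
        // Hypothetical flaky service: "b" fails on its first call only, "c" always fails.
        Function<String, String> flaky = s -> {
            if (s.equals("c") || (s.equals("b") && callsForB[0]++ == 0)) {
                throw new IllegalStateException("temporarily unavailable");
            }
            return s.toUpperCase();
        };
        Result<String, String> r = process(List.of("a", "b", "c"), flaky, 3);
        System.out.println(r.processed + " skipped=" + r.skipped); // [A, B] skipped=[c]
    }
}
```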
How can we prevent batch jobs from processing the same data if we use the Kubernetes auto-scaling feature?
Making each job process a different data set is the way to go (a job per file for example). But there are different patterns that you might be interested in, see Job Patterns from k8s docs.

Spring Cloud Data Flow: versioned streams

I'm implementing a stream pipe with Spring Cloud Data Flow.
My problem is that I configured the pipe MANUALLY (e.g. http | log_sink) in the server, and it will be lost if I reset that server (think of an Amazon EC2 instance that can be hard reset).
What is the suggested way to version pipes using SCDF?
Thanks.
I am summarizing the discussion from the comments.
To automate the promotion of Stream/Task workloads from lower to higher-level environments, the recommended approach would be the use of SCDF's Java DSL. With this, users can programmatically register, create, deploy, or launch stream/task in a repeatable manner and across many different platforms simultaneously (if there's a need for it). The Boot App built with the Java DSL can be versioned in Git, and it can be CD/GitOps friendly. With sufficient generalization to this App, it can also be reused by many different teams by overriding the defaults.
We use this in the product itself for our IT and acceptance tests, which run on every upstream commit daily across multiple Kubernetes and Cloud Foundry installations.
Alternatively, all of the register, create, deploy, or launch stream/task commands can also be dumped into a text or property file. Once you have the file, the dataflow:>script --file command can help slurp in all the commands in each of the new environments (see docs).

ETL Spring batch, Spring cloud data flow (SCDF)

We have a use case where data can be sourced from different sources (DB, file, etc.), transformed, and stored to various sinks (Cassandra, DB, or file). We would want the ability to split the jobs and do parallel loads - looks like Spring Batch RemoteChunking provides that ability.
I am new to SCDF and Spring Batch and am wondering what the best way to use them is.
Is there a way to provide configuration for these jobs (source connection details, table and query) and can this be done through a UI (SCDF Server UI)? Is it possible to compose the flow?
This will run on Kubernetes and our applications are deployed through Jenkins pipeline.
We would want the ability to split the jobs and do parallel loads - looks like Spring Batch RemoteChunking provides that ability.
I don't think you need remote chunking; you can instead run parallel jobs, where each job handles one ETL process (for a particular file or DB table).
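As a rough in-process analogy for "one independent job per source", the following plain-Java sketch runs one unit of work per file or table in parallel. In an SCDF setup these would more likely be separate task launches rather than threads in one JVM. ParallelEtlSketch, runEtl, and runAll are hypothetical names, and runEtl is only a placeholder for a real ETL job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelEtlSketch {
    // Placeholder for one complete ETL job over a single source (file or table).
    static String runEtl(String source) {
        return "loaded:" + source;
    }

    // Run one independent job per source and collect the results in input order.
    static List<String> runAll(List<String> sources, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Callable<String>> jobs = new ArrayList<>();
            for (String source : sources) {
                jobs.add(() -> runEtl(source));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until all jobs finish and returns futures in submission order.
            for (Future<String> future : pool.invokeAll(jobs)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(List.of("customers.csv", "orders"), 2));
    }
}
```

Because each job owns exactly one source, no coordination between jobs is needed, which is the main reason this is simpler than remote chunking.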
Is there a way to provide configuration for these jobs (source connection details, table and query)
Yes, those can be configured like any regular Spring Batch job is configured.
and can this be done through a UI (SCDF Server UI)?
If you make them configurable through properties of your job, you can specify them through the UI when you run the task.
Is it possible to compose the flow?
Yes, this is possible with Composed Task.

Spring Batch and Pivotal Cloud Foundry [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
We are evaluating the Spring Batch framework to replace our home-grown batch framework in our organization, and we need to be able to deploy the batch jobs in Pivotal Cloud Foundry (PCF). In this regard, can you let us know your thoughts on the issues below:
Let us say we use the Remote Partitioning strategy to process a large volume of records: can the batch job auto-scale Slave nodes in the cloud based on the amount of data that the batch job processes? Or do we have to scale an appropriate number of Slave nodes and keep them in place before the batch job kicks off?
How does the "grid size" parameter configuration work in the scenario above?
You have a few questions here. However, before getting into them, let me take a minute and walk through where batch processing is on PCF right now and then get to your questions.
Current state of CF
As of PCF 1.6, Diego (the dynamic runtime within CF) provides a new primitive called Tasks. Traditionally, all applications running on CF were expected to be long-running processes. Because of this, in order to run a batch job on CF, you'd need to package it up as a long-running process (usually a web app) and then deploy that. If you wanted to use remote partitioning, you'd need to deploy and scale slaves as you saw fit, but it was all external to CF. With Tasks, Diego now supports short-lived processes, that is, processes that won't be restarted when they complete. This means that you can run a batch job as a Spring Boot über jar and, once it completes, CF won't try to restart it (that's a good thing). The issue with 1.6 is that an API exposing Tasks was not available, so it was only an internal construct.
With PCF 1.7, a new API is being released to expose Tasks for general use. As part of the v3 API, you'll be able to deploy your own apps as Tasks. This allows you to launch a batch job as a task knowing it will execute, then be cleaned up by PCF. With that in mind...
Can the batch job auto-scale Slave nodes in the cloud based on the amount of data that the batch job processes?
When using Spring Batch's partitioning capabilities, there are two key components: the Partitioner and the PartitionHandler. The Partitioner is responsible for understanding the data and how it can be divided up. The PartitionHandler is responsible for understanding the fabric in which to distribute the partitions to the slaves.
For Spring Cloud Data Flow, we plan on creating a PartitionHandler implementation that will allow users to execute slave partitions as Tasks on CF. Essentially, what we'd expect is that the PartitionHandler would launch the slaves as tasks and once they are complete, they would be cleaned up.
This approach allows the number of slaves to be dynamically launched based on the number of partitions (configurable to a max).
We plan on doing this work for Spring Cloud Data Flow but the PartitionHandler should be available for users outside of that workflow as well.
How does the "grid size" parameter configuration work in the scenario above?
The grid size parameter is really used by the Partitioner and not the PartitionHandler and is intended to be a hint on how many workers there may be. In this case, it could be used to configure how many partitions you want to create, but that really is up to the Partitioner implementation.
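To illustrate how a Partitioner might treat grid size as a hint, here is a plain-Java sketch that splits an id range into at most gridSize contiguous partitions. Spring Batch's Partitioner interface has a partition(int gridSize) method returning a map of partition names to execution contexts. This sketch models the execution context as a long[] {from, to} pair so it runs without Spring. RangePartitionerSketch and the id-range strategy are assumptions for illustration, not something prescribed by the answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RangePartitionerSketch {
    // Split the inclusive id range [min, max] into at most gridSize contiguous partitions.
    // Each value is {fromId, toId}; keys are "partition0", "partition1", and so on.
    static Map<String, long[]> partition(long min, long max, int gridSize) {
        if (gridSize <= 0) {
            throw new IllegalArgumentException("gridSize must be positive");
        }
        Map<String, long[]> partitions = new LinkedHashMap<>();
        long total = max - min + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        int index = 0;
        for (long from = min; from <= max; from += size) {
            long to = Math.min(from + size - 1, max);
            partitions.put("partition" + index++, new long[] {from, to});
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Prints: partition0 -> 1..4, partition1 -> 5..8, partition2 -> 9..10
        partition(1, 10, 3).forEach((name, range) ->
                System.out.println(name + " -> " + range[0] + ".." + range[1]));
    }
}
```

Note that when there are fewer ids than gridSize, fewer partitions come back, which matches the description of grid size as a hint rather than a fixed worker count.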
Conclusion
This is a description of what a batch workflow on CF would look like. It's important to note that CF 1.7 is not out as of the writing of this answer. It is scheduled to be out in Q1 of 2016, and this functionality will follow shortly afterwards.