Spring Batch architecture - spring-batch

Hi
I am a novice in the Spring Batch world. Over the last few days I have spent time watching Michael Minella's YouTube videos, reading some documentation, and successfully running some demo projects I found on the internet. I think Spring Batch is a strong candidate for our needs. But here is our story.
I work for a company that, more than a decade ago, developed its own scheduling and batch framework for its business department. The framework is capable of running DB stored procedures, DB functions, and dynamic SQL. Needless to say, it is very challenging to maintain, since too many people with varying development skills wrote the code and they no longer work here. Our framework can run jobs and steps sequentially as well as asynchronously (like Spring Batch). We also have a job repository where we store whole job definitions (users create new jobs via a GUI) and job instances with their context (in case the server goes down, it resumes the running job once it is back up).
My questions are the following:
Can we create new Spring Batch jobs dynamically (either via XML or code) and store them in the JobRepository DB via the standard Spring Batch interfaces?
Today, at certain times, we have up to a hundred job executions running simultaneously, all reusing a connection pool to the DB. The older Spring Batch reference documentation states that a JobFactory will create a fresh ApplicationContext for each job execution. How can we keep reusing connection pools if this is the case in Spring Batch?
I know there is support for continuing from failed steps, but what if the server/app goes down? Will I be able to restart my app and retrieve the job instance with its context from the JobRepository in order to continue from the failed step?
Can "step1.1" in "job1" depend on "step2.1" from "job2" having finished within the last hour? In such a scenario, could I use a step listener on "step1.1" to accomplish this?
Kind regards
Toto

You have a lot of material here to cover, so let me respond one point at a time:
Can we create new Spring Batch jobs dynamically (either via XML or code) and store them in the JobRepository DB via the standard Spring Batch interfaces?
Can you generate a job definition dynamically? Yes. We do it in Spring XD for the job orchestration piece (the composed job DSL is used to generate an XML file, for example).
Does Spring Batch provide facilities to do this? No. You'd have to code it yourself.
Also note that you'd have to store the definition in your own table (the schema defined by Spring Batch doesn't have a table for this).
Today, at certain times, we have up to a hundred job executions running simultaneously, all reusing a connection pool to the DB. The older Spring Batch reference documentation states that a JobFactory will create a fresh ApplicationContext for each job execution. How can we keep reusing connection pools if this is the case in Spring Batch?
You can use parent/child context configurations to reuse beans including a DataSource. Define the DataSource in the parent and then the jobs that depend on it in child contexts.
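As a minimal sketch of that idea (SharedInfrastructureConfig and Job1Config are hypothetical @Configuration classes, not anything prescribed by Spring Batch), the parent context holds the pooled DataSource and each job is registered in a child context that reuses it:

    import org.springframework.context.annotation.AnnotationConfigApplicationContext;

    // Sketch only: the parent holds the pooled DataSource and other shared beans;
    // each job gets its own child context but reuses the parent's beans.
    public class ContextHierarchyExample {

        public static AnnotationConfigApplicationContext createJobContext() {
            AnnotationConfigApplicationContext parent =
                    new AnnotationConfigApplicationContext(SharedInfrastructureConfig.class);

            AnnotationConfigApplicationContext jobContext = new AnnotationConfigApplicationContext();
            jobContext.setParent(parent);
            jobContext.register(Job1Config.class); // job, steps, readers/writers for one job
            jobContext.refresh();
            return jobContext;
        }
    }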
I know there is support for continuing from failed steps, but what if the server/app goes down? Will I be able to restart my app and retrieve the job instance with its context from the JobRepository in order to continue from the failed step?
This is really an orchestration concern. Spring Batch, by design, does not address the orchestration of jobs. This leaves you free to orchestrate them however you want.
The way I'd recommend handling this is via Spring XD or (depending on your timelines) Spring Cloud Data Flow. These tools provide orchestration capabilities, including the redeployment of a job if it goes down. That being said, they won't automatically restart a job that fails mid-run, because that typically requires some form of human decision based on the use case. However, Spring XD currently has (and Spring Cloud Data Flow will have) the capabilities to implement something like this in a pretty straightforward way.
Can a "step1.1" in "job1" be dependent on "step 2.1" from "job2" finishing within last hour? In such scenarios I may be using a step listener on "step1.1" to accomplish this?
In cases like this, I'd start to question how your job is configured. You can use a JobExecutionDecider to decide if a step should be executed or not if it still makes sense.
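A rough sketch of such a decider follows; the job/step names, the one-hour window, and the use of JobExplorer are illustrative assumptions rather than a prescribed approach:

    import java.util.Date;
    import java.util.List;

    import org.springframework.batch.core.BatchStatus;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.JobInstance;
    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.batch.core.job.flow.FlowExecutionStatus;
    import org.springframework.batch.core.job.flow.JobExecutionDecider;

    // Sketch only: returns "CONTINUE" when "step2.1" of "job2" completed within the
    // last hour, "SKIP" otherwise. Job and step names are illustrative.
    public class RecentDependencyDecider implements JobExecutionDecider {

        private final JobExplorer jobExplorer;

        public RecentDependencyDecider(JobExplorer jobExplorer) {
            this.jobExplorer = jobExplorer;
        }

        @Override
        public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
            Date oneHourAgo = new Date(System.currentTimeMillis() - 60 * 60 * 1000);
            List<JobInstance> instances = jobExplorer.getJobInstances("job2", 0, 10);
            for (JobInstance instance : instances) {
                for (JobExecution execution : jobExplorer.getJobExecutions(instance)) {
                    for (StepExecution step : execution.getStepExecutions()) {
                        if ("step2.1".equals(step.getStepName())
                                && step.getStatus() == BatchStatus.COMPLETED
                                && step.getEndTime() != null
                                && step.getEndTime().after(oneHourAgo)) {
                            return new FlowExecutionStatus("CONTINUE");
                        }
                    }
                }
            }
            return new FlowExecutionStatus("SKIP");
        }
    }

The decider would then be wired into the job's flow so that "step1.1" only runs on the "CONTINUE" outcome.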
All things considered, while you can accomplish most of what you're looking for with Spring Batch, using something like Spring XD or Spring Cloud Data Flow will make your life a lot easier.

Can we create new Spring Batch jobs dynamically (either via XML or code) and store them in the JobRepository DB via the standard Spring Batch interfaces?
It is easy to use StepBuilderFactory, FlowBuilder, etc. to programmatically build the Spring Batch artifacts. You'll probably want to back those artifacts with Spring beans (to get nice facilities like the step/job Spring scopes, injection, and so on), and for that you can use prototype-, step-, and job-scoped beans, or even use facilities such as BeanDefinitionBuilder to create bean definitions dynamically.
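For example, a minimal programmatic job definition using the Spring Batch 3/4-style builder factories might look like this (the factories are normally injected by Spring Batch, and the job/step names are placeholders):

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.repeat.RepeatStatus;

    // Sketch only: "dynamicJob" and "dynamicStep" are placeholder names.
    public class DynamicJobFactory {

        public Job buildJob(JobBuilderFactory jobs, StepBuilderFactory steps) {
            Step step = steps.get("dynamicStep")
                    .tasklet((contribution, chunkContext) -> {
                        // e.g. call a stored procedure or run a dynamic SQL here
                        return RepeatStatus.FINISHED;
                    })
                    .build();

            return jobs.get("dynamicJob")
                    .start(step)
                    .build();
        }
    }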
The older Spring Batch reference documentation states that a JobFactory will create a fresh ApplicationContext for each job execution. How can we keep reusing connection pools if this is the case in Spring Batch?
The GenericApplicationContextFactory creates a child application context. You can have the "global" beans in the parent application context.
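A rough sketch of how that could be wired up (SharedInfrastructureConfig and Job1Config are hypothetical configuration classes; the factory picks up its parent context through ApplicationContextAware):

    import org.springframework.batch.core.configuration.support.GenericApplicationContextFactory;
    import org.springframework.context.ConfigurableApplicationContext;
    import org.springframework.context.annotation.AnnotationConfigApplicationContext;

    // Sketch: the context given to the factory becomes the parent of every context
    // it creates, so each job context reuses the parent's DataSource and other
    // "global" beans instead of creating its own.
    public class JobContextFactoryExample {

        public static ConfigurableApplicationContext createJobContext() {
            ConfigurableApplicationContext parent =
                    new AnnotationConfigApplicationContext(SharedInfrastructureConfig.class);

            GenericApplicationContextFactory factory =
                    new GenericApplicationContextFactory(Job1Config.class);
            factory.setApplicationContext(parent);

            return factory.createApplicationContext();
        }
    }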
I know there is support for continuing from failed steps, but what if the server/app goes down? Will I be able to restart my app and retrieve the job instance with its context from the JobRepository in order to continue from the failed step?
Yes, but not that easily.
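To give an idea of what "not that easily" means in practice, a rough startup-recovery sketch could look like the following. The job name, bean wiring, and error handling are all omitted or made up, and stale step executions would need the same status cleanup:

    import java.util.Date;
    import java.util.Set;

    import org.springframework.batch.core.BatchStatus;
    import org.springframework.batch.core.ExitStatus;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.batch.core.launch.JobOperator;
    import org.springframework.batch.core.repository.JobRepository;

    // Sketch only: on startup, find executions that were still marked STARTED when
    // the server died, flag them as FAILED, and restart them from the failed step.
    public class CrashRecovery {

        public void recoverAfterCrash(JobExplorer jobExplorer, JobRepository jobRepository,
                                      JobOperator jobOperator) throws Exception {
            Set<JobExecution> stale = jobExplorer.findRunningJobExecutions("myJob"); // illustrative job name
            for (JobExecution execution : stale) {
                execution.setStatus(BatchStatus.FAILED);
                execution.setExitStatus(ExitStatus.FAILED);
                execution.setEndTime(new Date());
                jobRepository.update(execution);          // mark the abandoned execution as failed
                jobOperator.restart(execution.getId());   // resume from the last failed/incomplete step
            }
        }
    }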
Can a "step1.1" in "job1" be dependent on "step 2.1" from "job2" finishing within last hour? In such scenarios I may be using a step listener on "step1.1" to accomplish this?
A JobExecutionDecider will likely be the best option there.

Related

Understanding JobLauncherTestUtils

I am currently trying to understand JobLauncherTestUtils. I have read about it from multiple resources, such as the following:
https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/test/JobLauncherTestUtils.html
https://livebook.manning.com/concept/spring/joblaunchertestutils
I want to understand, when we call jobLauncherTestUtils.launchJob(), what is meant by end-to-end testing of a job. Does it actually launch the job? If so, what's the point of testing the job without mocks? If not, how does it actually test a job?
I want to understand, when we call jobLauncherTestUtils.launchJob(), what is meant by end-to-end testing of a job.
End-to-end testing means testing the job as a black box, based on the specification of its inputs and outputs. For example, let's assume your batch job is expected to read data from a database table and write it to a flat file.
An end-to-end test would:
Populate a test database with some sample records
Run your job
Assert that the output file contains the expected records
Without individually testing the inner steps of this job, you are testing its functionality from end (input) to end (output).
JobLauncherTestUtils is a utility class that allows you to run an entire job like this. It also allows you to test a single step from a job in isolation if you want.
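For illustration, an end-to-end test using JobLauncherTestUtils might look roughly like this (MyJobConfig is a hypothetical configuration class defining the job under test, and @SpringBatchTest assumes Spring Batch 4.1+):

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.springframework.batch.core.BatchStatus;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.test.JobLauncherTestUtils;
    import org.springframework.batch.test.context.SpringBatchTest;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.test.context.ContextConfiguration;
    import org.springframework.test.context.junit4.SpringRunner;

    // Sketch only: runs the whole job and asserts on the outcome; further assertions
    // would check the actual output (e.g. the generated flat file).
    @RunWith(SpringRunner.class)
    @SpringBatchTest
    @ContextConfiguration(classes = MyJobConfig.class)
    public class MyJobEndToEndTest {

        @Autowired
        private JobLauncherTestUtils jobLauncherTestUtils;

        @Test
        public void jobCompletesAndProducesExpectedOutput() throws Exception {
            // given: test data has been loaded into the input table (omitted here)

            // when: run the whole job, exactly as a regular JobLauncher would
            JobExecution jobExecution = jobLauncherTestUtils.launchJob();

            // then: the job completed successfully
            assertEquals(BatchStatus.COMPLETED, jobExecution.getStatus());
        }
    }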
Does it actually launch the job?
Yes, the job will be run just as if it were run outside a test. JobLauncherTestUtils is just a utility class that uses a regular JobLauncher behind the scenes. You can run your job in unit tests without this utility class.
If so, what's the point of testing the job without mocks?
The point of testing a job without mocks is to ensure the job works as expected with the real resources it depends on or interacts with. You can always mock a database or a message broker in your tests, but the mocking code could itself be buggy and may not reflect the real behaviour of a database or a message broker.

How should I slice and orchestrate a configurable batch network using Spring Batch and Spring Cloud Data Flow?

We would like to migrate the scheduling and sequence control of some Kettle import jobs from a proprietary implementation to a good-practice implementation based on Spring Batch.
I intend to use Spring Cloud Data Flow (SCDF) server to implement and run a configurable sequence of the existing external import jobs.
The SCDF console's Task editor UI seems promising for assembling a flow. So one task wraps one Spring Batch job, which in a single step only executes a Tasklet that starts and polls the Carte REST API. Does this make sense so far?
Would you suggest a better implementation?
Constraints and Requirements:
The external Kettle jobs are triggered and polled using the Carte REST API. Actually, it is a single Kettle job implementation, called with individual parameters for each entity to be imported.
There is a configurable, directed graph of import jobs for several entities, some of them being dependent on a correct import of the previous entity type. (e.g. Department, then Employee, then Role assignments...)
With the upcoming implementation, we would like to get
monitoring and controlling (start, abort, pause, resume)
restartability
easy reconfigurability of the sequence in production (possibly by GUI, or external editor)
possibly some reporting and statistics.
As my current understanding, this could be achieved by using Spring Cloud Data Flow (SCDF) server, and some Task / Batch implementation / combination.
Correct me if I'm wrong, but a single Spring Batch job with its hard-wired flow does not seem very suitable to me. Or is there an easy way to edit and redeploy a Spring Batch job with a changed flow in production? I couldn't find anything, not even an easy-to-use editor for the XML representation of a batch job.
Yes, I believe you can achieve your design goals using Spring Cloud Data Flow along with Spring Cloud Task/Spring Batch.
The flow of multiple Spring Batch jobs (using a composed task) can be managed using Spring Cloud Data Flow, as you pointed out in the other SO thread.
The external Kettle jobs are triggered and polled using the Carte REST API. Actually, it is a single Kettle job implementation, called with individual parameters for each entity to be imported.
There is a configurable, directed graph of import jobs for several entities, some of them being dependent on a correct import of the previous entity type. (e.g. Department, then Employee, then Role assignments...)
Again, both of the above can be managed as a composed task (with the composed task consisting of regular tasks as well as Spring Batch-based applications).
You can manage the parameters passed to each task/batch upon invocation via batch job parameters, task/batch application properties, or simply command-line arguments.
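Each of those Spring Batch applications could wrap the Carte interaction in a single-step Tasklet, roughly as sketched below. The Carte endpoints, status strings, and parameters are illustrative assumptions only; the exact REST API should be taken from the Carte documentation:

    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.web.client.RestTemplate;

    // Sketch only: starts a Kettle job through Carte and polls it until it finishes.
    // URLs, query parameters, and status checks below are placeholders.
    public class CarteJobTasklet implements Tasklet {

        private final RestTemplate rest = new RestTemplate();
        private final String carteBaseUrl;
        private final String kettleJobName;

        public CarteJobTasklet(String carteBaseUrl, String kettleJobName) {
            this.carteBaseUrl = carteBaseUrl;
            this.kettleJobName = kettleJobName;
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
            // start the Kettle job (illustrative endpoint)
            rest.getForObject(carteBaseUrl + "/kettle/executeJob/?job=" + kettleJobName, String.class);

            // poll its status until it is no longer running (illustrative endpoint and check)
            String status;
            do {
                Thread.sleep(5000);
                status = rest.getForObject(carteBaseUrl + "/kettle/jobStatus/?name=" + kettleJobName, String.class);
            } while (status != null && status.contains("Running"));

            if (status == null || !status.contains("Finished")) {
                throw new IllegalStateException("Kettle job did not finish successfully: " + kettleJobName);
            }
            return RepeatStatus.FINISHED;
        }
    }

Failing the Tasklet (as above) makes the wrapping step fail, which is what lets SCDF's composed task stop the downstream imports and gives you restartability at the entity level.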
With the upcoming implementation, we would like to get
monitoring and controlling (start, abort, pause, resume)
restartability
easy reconfigurability of the sequence in production (possibly by GUI, or external editor)
possibly some reporting and statistics.
Spring Cloud Data Flow helps you achieve these goals. You can visit the Task Developer Guide and the Task Monitoring Guide for more info.
You can also check the Batch developer guide from the site as well.

In spring batch, what is a scope of an ItemReader without scope="..."?

If I have a web application with an application context that loads everything for my webapp plus all my job configuration files, and if a job contains a simple ItemReader without scope="step", the reader is a singleton, right? So if I launch my job twice from a controller via a SimpleJobLauncher, I will use the same bean, right? Unless I put scope="step" in order to have one bean per job execution?
On the other hand, if I launch the job from a CommandLineJobRunner, I will have two distinct application contexts, and therefore two different beans, right?
Are my assertions valid?
Thanks
Yes, that is correct. By default, every bean instance in a Spring context is a singleton.
However, most readers and writers have state. For instance, a FlatFileItemReader can only run once; after that it points to the end of the file and its close method has been called. Therefore, if you simply start the job again, it will not work, since the FlatFileItemReader is closed.
For such cases, you will need to define them with scope="step".
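In XML that simply means adding scope="step" to the reader's bean definition; the Java-config equivalent is sketched below (the file name and pass-through line mapping are illustrative):

    import org.springframework.batch.core.configuration.annotation.StepScope;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.io.FileSystemResource;

    // Sketch: with step scope, a fresh reader instance is created for each step
    // execution, so relaunching the job never reuses an exhausted, closed reader.
    @Configuration
    public class ReaderConfig {

        @Bean
        @StepScope
        public FlatFileItemReader<String> itemReader() {
            FlatFileItemReader<String> reader = new FlatFileItemReader<>();
            reader.setResource(new FileSystemResource("input/data.csv")); // illustrative path
            reader.setLineMapper(new PassThroughLineMapper());
            return reader;
        }
    }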

Talend job batch processing

I am exploring Talend at work, and I was asked whether Talend supports batch processing, as in running a job in multiple threads. After going through the user guide, I understood that threading is possible with subjobs. I would like to know whether it is possible to run a job with a single action in parallel.
Talend has excellent multithreading support. There are two basic methods for this. One method gives you more control and is implemented using components. The other is implemented as a job setting.
For the first method, see my screenshot. I use tParallelize to load three files into three tables at the same time. Then, when all three files are successfully loaded, I use the same tParallelize to set the values of a control table. tParallelize can also be connected to tRunJob just as easily as to a subjob.
The other method is described very well here in Talend Help: Talend Help- Run Jobs in Parallel
Generally I recommend the first method because of the control it gives you, but if your job follows the simple pattern described in the help link, that method works as well.

How do I listen for, load and run user-defined workflows at runtime that have been persisted using SqlWorkflowInstanceStore?

The result of SqlWorkflowInstanceStore.WaitForEvents does not tell me what type of workflow is runnable. The constructor of WorkflowApplication takes a workflow definition, and at a minimum, I need to be able to store a workflow ID in the store and query it, so that I can determine which workflow definition to load for the WorkflowApplication.
I also don't want to create a SqlWorkflowInstanceStore for each custom workflow type, since there may be thousands of different workflows.
I thought about trying to use WorkflowServiceHost, but not every workflow has a Receive activity and I don't think it is feasible to have thousands of WorkflowServiceHosts running, each supporting a different workflow type.
Ideally, I just want to query the database for a runnable workflow, determine its workflow definition ID, load the appropriate XAML from a workflow definition table, instantiate WorkflowApplication with the workflow definition, and call LoadRunnableInstance().
I would like to have a way to correlate which workflow is related to a given HasRunnableWorkflowEvent raised by the SqlWorkflowInstanceStore (along with the custom workflow definition ID), or have an alternate way of supporting potentially thousands of different custom workflow types created at runtime. I must also load balance the execution of workflows across multiple application servers.
There's a free product from Microsoft that does pretty much everything you say there, and then some. Oh, and it's excellent too.
Windows Server AppFabric. No, not Azure.
http://www.microsoft.com/windowsserver2008/en/us/app-main.aspx
-Oisin