How do you distribute a Spring Batch job effectively across JVMs? - spring-batch

In the job I read from a file and store something in a database.
I would like to have many running jars of the batch job in different processes and partition the data from the file among the running instances.
I would also like to be able to keep adding files to be processed and have the reads from those distributed as well.
I read that Spring XD might be a good fit, but I can't find good tutorials on it.
Yes, I am new to both Spring Batch and Spring XD.

The first thing to understand is how to remotely partition batch jobs. See the batch documentation for Spring Batch Integration and its support for remote partitioning, based on basic batch partitioning.
Spring XD provides out-of-the-box support for single-step partitioned workloads.
You just have to import singlestep-partition-support.xml and provide partitioner and tasklet beans. See the XD Documentation for an example.
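To make the partitioner's role concrete, here is a minimal plain-Java sketch of how input files might be divided among worker partitions. It deliberately uses no Spring Batch or Spring XD classes: `FilePartitionSketch`, the partition names, and the file names are all hypothetical. In Spring Batch itself, a `Partitioner` returns a map of partition names to `ExecutionContext` objects from `partition(int gridSize)`; in this sketch a plain list of file names stands in for that context.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Spring dependency): assigns input files to
// worker partitions round-robin. All names here are hypothetical.
final class FilePartitionSketch {

    // Returns partition name -> files that partition should read.
    static Map<String, List<String>> partition(List<String> files, int gridSize) {
        if (gridSize <= 0) {
            throw new IllegalArgumentException("gridSize must be positive");
        }
        Map<String, List<String>> partitions = new LinkedHashMap<>();
        for (int i = 0; i < files.size(); i++) {
            // File i goes to partition (i mod gridSize).
            String name = "partition" + (i % gridSize);
            partitions.computeIfAbsent(name, k -> new ArrayList<>()).add(files.get(i));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> files = List.of("a.csv", "b.csv", "c.csv", "d.csv", "e.csv");
        partition(files, 2).forEach(
                (name, assigned) -> System.out.println(name + " -> " + assigned));
    }
}
```

Each worker JVM would then receive one entry of this map and read only the files listed there. Newly arrived files would need a new job execution, since the partitioner runs once per step execution.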

Related

Difference between Spring Cloud Task and Spring Batch?

I went through the "Introducing Spring Cloud Task" article, but the following points are still not clear to me.
I'm using Spring Batch.
What is the use of Spring Cloud Task when we already have the metadata provided by Spring Batch?
We're planning to use Spring Cloud Data Flow (SCDF) to monitor our Spring Batch jobs. All the batch jobs can be imported into SCDF as tasks and scheduled there, but I don't see support for MongoDB. I hope MySQL works well.
What is the difference between Spring Cloud Task and Spring Batch?
Spring Cloud Task has a broader scope than Spring Batch. It is designed for any short-lived task, including but not limited to (Spring) Batch jobs. A short-lived task could be a Java process, a shell script, a Docker container, etc. Spring Cloud Task has its own metadata tables to track the progress, status, and statistics of tasks.
In the context of Spring Batch, Spring Cloud Task provides a number of additional features:
Batch informational messages: ability to emit messages based on Spring Batch listeners events. Those messages can be consumed by streaming apps and make it possible to bridge tasks and streaming apps.
DeployerPartitionHandler: an additional partition handler, suited to cloud environments, that dynamically deploys workers in a remote partitioning setup.

Spring Batch Partitioning DBtoFile Java Configuration Example

I am currently working on a Spring Boot and Spring Batch application that reads 200,000 records from a database, processes them, and generates XML output.
I wrote a single-threaded Spring Batch program that uses JdbcPagingItemReader to read the records from the database in pages of 10K and StaxEventItemWriter to generate the output. The whole process takes 30 minutes. I want to improve this by using Spring Batch local partitioning. Could anyone share Java configuration code for Spring Batch partitioning that splits the processing across multiple threads and multiple files? I tried a multi-threaded Java configuration, but StaxEventItemWriter is single-threaded, so it didn't work. The only way I see is partitioning.
I'd appreciate any help.
You are correct that partitioning is the way to approach this problem. I don't have an example of a partitioned batch job that goes from JDBC to XML, but I do have one that goes from CSV to JDBC, in which you should be able to just replace the ItemReader and ItemWriter with the ones you need (JdbcPagingItemReader and StaxEventItemWriter respectively). This example actually uses Spring Cloud Task to launch the workers as remote processes, but if you replace the partitionHandler with the TaskExecutorPartitionHandler (instead of the DeployerPartitionHandler as configured), the partitions will be executed internally as threads.
https://github.com/mminella/S3JDBC
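As a rough illustration of what thread-based local partitioning does, here is a plain-Java sketch with no Spring classes. `LocalPartitionSketch`, its `Partition` type, and the `output-N.xml` file names are hypothetical, and the worker body is a stand-in for the real read/process/write step. The point it tries to show is that each partition owns its own row range and its own output file, so a single-threaded writer is never shared between threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java sketch of local (thread-based) partitioning. Hypothetical names.
final class LocalPartitionSketch {

    // One partition: half-open row range [start, end) and the file only it writes.
    static final class Partition {
        final int start;
        final int end;
        final String outputFile;

        Partition(int start, int end, String outputFile) {
            this.start = start;
            this.end = end;
            this.outputFile = outputFile;
        }
    }

    // Splits totalRows into gridSize contiguous ranges of near-equal size.
    static List<Partition> split(int totalRows, int gridSize) {
        if (gridSize <= 0) {
            throw new IllegalArgumentException("gridSize must be positive");
        }
        List<Partition> partitions = new ArrayList<>();
        int base = totalRows / gridSize;
        int remainder = totalRows % gridSize;
        int start = 0;
        for (int i = 0; i < gridSize; i++) {
            int size = base + (i < remainder ? 1 : 0); // spread the remainder
            partitions.add(new Partition(start, start + size, "output-" + i + ".xml"));
            start += size;
        }
        return partitions;
    }

    // Runs each partition on its own thread; returns rows handled per partition.
    static List<Integer> runAll(List<Partition> partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (Partition p : partitions) {
                // Stand-in for: read rows [start, end), process, write p.outputFile.
                tasks.add(() -> p.end - p.start);
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                counts.add(f.get());
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException("partition failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Partition> partitions = split(200_000, 4);
        for (Partition p : partitions) {
            System.out.println(p.outputFile + ": rows [" + p.start + ", " + p.end + ")");
        }
        System.out.println("rows per partition: " + runAll(partitions));
    }
}
```

In a real job the range and output file name would travel in each partition's ExecutionContext, and the step-scoped reader and writer would pick them up from there.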

Spring batch partition or using java multi threading?

I need to design multithreading with Spring Batch. Which is the better choice: Spring Batch partitioning or plain Java multithreading? We have many processes, and each process holds jobs and sub-jobs. These sub-jobs need to be executed in parallel. How can I implement a retry mechanism with partitioning?
Go for partitioning with the master-slave concept. I have tried this, and it boosts performance considerably.
Restart scenario:
Your partitioner starts and your items are divided among the slaves.
Let's say you have 3 slaves and each slave holds 1 file to process.
Manually delete some items in the file assigned to Slave2 so that its step fails (either in the reader or in the writer of your slave step).
Then restart the job. It should now resume reading from the file that was assigned to Slave2.
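The restart behaviour described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not Spring Batch code: `PartitionRestartSketch` and its `repository` map are hypothetical stand-ins for the job repository, and the `worker` predicate stands in for a slave step that returns true on success.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Plain-Java sketch: on restart, only partitions that did not complete run again.
final class PartitionRestartSketch {

    enum Status { COMPLETED, FAILED }

    // Hypothetical stand-in for the job repository: partition name -> last status.
    final Map<String, Status> repository = new LinkedHashMap<>();

    // Runs every partition that is not yet COMPLETED and records the outcome.
    // Returns the names of the partitions actually executed in this run.
    List<String> run(List<String> partitions, Predicate<String> worker) {
        List<String> executed = new ArrayList<>();
        for (String name : partitions) {
            if (repository.get(name) == Status.COMPLETED) {
                continue; // finished in an earlier run, so skip it on restart
            }
            executed.add(name);
            repository.put(name, worker.test(name) ? Status.COMPLETED : Status.FAILED);
        }
        return executed;
    }

    public static void main(String[] args) {
        PartitionRestartSketch job = new PartitionRestartSketch();
        List<String> partitions = List.of("partition1", "partition2", "partition3");

        // First run: the second partition fails, as in the scenario above.
        System.out.println("first run:  " + job.run(partitions, n -> !n.equals("partition2")));

        // Restart after fixing the input: only the failed partition runs.
        System.out.println("second run: " + job.run(partitions, n -> true));
    }
}
```

The sketch tracks status per partition only. A real restart can also resume inside the failed partition from its last committed position, provided the reader saves its state.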

Spring batch MongoDB trade-offs vs Spring batch MySQL

I've used Spring Batch with MySQL before, and the availability of Spring Batch Admin makes starting, stopping, and restarting jobs a lot easier. But my current company is considering moving from a Derby database to MongoDB for the obvious NoSQL benefits, and also wants to move its existing messy batch application solutions to the Spring Batch framework. They would also like to use Spring Batch Admin for managing the jobs.
Question:
What are the trade-offs we will have to make when using Spring Batch with MongoDB rather than Spring Batch with MySQL?
After doing a bit of research, I've gathered the following trade-offs for using MongoDB with Spring Batch:
Since MongoDB does not support transactions, Spring Batch Admin will not work, since the Admin requires the metadata schema, which is not available for MongoDB.
We will not be able to stop, start, and restart jobs.
If a step's writer tries committing 20 documents and the commit for 1 document fails, the other 19 documents will not be rolled back automatically and will have to be managed by the system.
Can you please tell me whether I am right about the above, and whether there are any other trade-offs that I have not mentioned?

Spring batch- Parallel processing

I am running the Spring Batch job on three machines. For example, if the database has 30 records, the batch job on each machine has to pick up 10 unique records and process them.
I read about partitioning and parallel processing and am a bit confused about which one is suitable.
I'd appreciate your help.
What you are describing is partitioning. Partitioning is when the input is broken up into partitions and each partition is processed in parallel. Spring Batch offers two ways to execute partitions. One is local, using threads (via the TaskExecutorPartitionHandler). The other distributes the partitions via messages, so they can be executed either locally or remotely, using the MessageChannelPartitionHandler found in Spring Batch Admin's spring-batch-integration project. You can learn more about remote partitioning via my talk on multi-JVM batch processing here: http://www.youtube.com/watch?v=CYTj5YT7CZU
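For the 30-records-on-three-machines case, one common way to break up the input is by ID range. The plain-Java sketch below shows only the range arithmetic and assumes the records have numeric IDs that are roughly contiguous. `IdRangePartitionSketch` is a hypothetical name, and no Spring classes are used. Each resulting pair could be handed to one worker as the bounds of its query.

```java
// Plain-Java sketch: split an inclusive ID range into gridSize sub-ranges.
final class IdRangePartitionSketch {

    // Returns gridSize pairs {fromId, toId}, both inclusive.
    // A pair with fromId > toId is empty (more partitions than IDs).
    static long[][] ranges(long minId, long maxId, int gridSize) {
        if (gridSize <= 0 || maxId < minId) {
            throw new IllegalArgumentException("need gridSize > 0 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long targetSize = (total + gridSize - 1) / gridSize; // ceiling division
        long[][] result = new long[gridSize][2];
        long start = minId;
        for (int i = 0; i < gridSize; i++) {
            long end = Math.min(start + targetSize - 1, maxId);
            result[i][0] = start;
            result[i][1] = end;
            start = end + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // 30 records with IDs 1..30 spread over three workers.
        for (long[] r : ranges(1, 30, 3)) {
            System.out.println("WHERE id BETWEEN " + r[0] + " AND " + r[1]);
        }
    }
}
```

If the IDs have large gaps, equal ranges will not give equal record counts, so a different partitioning key may be a better fit.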