Spring Batch Partitioning DB-to-File Java Configuration Example - spring-batch

I am currently working on a Spring Boot and Spring Batch application that reads 200,000 records from a database, processes them, and generates XML output.
I wrote a single-threaded Spring Batch program that uses JdbcPagingItemReader to read the records from the database in pages of 10K and StaxEventItemWriter to generate the output. The whole process takes 30 minutes. I want to speed it up using Spring Batch local partitioning. Could anyone share Java configuration code for Spring Batch partitioning that splits the processing across multiple threads and multiple files? I tried a multi-threaded Java configuration, but StaxEventItemWriter is not thread-safe, so it didn't work. The only way I see is partitioning.
I'd appreciate any help.

You are correct that partitioning is the way to approach this problem. I don't have a JDBC-to-XML example of a partitioned batch job, but I do have a CSV-to-JDBC one in which you should be able to simply replace the ItemReader and ItemWriter with the ones you need (JdbcPagingItemReader and StaxEventItemWriter, respectively). That example uses Spring Cloud Task to launch the workers as remote processes, but if you replace the DeployerPartitionHandler it configures with a TaskExecutorPartitionHandler, the partitions will execute internally as threads.
https://github.com/mminella/S3JDBC
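The core of a local partitioning setup is deciding which slice of the table each worker thread reads. The following is a minimal plain-Java sketch of that splitting logic only, assuming the table has a numeric ID column whose minimum and maximum are known. The class name IdRangePartitions, the Range holder, and the "partition0", "partition1", ... keys are hypothetical names chosen for illustration. In a real job, logic like this would typically sit inside a Spring Batch Partitioner, whose partition(int gridSize) method returns a Map<String, ExecutionContext>; each range would be stored in an ExecutionContext so that a step-scoped JdbcPagingItemReader can restrict its query to that range and a step-scoped StaxEventItemWriter can write to its own file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Splits an inclusive ID range into contiguous, non-overlapping partitions. */
final class IdRangePartitions {

    /** Inclusive [minId, maxId] bounds that one worker would read. */
    static final class Range {
        final long minId;
        final long maxId;

        Range(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }
    }

    /**
     * Divides [minId, maxId] into at most gridSize ranges whose sizes differ by at most one.
     * Fewer than gridSize ranges are returned when there are fewer IDs than partitions.
     */
    static Map<String, Range> split(long minId, long maxId, int gridSize) {
        if (gridSize < 1 || maxId < minId) {
            throw new IllegalArgumentException("need gridSize >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long base = total / gridSize;   // minimum number of IDs per partition
        long extra = total % gridSize;  // the first 'extra' partitions take one more ID

        Map<String, Range> partitions = new LinkedHashMap<>();
        long start = minId;
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            long size = base + (i < extra ? 1 : 0);
            long end = start + size - 1;
            partitions.put("partition" + i, new Range(start, end));
            start = end + 1;
        }
        return partitions;
    }
}
```

For the 200,000 records in the question, IdRangePartitions.split(1, 200_000, 4) would hand each of four threads a block of 50,000 consecutive IDs. This splits evenly only when the IDs are reasonably dense; with large gaps in the ID column, some partitions would contain far fewer rows than others.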

Related

How to disable the saving of logs in Spring Batch metadata tables?

Currently our Spring Batch job runs every 20 seconds, and 3 jobs run concurrently. As a result, the Spring Batch metadata tables below grow rapidly. Is there a way to disable this? If not, how can we clean up these tables from time to time?
BATCH_JOB_INSTANCE,
BATCH_JOB_EXECUTION,
BATCH_JOB_EXECUTION_PARAMS,
and BATCH_STEP_EXECUTION
The RemoveSpringBatchHistoryTasklet can be used in a Spring Batch job that you schedule to run periodically to purge the Spring Batch metadata tables.
See https://github.com/arey/spring-batch-toolkit
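To show what such a purge involves, here is a minimal sketch of the delete ordering, not the toolkit's actual code. It assumes the standard Spring Batch schema with the default BATCH_ table prefix, which also includes the BATCH_STEP_EXECUTION_CONTEXT and BATCH_JOB_EXECUTION_CONTEXT tables in addition to the four listed in the question. The class name BatchHistoryPurgeSql is hypothetical. Child rows are deleted before the parent rows they reference so that foreign key constraints are not violated.

```java
import java.util.List;

/** Builds purge statements for Spring Batch metadata older than a cutoff timestamp. */
final class BatchHistoryPurgeSql {

    /** Selects job executions created before the cutoff; '?' is bound to a timestamp. */
    private static final String OLD_JOB_EXECUTIONS =
            "SELECT JOB_EXECUTION_ID FROM BATCH_JOB_EXECUTION WHERE CREATE_TIME < ?";

    /**
     * Returns the statements in the order they should run (children before parents).
     * Every statement except the last takes exactly one timestamp parameter.
     */
    static List<String> statements() {
        return List.of(
                "DELETE FROM BATCH_STEP_EXECUTION_CONTEXT WHERE STEP_EXECUTION_ID IN ("
                        + "SELECT STEP_EXECUTION_ID FROM BATCH_STEP_EXECUTION "
                        + "WHERE JOB_EXECUTION_ID IN (" + OLD_JOB_EXECUTIONS + "))",
                "DELETE FROM BATCH_STEP_EXECUTION WHERE JOB_EXECUTION_ID IN ("
                        + OLD_JOB_EXECUTIONS + ")",
                "DELETE FROM BATCH_JOB_EXECUTION_CONTEXT WHERE JOB_EXECUTION_ID IN ("
                        + OLD_JOB_EXECUTIONS + ")",
                "DELETE FROM BATCH_JOB_EXECUTION_PARAMS WHERE JOB_EXECUTION_ID IN ("
                        + OLD_JOB_EXECUTIONS + ")",
                "DELETE FROM BATCH_JOB_EXECUTION WHERE CREATE_TIME < ?",
                // Job instances are removed only once no execution refers to them.
                "DELETE FROM BATCH_JOB_INSTANCE WHERE JOB_INSTANCE_ID NOT IN ("
                        + "SELECT JOB_INSTANCE_ID FROM BATCH_JOB_EXECUTION)");
    }
}
```

A scheduled cleanup job could run these statements in a single transaction with a cutoff such as "one week ago". Choosing a cutoff well in the past helps avoid deleting rows that belong to executions still in progress.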

Difference between Spring Cloud Task and Spring Batch?

I went through Introducing Spring Cloud Task, but the following points are still unclear.
I'm using Spring Batch.
What's the use of Spring Cloud Task when we already have the metadata provided by Spring Batch?
We're planning to use Spring Cloud Data Flow to monitor our Spring Batch jobs. All the batch jobs can be imported into SCDF as tasks and scheduled there, but I don't see support for MongoDB. I hope MySQL works well.
What is the difference between Spring Cloud Task and Spring Batch?
Spring Cloud Task has a broader scope than Spring Batch. It is designed for any short-lived task, including but not limited to (Spring) Batch jobs. A short-lived task could be a Java process, a shell script, a Docker container, etc. Spring Cloud Task has its own metadata tables to track the progress, status, and statistics of tasks.
In the context of Spring Batch, Spring Cloud Task provides a number of additional features:
Batch informational messages: the ability to emit messages based on Spring Batch listener events. Those messages can be consumed by streaming apps, which makes it possible to bridge tasks and streaming apps.
DeployerPartitionHandler: an additional partition handler, suited to cloud environments, that dynamically deploys workers in a remote partitioning setup.

How to wire Hypersonic DB into a Spring Batch configuration for persisting jobs in Java

I would like to use a Hypersonic in-memory DB to persist jobs, as I need to run the same job multiple times on separate threads. The SimpleJobRepository cannot be used because I run into an optimistic locking issue.
Does anyone have a sample Java configuration file showing how to wire Hypersonic for a Spring Batch job in Java with annotations?

How do you distribute a Spring Batch job effectively across JVMs?

In the job, I read from a file and store something in a database.
I would like to have many running jars of the batch job in different processes and partition the data from the file among the running instances.
I would also like to be able to keep adding files to be processed and distribute the reads from those as well.
I read that Spring XD might be a good fit, but I can't find good tutorials on it.
And yes, I am new to both Spring Batch and XD.
The first thing to understand is how to remotely partition batch jobs. See the batch documentation for Spring Batch Integration and its support for remote partitioning, based on basic batch partitioning.
Spring XD provides out-of-the-box support for single-step partitioned workloads.
You just have to import singlestep-partition-support.xml and provide partitioner and tasklet beans. See the XD Documentation for an example.

Spring Batch - Parallel processing

I am running a Spring Batch job on three machines. For example, if the database has 30 records, the batch job on each machine has to pick up 10 unique records and process them.
I read about partitioning and parallel processing and am a bit confused about which one is suitable.
I'd appreciate your help.
What you are describing is partitioning. Partitioning is when the input is broken up into partitions and each partition is processed in parallel. Spring Batch offers two ways to execute partitions. One is local, using threads (via the TaskExecutorPartitionHandler). The other distributes the partitions via messages, so they can be executed either locally or remotely, via the MessageChannelPartitionHandler found in Spring Batch Admin's spring-batch-integration project. You can learn more about remote partitioning in my talk on multi-JVM batch processing: http://www.youtube.com/watch?v=CYTj5YT7CZU
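As a concrete illustration of "each machine picks up its own 10 of the 30 records", the sketch below assigns records to workers by taking the record ID modulo the number of workers. This is only one possible partitioning scheme, shown in plain Java under the assumption that records have numeric IDs. The class and method names are hypothetical. In a Spring Batch job, a partitioner would place the worker index and worker count into each partition's ExecutionContext, and each worker's reader would apply the equivalent filter in its query.

```java
import java.util.ArrayList;
import java.util.List;

/** Assigns each record ID to exactly one worker using ID modulo worker count. */
final class ModuloPartitioning {

    /** Returns the IDs owned by the worker at workerIndex (0-based) out of workerCount workers. */
    static List<Long> idsForWorker(List<Long> allIds, int workerIndex, int workerCount) {
        if (workerCount < 1 || workerIndex < 0 || workerIndex >= workerCount) {
            throw new IllegalArgumentException("need 0 <= workerIndex < workerCount");
        }
        List<Long> owned = new ArrayList<>();
        for (long id : allIds) {
            // floorMod keeps the result non-negative even for negative IDs.
            if (Math.floorMod(id, (long) workerCount) == workerIndex) {
                owned.add(id);
            }
        }
        return owned;
    }
}
```

With IDs 1 through 30 and three workers, each worker owns 10 IDs and no ID is owned twice. The split is even only when the IDs are spread evenly across the remainders; with gaps in the ID sequence the partition sizes can differ.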