Spring Batch: Repeat steps in chunk-oriented processing

I'm wondering whether it is possible to configure a job in such a way that I can repeat several chunk-oriented steps until the whole data set is processed?
The background is that I need to work on some really big data, and while processing it there is a risk of unwanted aborts. To avoid restarting from scratch over and over again, I'd like to partition the data so that the partitions can be used to loop over the chunked steps.
Due to the nature of the data, it is unfortunately not possible to use the Spring Batch restartable-job feature to reach my goal.
My source database consists of several more-or-less loosely connected tables, each of which is processed in its own step. So I have something like:
... omitting job-configuration ...
<batch:step="A" next="B">
<batch:tasklet>
<batch:chunk reader="readerA" writer="writerA" commit-interval="1000" />
</batch:tasklet>
</batch:step>
<batch:step="B" next="C">
<batch:tasklet>
<batch:chunk reader="readerB" writer="writerB" commit-interval="5000" />
</batch:tasklet>
</batch:step>
... some more steps with similar set-up...
Each reader has its own SQL statement to get the necessary data from the source DB, and it will write the result into another table of the target DB.
Now, my idea would be to adapt those SQL statements so that the data is partitioned into disjoint but consistent(*) parts, which would let me repeat the processing using the chunked steps as before, perhaps only adding some "parent step" to control whether the loop has to end.
(*) By "disjoint but consistent" I mean that although the data in the different steps is fetched from different tables, there are dependencies. For example, fetching the data to be processed for step B would do a join with table A, choosing only records that were successfully processed.
Thanks for any advice!
/Andreas

Since there are dependencies between tables, I don't think going parallel is appropriate. Going parallel makes sense when partitions are independent of each other.
Your current setup should allow you to restart your job from where it left off at two levels:
In between steps: if the job fails at step B, step A will not be re-executed.
Within each step: If the job fails in the middle of step A, it will restart from the last successfully committed chunk of step A.
You need to make sure to use a persistent job repository and restart the same job instance in case of failure (using the same identifying job parameters as the previous run).
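For illustration, here is a minimal Java sketch (not from the original question) of what "restarting the same job instance" looks like in code; the parameter name, its value, and the surrounding class are invented, and a persistent (database-backed) JobRepository is assumed to be configured:

import java.util.Date;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class RestartExample {

    // jobLauncher and job are assumed to be wired from an application context
    // that also configures a persistent JobRepository.
    void runOrRestart(JobLauncher jobLauncher, Job job, Date runDate) throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addDate("runDate", runDate) // identifying parameter; reuse the same value to restart
                .toJobParameters();
        // If a previous execution with these parameters failed, this call creates a new
        // JobExecution for the same JobInstance: completed steps are skipped and the
        // failed chunk-oriented step resumes from its last committed chunk.
        jobLauncher.run(job, params);
    }
}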


Is a Spring Batch job thread-safe?

I need to parallelize a single step of a Spring Batch job. Before the step to be parallelized, tasklets are run that put some results into the parameters of the job.
The results produced by the tasklets are necessary to execute the Partitioner and the items of the step to be parallelized.
There is one doubt I really can't resolve: since I can have the same job running simultaneously multiple times with different initial parameters, are the tasklets and step items thread-safe?
No, tasklets and chunk-oriented step components are not thread-safe. If they are shared between multiple job instances/executions running concurrently, you need to make them thread-safe.
You can achieve this by using JobScoped steps and StepScoped readers/writers. You can also use the SynchronizedItemStreamReader and the (upcoming) SynchronizedItemStreamWriter to make readers and writers thread-safe. All item readers and writers provided by Spring Batch have a mention about their thread-safety in the Javadoc.
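As a rough illustration (bean names, the file path, and the item type are invented, and the builder-style API of Spring Batch 4+ is assumed), a step-scoped reader and, alternatively, a SynchronizedItemStreamReader wrapper could look like this:

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class ReaderConfig {

    // Option 1: step-scoped bean, so every step execution gets its own reader instance
    // and concurrent executions of the same job do not share state.
    @Bean
    @StepScope
    public FlatFileItemReader<String> stepScopedReader() {
        return new FlatFileItemReaderBuilder<String>()
                .name("stepScopedReader")
                .resource(new FileSystemResource("input.csv")) // placeholder path
                .lineMapper((line, lineNumber) -> line)
                .build();
    }

    // Option 2: a single shared reader wrapped so that concurrent reads are serialized,
    // e.g. for a multi-threaded step.
    @Bean
    public SynchronizedItemStreamReader<String> synchronizedReader() {
        FlatFileItemReader<String> delegate = new FlatFileItemReaderBuilder<String>()
                .name("sharedReader")
                .resource(new FileSystemResource("input.csv")) // placeholder path
                .lineMapper((line, lineNumber) -> line)
                .build();
        SynchronizedItemStreamReader<String> reader = new SynchronizedItemStreamReader<>();
        reader.setDelegate(delegate);
        return reader;
    }
}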
You do not want to run multiple instances of the same job. It would be better to run multiple tasks or processes in the same step and/or job. You might want to look up job partitioning and/or remote chunking to do concurrent processing.
If it has to be isolated jobs, then you might have your concurrent jobs write out to, say, a message queue as their end (writer) step, and then have another job listen and read from that queue.
https://docs.spring.io/spring-batch/2.1.x/cases/parallel.html
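If you go the partitioning route, a Partitioner is the usual entry point. Below is a minimal local sketch (the key range and class name are invented); each resulting ExecutionContext is handed to its own worker step execution, whose reader then selects only the rows between fromId and toId:

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class RangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;

    public RangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long rangeSize = (maxId - minId) / gridSize + 1;
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            // Each partition covers a disjoint slice of the key range.
            context.putLong("fromId", minId + i * rangeSize);
            context.putLong("toId", Math.min(minId + (i + 1) * rangeSize - 1, maxId));
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}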

Queues: How to process dependent jobs

I am working on an application where multiple clients will be writing to a queue (or queues), and multiple workers will be processing jobs off the queue. The problem is that in some cases, jobs are dependent on each other. By 'dependent', I mean they need to be processed in order.
This typically happens when an entity is created by the user, then deleted shortly after. Obviously I want the first job (i.e. the creation) to take place before the deletion. The problem is that creation can take a lot longer than deletion, so I can't guarantee that it will be complete before the deletion job commences.
I imagine that this type of problem is reasonably common with asynchronous processing. What strategies are there to deal with it? I know that I can assign priorities to queues to have some control over the processing order, but this is not good enough in this case. I need concrete guarantees.
This may not fit your model, but the model I have used involves not providing the deletion functionality until the creation functionality is complete.
When the Create_XXX command is completed, it is responsible for raising an XXX_Created event, which also gets put on the queue. This event can then be handled to enable the deletion functionality, allowing the deletion of the newly created item.
The process of a Command completing, then raising an event which is handled and creates another Command is a common method of ensuring Commands get processed in the desired order.
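To make that chain concrete, here is a minimal, framework-free Java sketch of the idea (all type and message names are invented, and it ignores concurrency): the delete command is only enqueued once the created event produced by the create handler has been processed.

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CommandChain {

    record CreateEntity(String id) {}
    record EntityCreated(String id) {}
    record DeleteEntity(String id) {}

    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private final Set<String> created = new HashSet<>();
    private final Set<String> pendingDeletes = new HashSet<>();

    // Client side: the delete is only enqueued once creation is known to be done.
    public void requestDelete(String id) {
        if (created.contains(id)) queue.add(new DeleteEntity(id));
        else pendingDeletes.add(id);
    }

    // Worker side: handles one message taken from the queue.
    void handle(Object message) {
        if (message instanceof CreateEntity cmd) {
            // ... perform the (possibly slow) creation ...
            queue.add(new EntityCreated(cmd.id()));          // raise the event when done
        } else if (message instanceof EntityCreated event) {
            created.add(event.id());
            if (pendingDeletes.remove(event.id())) {
                queue.add(new DeleteEntity(event.id()));     // deletion is now safe to run
            }
        } else if (message instanceof DeleteEntity cmd) {
            // ... perform the deletion; it is guaranteed to run after the creation ...
        }
    }
}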
I think a handy feature for your use case is job chaining:
https://laravel.com/docs/5.5/queues#job-chaining

Jobs in a queue (pub-sub) in distributed systems with dependencies?

How should one approach a problem where jobs are put in a queue (pub-sub) in a distributed system and there are dependencies between them?
For example, the current state of the queue is:
j3 (rear) -> j2 -> j1 (front)
j3 depends on the completion of j1.
The queue processor consumes these jobs and has started processing them in a distributed environment.
Based on some dependency resolution mechanism, the dependency between j1 and j3 was detected.
Now, what I don't know is the best way to deal with this situation:
Should I put j3 back in the queue and pick it up again at a later stage, so that j1 will have completed by that time?
Or should I have some other mechanism, such as a database, to check whether all of j3's dependencies have been met and only then process j3?
Any help would be appreciated.
Thanks!
Having a job scheduler that's aware that these jobs are at the front of the queue, but are waiting on some dependencies, is the best way. That way, you can get other jobs done while waiting for the dependencies to finish, but still process them as much in order as possible.
Pushing items back onto the end of the queue is a good workaround if it's relatively cheap to do so, if the queue is relatively short, and if there are rather few dependencies. If the item you push back is itself a dependency of other tasks, they too need to be pushed to the back of the queue when they arrive at the front (or immediately, but that's unnecessarily hard). If the queue is long, you could see unexpected delays. For example, if the queue is a day long, you could end up waiting days for a task to finish. If that task is part of a chain of dependencies, the problem grows.
Either way, you're going to need to know if a task is queued/running/finished. You could store this information in your favourite database or use some gossip protocol or whatever you like. If it's not a correctness problem if the same job is executed twice, you can use an AP system (in the CAP sense, with eventual consistency, such as a gossip protocol). If running the same task twice is going to mess things up badly, you'll need some consensus mechanism, like a single source of truth, such as your favourite sql database or maybe couchbase.
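As a very rough, single-process illustration of the dependency-aware approach (all names invented, no distribution, persistence, or thread-safety), a dispatcher could park jobs whose dependencies are not yet in the finished set and re-check them whenever something completes:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DependencyAwareDispatcher {

    record Job(String id, List<String> dependsOn) {}

    private final Deque<Job> queue = new ArrayDeque<>();
    private final Deque<Job> parked = new ArrayDeque<>();
    private final Set<String> finished = new HashSet<>(); // in production: a shared store

    public void submit(Job job) { queue.addLast(job); }

    public void dispatchNext() {
        Job job = queue.pollFirst();
        if (job == null) return;
        if (finished.containsAll(job.dependsOn())) {
            run(job);
        } else {
            parked.addLast(job); // wait without blocking the rest of the queue
        }
    }

    private void run(Job job) {
        // ... actual processing happens here (on a worker) ...
        finished.add(job.id());
        // A completed job may unblock parked jobs; move the runnable ones back into the queue.
        parked.removeIf(p -> {
            if (finished.containsAll(p.dependsOn())) { queue.addLast(p); return true; }
            return false;
        });
    }
}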

How does Spring Batch transaction management work?

I'm trying to understand how Spring Batch does transaction management. This is not a technical question but more of a conceptual one: what approach does Spring Batch use and what are the consequences of that approach?
Let me try to clarify this question a bit. For instance, looking at the TaskletStep, I see that generally a step execution looks something like this:
1. several JobRepository transactions to prepare the step metadata
2. a business transaction for every chunk to process
3. more JobRepository transactions to update the step metadata with the results of chunk processing
This seems to make sense. But what about a failure between 2 and 3? This would mean the business transaction was committed but Spring Batch was unable to record that fact in its internal metadata. So a restart would reprocess the same items again even though they have already been committed. Right?
I'm looking for an explanation of these details and the consequences of the design decisions made in Spring Batch. Is this documented somewhere? The Spring Batch reference guide has very few details on this. It simply explains things from the application developer's point of view.
There are two fundamental types of steps in Spring Batch: a tasklet-based step and a chunk-based step. Each has its own transaction details. Let's look at each:
Tasklet Based Step
When a developer implements their own tasklet, the transactionality is pretty straightforward. Each call to the Tasklet#execute method is executed within a transaction. You are correct in that there are updates before and after a step's logic is executed. They are not technically wrapped in a transaction since rollback isn't something we'd want to support for the job repository updates.
Chunk Based Step
When a developer uses a chunk-based step, there is a bit more complexity involved due to the added abilities for skip/retry. However, at a simple level, each chunk is processed in a transaction. You still have the same updates before and after a chunk-based step that are non-transactional, for the same reasons previously mentioned.
The "What if" scenario
In your question, you ask what would happen if the business logic completed but the updates to the job repository failed for some reason. Would the previously processed items be re-processed on a restart? As with most things, that depends. If you are using stateful readers/writers like the FlatFileItemReader, with each commit of the business transaction the job repository is updated with the current state of what has been processed (within the same transaction). So in that case, a restart of the job would pick up where it left off, in this case at the end, and process no additional records.
If you are not using stateful readers/writers or have save state turned off, then it is a bit of buyer beware and you may end up with the situation you describe. The default behavior in the framework is to save state so that restartability is preserved.
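For reference, here is a minimal sketch (method name, file path, and item type invented; builder-style API of Spring Batch 4+ assumed) of a stateful reader with saveState enabled, which is what makes the "pick up where it left off" behaviour described above possible:

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.core.io.FileSystemResource;

public class StatefulReaderConfig {

    // With saveState = true (the default), the reader records its current position
    // in the step's ExecutionContext as chunks are committed, so a restart of the
    // same job instance resumes after the last committed chunk. Setting
    // saveState(false) opts out and can lead to reprocessing on restart.
    public FlatFileItemReader<String> restartableReader() {
        return new FlatFileItemReaderBuilder<String>()
                .name("restartableReader")
                .resource(new FileSystemResource("input.csv")) // placeholder path
                .lineMapper((line, lineNumber) -> line)
                .saveState(true)
                .build();
    }
}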

Retry failed writing operations without delaying other steps in Spring Batch application

I am maintaining a legacy application written using Spring Batch and need to tweak it to never lose data.
I have to read from various web services (one for each step) and then write to a remote database. Things go bad when the connection with the DB drops, because all items read from the web service are discarded (I can't read the same item twice), and the data is lost because it cannot be written.
I need to set up Spring Batch to keep the already-read data of a step so that the writing operation can be retried the next time the step runs. The same step must not read more data until the write operation has completed successfully.
When it is unable to write, the step should keep the read data and pass execution to the next step. After a while, when it's time for the failed step to run again, it should not read another item but retry the failed writing operation instead.
The batch application should run in an infinite loop, and each step should gather data from a different source. Failed writing operations should be temporarily skipped (keeping the read data) so as not to delay other steps, but should resume from the write operation the next time they are called.
I have been researching various web sources besides the official docs, but Spring Batch doesn't have the most intuitive documentation I have come across.
Can this be achieved? If yes, how?
You can write the data that you need to persist in case the job fails to the batch step's ExecutionContext, and restart the job again with this data:
Step executions are represented by objects of the StepExecution class. Each execution contains a reference to its corresponding step and JobExecution, and transaction related data such as commit and rollback count and start and end times. Additionally, each step execution will contain an ExecutionContext, which contains any data a developer needs persisted across batch runs, such as statistics or state information needed to restart.
More from: http://static.springsource.org/spring-batch/reference/html/domain.html#domainStepExecution
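As a rough sketch of that idea (class and key names invented; the Spring Batch 4.x ItemWriter signature is assumed), a writer could stash the unwritten items in the step's ExecutionContext before rethrowing, so they are still available when the job is restarted:

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemWriter;

public class BufferingWriter implements ItemWriter<String> {

    private ExecutionContext stepContext;

    // Depending on your configuration, the writer may need to be registered as a
    // step listener for this annotation to be picked up.
    @BeforeStep
    public void saveStepExecution(StepExecution stepExecution) {
        this.stepContext = stepExecution.getExecutionContext();
    }

    @Override
    public void write(List<? extends String> items) throws Exception {
        try {
            // ... attempt to write the items to the remote database ...
        } catch (Exception e) {
            // Keep the unwritten items; on a restart they can be read back from the
            // ExecutionContext and written before any new items are read.
            stepContext.put("pendingItems", new ArrayList<>(items));
            throw e;
        }
    }
}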
I do not know if this will be ok with you, but here are my thoughts on your configuration.
Since you have two remote sources that are open to failure, let us partition the overall system into two jobs (not two steps):
JOB A
Step 1: Tasklet
Check a shared folder for files. If files exist, do not proceed to the next step. This will make more sense once JOB B is described; see also the tasklet sketch at the end of this answer.
Step 2: Webservice to files
Read from your web service and write the results to flat files in the shared folder. Since you would be using flat files for the output, you solve the problem that "all items read from the web service are discarded and the data is lost because it cannot be written".
Use Quartz or equivalent for the scheduling of this job.
JOB B
Poll the shared folder for generated files and launch the job for each file (the file's location as a job parameter). The Spring Integration project may help with this polling.
Step 1:
Read from the file, write the items to the remote DB, and move/delete the file if writing to the DB is successful.
No scheduling is needed since job launching originates from the polled files.
Sample Execution
Time 0: No file in the shared folder
Time 1: Read from web service and write to shared folder
Time 2: Job B file polling occurs, and it tries to write to the DB.
If successful, the system continues to execute.
If not, when Job A tries to execute at its scheduled time, it will skip reading from the web service since files still exist in the shared folder. It will keep skipping until Job B consumes the files.
I did not want to go into implementation specifics but Spring Batch can handle all of these situations. Hope that this helps.
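To make Step 1 of JOB A concrete, here is a rough tasklet sketch (the folder path, file suffix, and class name are invented). It stops the job when unconsumed files are still present, which is one way of "not proceeding to the next step"; a conditional flow on the step's exit status would be another:

import java.io.File;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class SharedFolderGuardTasklet implements Tasklet {

    private final File sharedFolder = new File("/data/shared"); // placeholder location

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        File[] pending = sharedFolder.listFiles((dir, name) -> name.endsWith(".csv"));
        if (pending != null && pending.length > 0) {
            // Files from a previous run have not been consumed by JOB B yet:
            // ask the framework to stop the job so the web service is not read again.
            chunkContext.getStepContext().getStepExecution().setTerminateOnly();
        }
        return RepeatStatus.FINISHED;
    }
}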