Spring Batch Execution Status Backed by Database - spring-batch

From the Spring Guides:
For starters, the @EnableBatchProcessing annotation adds many critical
beans that support jobs and saves you a lot of leg work. This example
uses a memory-based database (provided by @EnableBatchProcessing),
meaning that when it’s done, the data is gone.
How can I make the execution state backed by a database (or some other persistent record) so that, in case the application crashes, the job is resumed from the previous state?
My solution, until now, is having my ItemReader be a JdbcCursorItemReader which reads records from a table whose column X is NULL, and my ItemWriter be a JdbcBatchItemWriter which updates the record with data in column X, making it non-null (so that it won't be picked up on the next execution), as sketched below. However, this seems really hackish and I believe there's a more elegant way. Can anyone please shed some light?
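For reference, that workaround might look roughly like this (a hypothetical sketch only; the records table, the x column, and the Record bean are assumptions, not from the original question):

import javax.sql.DataSource;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WorkaroundConfig {

    // Reader: only pick up rows that have not been processed yet.
    @Bean
    public JdbcCursorItemReader<Record> reader(DataSource dataSource) {
        JdbcCursorItemReader<Record> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        reader.setSql("SELECT id, payload FROM records WHERE x IS NULL");
        reader.setRowMapper((rs, rowNum) -> new Record(rs.getLong("id"), rs.getString("payload")));
        return reader;
    }

    // Writer: fill column x so the next execution skips the row.
    @Bean
    public JdbcBatchItemWriter<Record> writer(DataSource dataSource) {
        JdbcBatchItemWriter<Record> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(dataSource);
        writer.setSql("UPDATE records SET x = :result WHERE id = :id");
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
        return writer;
    }
}

// Hypothetical item bean; "result" is what gets written into column x.
class Record {
    private final long id;
    private final String payload;
    private String result;

    Record(long id, String payload) { this.id = id; this.payload = payload; }
    public long getId() { return id; }
    public String getPayload() { return payload; }
    public String getResult() { return result; }
    public void setResult(String result) { this.result = result; }
}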

When using the @EnableBatchProcessing annotation, if you provide a DataSource bean definition called dataSource, Spring Batch will use that database for the job repository store instead of the in-memory map. You can read more about this functionality in the documentation here: http://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/core/configuration/annotation/EnableBatchProcessing.html
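A minimal configuration sketch, assuming a MySQL database (the driver class, URL, and credentials are placeholders):

import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    // Because this bean is named "dataSource", @EnableBatchProcessing builds
    // its JobRepository on top of it, so job and step execution state is
    // persisted and a restarted job can resume from the stored metadata.
    @Bean
    public DataSource dataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");  // placeholder
        ds.setUrl("jdbc:mysql://localhost:3306/batchdb"); // placeholder
        ds.setUsername("user");                           // placeholder
        ds.setPassword("password");                       // placeholder
        return ds;
    }
}

Note that the Spring Batch metadata tables must already exist in that database; the framework ships schema scripts (e.g. schema-mysql.sql) for creating them.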

Related

Can couchbase be used as the underlying JobRepository for spring-batch?

We have a requirement where we have to read a batch of an entity type from the database, submit info about each entity to a service which will call back later with some data to update in the caller entity, and then save all the caller entities with the updated data. We thought of using spring-batch; however, we use Couchbase as our database, which is eventually consistent and has no support for transactions.
I was going through the spring-batch documentation and I came across the Spring Batch Meta-Data ERD diagram here:
https://docs.spring.io/spring-batch/4.1.x/reference/html/index-single.html#metaDataSchema
With the above information in mind, my question is:
Can Couchbase be used as the underlying job repository for spring-batch? What are the things I should keep in mind if it's possible to use it? Any links to example implementations would be welcome.
The JobRepository needs to be transactional in order for Spring Batch to work properly. Here is an excerpt from the Transaction Configuration for the JobRepository section of the reference documentation:
The behavior of the framework is not well defined if the repository methods are not transactional.
Since Couchbase has no support for transactions as you mentioned, it is not possible to use it as an underlying datasource for the JobRepository.

Issue Insert/Update EF Core DbContext in Azure QueueTrigger Function (Multi-threading)

I'm getting a PK violation exception when using an EF Core 2.1 DbContext in an Azure QueueTrigger function. My guess is that this is due to DbContext not being thread-safe and the Azure Function running different instances in parallel. I have read quite a few posts, but I can't find a good approach to solve this.
Here is my scenario (producer-consumer pattern):
I have a scheduled Azure Function that calls an API to get Projects from different external systems. To get all the required info for a project, I need to run different queries against other external services, so I'm decoupling this into another Azure Function: the scheduled function just queues a message per Project, such as “Sync Project ID 101”.
Another QueueTrigger function fires every time a message is queued, which means different instances running in parallel. This function must gather all the data for a specific Project, and that means more calls to other external services/APIs to aggregate all the info about a Project. IMHO it's good to do it that way, as I can process multiple Projects in parallel, and I can scale the function if I need to.
Once I have all this Project info, I want to persist it in a SQL DB using EF Core (and here comes the issue).
Project data includes the Users in the Project, and each user has a specific GUID as PK (coming from the external system). That means I can have repeated User IDs across different function instances, and here is the problem: when I try to persist User info in a SQL table, I can get a PK duplication exception, as multiple function instances can try to insert the same User at the same time (when instance A checks whether the user exists, it gets false, but another instance B is actually adding this User, so when instance A tries the insert, it fails).
I guess I could lock the DbContext somehow, but I'm not sure that's good, as I also have a website running queries against the SQL DB (read-only queries for now, but there could be updates in the future too).
Another idea could be to send the entire Project info to another queue/blob file and have another function in Singleton mode that inserts the data into SQL.
I've created this project to simplify my scenario, but it's enough to reproduce the issue and understand the problem.
https://github.com/luismanez/queuetrigger-efcore-multithreading
Any other ideas or recommended approaches? (I'm open to changing the architecture if there's something better.)
Many thanks!
A "more easy" way could be to do some kind of upsert in the database. There is a sample of how to do that with EF Core: https://www.flexlabs.org/2018/02/adding-upsert-support-for-entity-framework-core

Spring batch job execution context and step execution context clarification needed

I am using Spring Batch 3.0.3 and need some clarification about not serializing the job execution context and step execution context, as we have large object sets and we don't want to persist them in the Spring Batch tables. Is there any way we can store just short_context and not the serialized object?
By default, no, because the ExecutionContext provides the data required for restartability. If you must do this (I'd encourage a different design), you'd have to implement your own ExecutionContextDao.
That being said, I'd encourage you not to go this route and to store your large object somewhere else. Even a Spring bean holding a Map that you use as a cache, not maintained by the framework, would be a better option IMHO, as sketched below.
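A minimal sketch of such a cache bean (all names are hypothetical):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CacheConfiguration {

    // An application-scoped cache that Spring Batch never touches, so
    // nothing in it is serialized into the batch metadata tables.
    // Unlike the ExecutionContext, it does not survive a JVM restart,
    // so it must not hold state needed for restartability.
    @Bean
    public Map<String, Object> largeObjectCache() {
        return new ConcurrentHashMap<>();
    }
}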

Spring batch - need to use an object in itemProcessor/itemWriter but not persist it

I need to access an object in both itemProcessor and itemWriter but I don't want to persist it in the executionContext. I would read this object in a pre-processing step.
What is the best way to do that?
So far what I have is: I put the object in the jobExecutionContext, then I set the scope of my itemProcessor to "step" and bind a property of the itemProcessor to "#{stepExecution.jobExecution.executionContext}" (a sketch of this binding follows below). This does give me access to my object, but I am stuck on two issues with this solution:
When do I remove the object from the context so that it doesn't stay persisted? It has to be after all the items are done.
My object could be huge, and it seems the column for the context is of size 2500.
Is this a good solution, and if it is, how do I solve the two concerns mentioned above? And if not, is there a good way to do this in Spring Batch, or is caching the best way to go?
Thanks.
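For reference, the late binding described in the question might look roughly like this (a sketch only; the item types and the "sharedObjectKey" entry are hypothetical):

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ProcessorConfiguration {

    // Step-scoped so the SpEL expression is resolved when the step runs.
    @Bean
    @StepScope
    public ItemProcessor<String, String> processor(
            @Value("#{stepExecution.jobExecution.executionContext}") ExecutionContext jobContext) {
        // Hypothetical key written into the job execution context by the pre-processing step.
        Object shared = jobContext.get("sharedObjectKey");
        return item -> item + "-" + shared;
    }
}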
The execution contexts (job and step) used by Spring Batch are meant to be persisted in the Spring Batch metadata tables, to support the restart feature, to name one!
What I have done previously is create a normal Spring bean holding the object you need and simply @Autowired it into your processor and writer, as sketched below.
Job done.
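A rough sketch of that approach (the holder and item types are hypothetical):

import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

// Plain Spring bean holding the shared object; because it never enters an
// ExecutionContext, Spring Batch never serializes it into its tables.
@Component
public class SharedObjectHolder {

    private volatile Object sharedObject; // populated by the pre-processing step

    public void set(Object value) { this.sharedObject = value; }
    public Object get() { return sharedObject; }
}

@Component
class MyProcessor implements ItemProcessor<String, String> {

    @Autowired
    private SharedObjectHolder holder;

    @Override
    public String process(String item) {
        // Use the shared object without touching the execution context.
        return item + "-" + holder.get();
    }
}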

Create new or update existing entity at one go with JPA

I have a JPA entity that has a timestamp field and is distinguished by a complex identifier field. What I need is to update the timestamp in an entity that has already been stored, and otherwise create and store a new entity with the current timestamp.
As it turns out, the task is not as simple as it seems at first sight. The problem is that in a concurrent environment I get a nasty "Unique index or primary key violation" exception. Here's my code:
// Load existing entity, if any.
Entity e = entityManager.find(Entity.class, id);
if (e == null) {
    // Could not find entity with the specified id in the database, so create new one.
    e = entityManager.merge(new Entity(id));
}
// Set current time...
e.setTimestamp(new Date());
// ...and finally save entity.
entityManager.flush();
Please note that in this example the entity identifier is not generated on insert; it is known in advance.
When two or more threads run this block of code in parallel, they may simultaneously get null from the entityManager.find(Entity.class, id) call, so they will attempt to save two or more entities at the same time with the same identifier, resulting in an error.
I think that there are a few solutions to the problem:
Sure, I could synchronize this code block with a global lock to prevent concurrent access to the database, but would that be the most efficient way?
Some databases support a very handy MERGE statement that updates an existing row or creates a new one if none exists. But I doubt that OpenJPA (the JPA implementation of my choice) supports it.
Even if JPA does not support SQL MERGE, I can always fall back to plain old JDBC and do whatever I want with the database. But I don't want to leave the comfortable API and mess with a hairy JDBC+SQL combination.
There is a magic trick to fix it using only the standard JPA API, but I don't know it yet.
Please help.
You are referring to the transaction isolation of JPA transactions, i.e. the behaviour of transactions when they access other transactions' resources.
According to this article:
READ_COMMITTED is the expected default Transaction Isolation level for using [..] EJB3 JPA
This means that, yes, you will have problems with the above code.
But JPA doesn't support custom isolation levels.
This thread discusses the topic more extensively. Depending on whether you use Spring or EJB, I think you can make use of the proper transaction strategy.
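One common workaround, sketched here for illustration (an assumption-laden sketch, not something taken from the thread above): let the insert lose the race, catch the violation, and redo the operation as an update.

import java.util.Date;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceException;

public class TimestampUpserter {

    // "Try insert, fall back to update". Each attempt must run in its own
    // transaction (e.g. REQUIRES_NEW in Spring or EJB), because a constraint
    // violation marks the current transaction rollback-only; the catch block
    // below stands in for that retry-in-a-fresh-transaction step.
    // Entity and its id are the ones from the question above.
    public void createOrUpdate(EntityManager em, Object id) {
        Entity e = em.find(Entity.class, id);
        if (e == null) {
            try {
                e = new Entity(id);
                e.setTimestamp(new Date());
                em.persist(e);
                em.flush(); // force the INSERT so a lost race surfaces here
                return;
            } catch (PersistenceException lostRace) {
                // Another transaction inserted the same id first;
                // reload (in a new transaction) and update instead.
                e = em.find(Entity.class, id);
            }
        }
        e.setTimestamp(new Date());
    }
}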