JdbcPagingItemReader with SELECT FOR UPDATE and SKIP LOCKED - PostgreSQL

I have a multi-instance application, and each instance is multi-threaded.
To make each thread process only rows that have not already been fetched by another thread, I'm thinking of using pessimistic locks combined with SKIP LOCKED.
My database is PostgreSQL 11 and I use Spring Batch.
For the Spring Batch part I use a classic chunk-oriented step (reader, processor, writer). The reader is a JdbcPagingItemReader.
However, I don't see how to use a pessimistic lock (SELECT FOR UPDATE) and SKIP LOCKED with the JdbcPagingItemReader, and I can't find a tutorial online that explains simply how this is done.
Any help would be welcome.
Thank you

I have approached a similar problem with a different pattern, remote chunking.
Please refer to
https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#remoteChunking
Here you need to break the job into two parts:
Master
The master picks the records to be processed from the DB and sends each chunk as a message to a queue (task-queue). It then waits for acknowledgements on a separate queue (ack-queue); once it has received all of them, it moves on to the next step.
Slave
The slave receives the message and processes it.
It then sends an acknowledgement to ack-queue.
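To make the handshake concrete, here is a minimal in-process sketch of it. It is not Spring Batch or Spring Integration code: two `BlockingQueue`s stand in for task-queue and ack-queue, and the class and method names (`RemoteChunkingSketch`, `runMaster`, `runSlave`) are made up for illustration. In a real setup the queues would live in a message broker and the slaves would run in other JVMs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-process stand-in for the master/slave handshake (hypothetical names).
class RemoteChunkingSketch {

    // Master: split the ids into chunks, send them, then wait for one ack per chunk.
    // Returns the total number of items the slaves acknowledged.
    static int runMaster(List<Integer> ids, int chunkSize, int workers) {
        BlockingQueue<List<Integer>> taskQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> ackQueue = new LinkedBlockingQueue<>();

        List<Thread> slaves = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            Thread t = new Thread(() -> runSlave(taskQueue, ackQueue));
            t.setDaemon(true);
            t.start();
            slaves.add(t);
        }

        int chunksSent = 0;
        for (int from = 0; from < ids.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, ids.size());
            taskQueue.add(new ArrayList<>(ids.subList(from, to)));
            chunksSent++;
        }

        int processed = 0;
        try {
            for (int i = 0; i < chunksSent; i++) {
                processed += ackQueue.take(); // blocks until a slave acknowledges a chunk
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for acks", e);
        } finally {
            slaves.forEach(Thread::interrupt); // all acks are in (or we failed): stop the slaves
        }
        return processed;
    }

    // Slave: take a chunk, process it, acknowledge with the number of items handled.
    static void runSlave(BlockingQueue<List<Integer>> taskQueue, BlockingQueue<Integer> ackQueue) {
        try {
            while (true) {
                List<Integer> chunk = taskQueue.take();
                // the real processing and writing of the chunk would go here
                ackQueue.add(chunk.size());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // the master asked us to stop
        }
    }
}
```

Because only the master reads from the DB, no two workers can fetch the same row, so row locking is not needed on the read side.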

Related

Scheduler Processing using Spring batch

We have a requirement to process millions of records. We plan to use Spring Batch for this: read the DB with JdbcPagingItemReaderBuilder, process in chunks, and write the chunks to a Kafka queue. The active consumers of the queue will process the chunks of data and update the DB.
The consumer's task is to iterate over every item in the chunk and invoke the external APIs.
If the external system is down or does not return a success response, there should be at least 3 retries. Considering that each task in the chunk has to do this, what would be the ideal approach?
Another use case to consider: what happens when the system goes down while the job is processing? Say the job has already processed 10000 records and the remaining records are yet to be processed. After the restart, how do we make sure the execution doesn't restart the entire process from the beginning and instead resumes from the point of failure?
Spring Batch creates the following tables. You can use them to check the status of your job and customize your scheduler to behave in whatever way you see fit.
I'd use the step execution ID in BATCH_STEP_EXECUTION to check the status that was set and then retry based on that status, or something along those lines.
BATCH_JOB_EXECUTION
BATCH_JOB_EXECUTION_CONTEXT
BATCH_JOB_EXECUTION_PARAMS
BATCH_JOB_INSTANCE
BATCH_STEP_EXECUTION
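As a rough sketch of that status check with plain JDBC: the query below assumes the default, unprefixed table names, and `StepStatusCheck`, `lastStatus` and `shouldRerun` are hypothetical names of mine. STEP_NAME, STEP_EXECUTION_ID and STATUS are columns of the standard Spring Batch schema, and FAILED and STOPPED are two of the status values stored there.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper for a custom scheduler; not part of Spring Batch.
class StepStatusCheck {

    // Newest execution first; assumes the default table name without a prefix.
    static final String LAST_STATUS_SQL =
        "SELECT STATUS FROM BATCH_STEP_EXECUTION "
      + "WHERE STEP_NAME = ? ORDER BY STEP_EXECUTION_ID DESC";

    // Returns the status of the most recent execution of the step, or null if it never ran.
    static String lastStatus(Connection con, String stepName) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(LAST_STATUS_SQL)) {
            ps.setString(1, stepName);
            ps.setMaxRows(1); // only the newest row is needed
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("STATUS") : null;
            }
        }
    }

    // Sketch of a scheduler policy: rerun only steps that ended in FAILED or STOPPED.
    static boolean shouldRerun(String status) {
        return "FAILED".equals(status) || "STOPPED".equals(status);
    }
}
```

A scheduler could call `shouldRerun(lastStatus(con, "myStep"))` before launching, where "myStep" is a placeholder for your step name.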

Quarkus Scheduled Records Processing mechanism Best Practice

What is the best practice for processing records from a DB on a schedule?
Situation:
A Microservice based on Quarkus - responsible for sending a communication to customers.
DB Table Having Customers Records (100000 customers)
Microservice is running on multiple nodes (4 nodes)
Expectation:
There should be a scheduler that runs every 5 sec
Fetches the records from DB where employee status = pending
Should be Multithreaded architecture.
Send email to employee email.
Problem 1:
The same scheduler running on multiple nodes picks up the same records and processes them. How can we avoid this?
Problem 2:
The scheduler picks 100 records and starts processing them, but that takes more than 5 seconds, so the scheduler runs again and picks up some of the same records. How can we avoid that?
If you are planning to run your microservices on Kubernetes, I would suggest using an external component as the scheduler and letting that component distribute the work over your microservices using messages or HTTP invocations.
In response to your two problems:
Problem 1: You can use a locking strategy, or "reserve" each row by adding a field that indicates the record is being processed and excluding every record with that field set from your query. That way, when the scheduler fires, it reads a set of unreserved rows and processes them with multiple threads. A locking strategy (pessimistic or optimistic) prevents other workers from marking the same row as reserved for themselves. The thread that managed to commit the reservation then processes the record and either updates its state or releases the reservation so that other workers can pick the record up if needed.
Problem 2: You can always instruct your scheduler not to execute if a previous execution is still running.
@Scheduled(identity = "ProcessUpdateScheduler", every = "2s", concurrentExecution = Scheduled.ConcurrentExecution.SKIP)
You mainly have two approaches, among other possible ones:
Pulling (distributed mining, or work distribution): Each instance of the microservice picks a random pending row, marks it as "processing", and commits the transaction. If the commit succeeds, that instance holds the right to process the record and continues with its execution. If not, it tries to retrieve a different row or just exits and waits for the next invocation. This approach scales horizontally, because adding more workers increases your processing throughput.
Pushing (central distribution, distributed processing): You have two kinds of components. The first is the "Distributor", which is run by the scheduler and is responsible for picking the rows to be processed and marking them as "pending processing". These rows are forwarded via a messaging system or an HTTP call to the "Processor". The Processor component receives a record as input and is responsible for either processing it completely or releasing the "pending processing" hold.
Choose whichever suits your scenario best. If you go for the second option you can have one or more distributors if necessary, but to increase your processing throughput you only need to scale the "Processor" workers.
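For the pulling approach, the pick-and-mark step can be done in one statement if the database is PostgreSQL (9.5 or later, where SKIP LOCKED is available). This is only a sketch under that assumption: the table `customer_mail`, its `id` and `status` columns, and the class `PendingRowClaimer` are hypothetical. Rows locked by a concurrent worker are skipped rather than waited on, so two workers never claim the same row.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical claimer for the "pulling" approach; PostgreSQL-specific SQL.
class PendingRowClaimer {

    // Lock up to N pending rows, skipping rows locked by other workers,
    // flip them to PROCESSING and return their ids, all in one statement.
    static final String CLAIM_SQL =
        "UPDATE customer_mail SET status = 'PROCESSING' "
      + "WHERE id IN ("
      + "  SELECT id FROM customer_mail WHERE status = 'PENDING' "
      + "  ORDER BY id LIMIT ? FOR UPDATE SKIP LOCKED"
      + ") RETURNING id";

    // Claims a batch and commits, so the reservation is visible to every other node.
    static List<Long> claim(Connection con, int batchSize) throws SQLException {
        boolean oldAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(CLAIM_SQL)) {
            ps.setInt(1, batchSize);
            List<Long> ids = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) { // RETURNING yields a result set
                while (rs.next()) {
                    ids.add(rs.getLong(1));
                }
            }
            con.commit();
            return ids;
        } catch (SQLException e) {
            con.rollback(); // nothing is reserved if the claim fails
            throw e;
        } finally {
            con.setAutoCommit(oldAutoCommit);
        }
    }
}
```

Since the status change is committed before processing starts, a later scheduler run on any node no longer sees those rows as pending. A worker that crashes mid-processing leaves rows stuck in PROCESSING, so you would still need some way to release stale reservations.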

SpringBoot batch listener mode vs non-batch listener mode

I am just curious: does batch listener mode in Spring Kafka give better performance than non-batch listener mode?
If we are handling exceptions, we still need to process each record individually in batch listener mode. Non-batch mode seems less error-prone, more stable and more customizable.
Please share your views on this as I didn't find any good comparison.
It completely depends on what your listener is doing with the data.
If it processes each record in a loop then there is no benefit; you might as well just let the container iterate over the collection and send the listener one record at-a-time.
Batch mode will improve performance if you are processing the batch as a whole - e.g. a batch insert using JDBC in a single transaction.
This will often run much faster than storing one record at-a-time (using a new transaction for each record) because it requires fewer round trips to the DB server.
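To illustrate the "batch as a whole" case, here is a small plain-JDBC sketch of a batch insert in a single transaction. The `events` table, its `payload` column and the class name `BatchInsertSketch` are assumptions for the example; in a batch listener, `payloads` would be the values of the records handed to the listener in one call. How much the driver actually batches on the wire depends on the driver and its settings.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical sink for a batch listener; table and column names are made up.
class BatchInsertSketch {

    static final String INSERT_SQL = "INSERT INTO events (payload) VALUES (?)";

    // Inserts all payloads as one JDBC batch inside one transaction.
    // Returns the number of statements in the executed batch.
    static int insertAll(Connection con, List<String> payloads) throws SQLException {
        boolean oldAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(INSERT_SQL)) {
            for (String payload : payloads) {
                ps.setString(1, payload);
                ps.addBatch(); // queue the row instead of sending it right away
            }
            int[] counts = ps.executeBatch();
            con.commit(); // one commit for the whole batch
            return counts.length;
        } catch (SQLException e) {
            con.rollback(); // the batch succeeds or fails as a unit
            throw e;
        } finally {
            con.setAutoCommit(oldAutoCommit);
        }
    }
}
```

The trade-off mentioned in the question shows up here too: if one row is bad, the whole batch rolls back, and you need your own logic to find and handle the failing record.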

Asynchronous SQL procedure execution set and wait for completion

Say I have a large set of calls to a procedure. The calls have varying parameters but are independent, so I want to make them in parallel/asynchronously. I use Service Broker to fire them all off, but I am looking for a neat way to wait for them all to complete (or error).
Is there a way to do this? I believe I could just loop with waits on the result table, checking it for completion, but that isn't very "event triggered". I'm hoping for a nicer way to do this.
I have used Service Broker with queue code and processing based on this other answer: Remus' service broker queuing example
Good day Shiv,
There are several ways (as always) to implement this requirement. One of them uses this logic:
(1) Create two queues: one is the trigger that executes the main SP that you want to run asynchronously, and the other is the trigger for whatever you want to execute after all the executions have ended.
(2) When you create a message in the first queue, also create a matching message in the second queue. The second queue tells us which executions have not ended yet. (The first queue only tells us which executions have not started, because once we START an execution we consume its message and remove it from the queue.)
(3) Inside the SP that you execute from the first (main) queue (this part runs synchronously):
(3.1) execute the queries you need
(3.2) clear the corresponding message from the second queue (meaning that this message is removed only after the queries have ended)
(3.3) check whether there are any messages left in the second queue. If there are none, all the tasks have ended and you can execute your final step
** Theoretically, instead of using the second queue you can store the data in a table, but the second queue should probably give better performance than updating a table each time an execution ends. Either way, you should test the table option as well.
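The bookkeeping in steps (2), (3.2) and (3.3) can be shown outside SQL Server. The Java sketch below is only an in-process analogy, not Service Broker code: a concurrent set plays the role of the second queue, and `CompletionTracker` and its methods are hypothetical names. One detail I added on top of the steps above is a guard so that the final step runs only once even if two tasks finish at the same moment and both see the set empty.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// In-process analogy of the "second queue": tracks which tasks have not ended yet.
class CompletionTracker {

    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final AtomicBoolean finalStepDone = new AtomicBoolean(false);
    private final Runnable finalStep;

    CompletionTracker(Runnable finalStep) {
        this.finalStep = finalStep;
    }

    // Step (2): record the task as "not ended yet" when it is dispatched.
    // All tasks should be registered before the first one can finish,
    // otherwise the final step could fire too early.
    void register(String taskId) {
        pending.add(taskId);
    }

    // Steps (3.2) and (3.3): clear the task's entry, then check whether anything is left.
    void markDone(String taskId) {
        pending.remove(taskId);
        if (pending.isEmpty() && finalStepDone.compareAndSet(false, true)) {
            finalStep.run(); // every registered task has ended
        }
    }

    int pendingCount() {
        return pending.size();
    }
}
```

A task that fails should still call `markDone` (or record its error somewhere and then call it), otherwise the final step never fires.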

Retry failed writing operations without delaying other steps in Spring Batch application

I am maintaining a legacy application written using Spring Batch and need to tweak it to never lose data.
I have to read from various web services (one for each step) and then write to a remote database. Things go bad when the connection to the DB drops: all items read from the web service are discarded (I can't read the same item twice), and the data is lost because it cannot be written.
I need to set up Spring Batch so that a step keeps the data it has already read and retries the write operation the next time the step runs. The same step must not read more data until the write operation has completed successfully.
When a step is unable to write, it should keep the read data and pass execution to the next step. After a while, when it is time for the failed step to run again, it should not read another item but retry the failed write operation instead.
The batch application should run in an infinite loop, and each step should gather data from a different source. Failed write operations should be skipped for the moment (keeping the read data) so as not to delay other steps, but should resume from the write operation the next time they are called.
I have been researching various web sources besides the official docs, but Spring Batch doesn't have the most intuitive docs I have come across.
Can this be achieved? If yes, how?
You can write the data you need to persist in case the job fails to the Batch Step's ExecutionContext. You can restart the job again with this data:
Step executions are represented by objects of the StepExecution class.
Each execution contains a reference to its corresponding step and
JobExecution, and transaction related data such as commit and rollback
count and start and end times. Additionally, each step execution will
contain an ExecutionContext, which contains any data a developer needs
persisted across batch runs, such as statistics or state information
needed to restart
More from: http://static.springsource.org/spring-batch/reference/html/domain.html#domainStepExecution
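The idea of holding the read item until the write succeeds can be sketched without any Spring classes. In the sketch below a plain `Map` stands in for the step's ExecutionContext, and `HoldAndRetryStep`, `PENDING_KEY` and `runOnce` are hypothetical names. Wiring this to the real ExecutionContext so the held item survives a JVM restart is not shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Plain-Java sketch: keep the read item until the write succeeds.
class HoldAndRetryStep {

    static final String PENDING_KEY = "pending.item"; // hypothetical key name

    // Stand-in for the step's ExecutionContext; a plain map in this sketch.
    final Map<String, Object> context = new HashMap<>();
    private final Supplier<String> reader;
    private final Consumer<String> writer;

    HoldAndRetryStep(Supplier<String> reader, Consumer<String> writer) {
        this.reader = reader;
        this.writer = writer;
    }

    // One invocation of the step. Returns true if an item was written,
    // false if there was nothing to read or the write failed (item is kept).
    boolean runOnce() {
        String item = (String) context.get(PENDING_KEY);
        if (item == null) {
            item = reader.get(); // read only when nothing is being held
            if (item == null) {
                return false; // source had nothing for us
            }
            context.put(PENDING_KEY, item); // remember it before trying to write
        }
        try {
            writer.accept(item);
            context.remove(PENDING_KEY); // written: the next run may read again
            return true;
        } catch (RuntimeException e) {
            return false; // keep the item; the next run retries the write without reading
        }
    }
}
```

Each step would get its own instance, so a step whose write failed returns quickly and lets the other steps run, then retries the same item on its next turn.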
I do not know if this will be ok with you, but here are my thoughts on your configuration.
Since you have two remote sources that are open to failure, let us partition the overall system with two jobs (not two steps)
JOB A
Step 1: Tasklet
Check a shared folder for files. If files exist, do not proceed to the next step. The reason for this will become clearer in the description of JOB B.
Step 2: Webservice to files
Read from your web service and write results to flatfiles in the shared folder. Since you would be using flatfiles for the output, you will solve your "all items read from webservice are discarded and the data is lost because can not be written."
Use Quartz or equivalent for the scheduling of this job.
JOB B
Poll the shared folder for generated files and launch the job through a JobLauncher with the file (file.getWhere as a job parameter). The Spring Integration project may help with this polling.
Step 1:
Read from the file, write them to remote db and move/delete file if writing to db is successful.
No scheduling will be needed since job launching originates from polled in files.
Sample Execution
Time 0: No file in the shared folder
Time 1: Read from web service and write to shared folder
Time 2: Job B file polling occurs, tries to write to db.
If successful, the system continues to execute.
If not, when Job A tries to execute on its scheduled time, it will skip reading from web service since files still exist in the shared folder. It will skip until Job B consumes the files.
I did not want to go into implementation specifics but Spring Batch can handle all of these situations. Hope that this helps.
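For what it's worth, the gate between the two jobs is small enough to sketch with plain `java.nio.file`, leaving out Spring Batch, Quartz and the polling itself. The names below (`SharedFolderJobs`, `runJobA`, `runJobB`) are hypothetical, and the DB write is passed in as a predicate that returns true on success.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the Job A / Job B hand-off through a shared folder.
class SharedFolderJobs {

    // Job A: skip the web service read while earlier files are still waiting (Step 1),
    // otherwise store the freshly read records as a flat file (Step 2).
    // Returns true if a file was written, false if the run was skipped.
    static boolean runJobA(Path sharedDir, String fileName, List<String> records) {
        try {
            if (hasFiles(sharedDir)) {
                return false; // Job B has not consumed the previous data yet
            }
            Files.write(sharedDir.resolve(fileName), records);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Job B: try to write each file to the DB; delete the file only if that worked.
    // Returns the number of files consumed.
    static int runJobB(Path sharedDir, Predicate<List<String>> dbWriter) {
        int consumed = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(sharedDir)) {
            for (Path file : files) {
                List<String> records = Files.readAllLines(file);
                if (dbWriter.test(records)) {
                    Files.delete(file); // data is safely in the DB
                    consumed++;
                } // on failure the file stays and is retried on the next poll
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return consumed;
    }

    private static boolean hasFiles(Path dir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            return files.iterator().hasNext();
        }
    }
}
```

One thing to watch in a real setup is Job B picking up a file that Job A is still writing; writing under a temporary name and renaming when done is a common way around that.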