How to design Spring Batch to avoid a long queue of requests and restart failed jobs - spring-batch

I am writing a project that generates reports. It reads all requests from a database via a REST call; based on the type of each request it makes a REST call to an endpoint, and after getting the response it stores the result in an object and saves it back to the database by calling another endpoint.
I am using spring-batch to handle the batch work. So far what I have come up with is a single job (reader, processor, writer) that does the whole thing. I am not sure this is the correct design, considering:
I do not want to queue up requests if some request is taking a long time to get a response back. [not sure yet]
I do not want to hold up saving responses until all the responses are received. [using commit-interval will help]
If the job crashes for some reason, how can I restart the job? [maybe using batch-admin will help, but what are my other options]

With chunk-oriented processing, the Reader, Processor and Writer are executed in order until the Reader has nothing left to return.
If you can read one item at a time, process it, and send it to the endpoint that handles persistence, this approach works well.
If you must read ALL the information at once, the reader will get one big collection with all the items and pass it to the processor. The processor will process all the items and send the result to the writer. You cannot send just a few items to the writer, so you would have to do the persistence directly from the processor, and that would be against the design.
So, as I understand this, you have two options:
Design a reader that can read one item at a time. Use the chunk-oriented processing that you already started to read one item, process it and send it back for persistence (see the sketch below). Have a look at how other readers are implemented (like JdbcCursorItemReader).
Create a tasklet that reads the whole collection of items, processes it, and sends the results back for persistence. You can break this up into different tasklets.
commit-interval only controls after how many items the transaction is committed, so it will not help you here, as all the processing and persistence is done by calling REST services.
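To make the first option concrete, here is a minimal sketch of a chunk-oriented step using Spring Batch Java configuration (Spring Batch 4 style). Everything named below (ReportJobConfig, reportStep, the Request/Response placeholder types, the chunk size of 10) is invented for illustration; the reader, processor and writer beans would wrap the REST calls described in the question.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class ReportJobConfig {

    // Placeholder domain types standing in for the project's own request/response objects.
    public static class Request { }
    public static class Response { }

    @Bean
    public Step reportStep(StepBuilderFactory steps,
                           ItemReader<Request> requestReader,                    // reads one request at a time
                           ItemProcessor<Request, Response> restCallProcessor,   // calls the downstream endpoint
                           ItemWriter<Response> responseWriter) {                // posts responses back for persistence
        return steps.get("reportStep")
                .<Request, Response>chunk(10)   // commit-interval: responses are written every 10 items
                .reader(requestReader)
                .processor(restCallProcessor)
                .writer(responseWriter)
                .build();
    }

    @Bean
    public Job reportJob(JobBuilderFactory jobs, Step reportStep) {
        return jobs.get("reportJob").start(reportStep).build();
    }
}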

I have figured out a design and I think it will work fine.
As for the questions that I asked, following are the answers:
Using asynchronous processors will help avoid building up a queue (see the sketch after this list).
http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#asynchronous-processors
Using commit-interval will solve it.
This thread has the answer - Spring batch: Restart a job and then start next job automatically
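For the asynchronous-processor point, a minimal sketch of wiring AsyncItemProcessor and AsyncItemWriter from spring-batch-integration. The delegate beans and the Request/Response types (the same placeholders as in the earlier sketch) are assumptions; the slow, REST-calling processor goes behind the AsyncItemProcessor so one slow item no longer blocks the rest of the chunk.

import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class AsyncStepConfig {

    // Request and Response are the placeholder domain types from the earlier sketch.

    // Wraps the existing (slow, REST-calling) processor so each item is processed on its own thread.
    @Bean
    public AsyncItemProcessor<Request, Response> asyncProcessor(ItemProcessor<Request, Response> delegate) {
        AsyncItemProcessor<Request, Response> async = new AsyncItemProcessor<>();
        async.setDelegate(delegate);
        async.setTaskExecutor(new SimpleAsyncTaskExecutor());
        return async;
    }

    // Unwraps the Futures produced by the AsyncItemProcessor and hands the results to the real writer.
    @Bean
    public AsyncItemWriter<Response> asyncWriter(ItemWriter<Response> delegate) {
        AsyncItemWriter<Response> async = new AsyncItemWriter<>();
        async.setDelegate(delegate);
        return async;
    }
}

The step is then declared with chunk types <Request, Future<Response>> so that the futures produced by the AsyncItemProcessor reach the AsyncItemWriter, which waits for each one before delegating to the normal writer.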

Related

How to boot up a Microservice from events

Say I have a shop application and I want to run some complicated validations for an operation such as adding a product.
Events are the single source of truth in my system.
Adding a product is represented by a ProductAdded message.
The microservice responsible for validating the product reads a message, validates it, and produces a ProductValidated message.
But what happens if I want the microservice to boot up from zero?
On bootup, each and every message is reprocessed, resulting in redundant, duplicated validation for each consumed message.
This could be solved by first reading all messages from the messaging queue and, only when all messages are loaded, starting an asynchronous validation process.
But how can it ensure that all messages are loaded? Maybe messages are produced quicker than the process of building the state from events. A solution could be querying the messaging queue for the total number of messages at a given moment, then reading and processing all of them, then querying and processing again.
The problem is that this doesn't seem to me like a typical solution for this challenge. I want to find out what the popular practice is in this situation.
You have a few options:
A KTable, which you aggregate by shopping cart (each shopping cart cannot have the same product twice); see the sketch below. To prevent this from growing too big, records need to be 'tombstoned', so something else needs to tell the app that a shopping cart is no more.
Remember that to do any kind of aggregation in Kafka, you need local storage. If you don't want or cannot have local storage, Kafka is the wrong tool.
I don't fully understand your points under "But there is a problem with the microservice validation process". The first says no caching or local storage, and the second says load everything (which implies caching in local storage).
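As a rough illustration of the KTable option: the topic and store names below are made up, and the value is kept as a plain String only to keep the sketch self-contained; a real cart value would need its own Serde.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class CartTableSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Latest state per shopping cart, keyed by cart id and backed by a local state store.
        // A record with a null value (a "tombstone") removes that cart from the table,
        // which is what keeps the table from growing forever.
        KTable<String, String> carts = builder.table(
                "cart-events",                                   // hypothetical topic name
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("cart-store")); // hypothetical store name

        // ... join against or query the table from here ...
    }
}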
--- Edit
You can check this example from Confluent that does validation on orders: https://github.com/confluentinc/kafka-streams-examples/tree/5.4.1-post/src/main/java/io/confluent/examples/streams/microservices .
If I understand you correctly, you can have a local store that doesn't have a changelog, so you can re-populate it on restart.
Check the class InventoryService.java; there you can see how to create a separate store. The line you want to omit is .withLoggingEnabled(), as that creates a changelog topic.
final StoreBuilder reservedStock = Stores
    .keyValueStoreBuilder(Stores.persistentKeyValueStore(RESERVED_STOCK_STORE_NAME),
        Topics.WAREHOUSE_INVENTORY.keySerde(), Serdes.Long());
builder.addStateStore(reservedStock);
The 2 other things you'll need to do are:
Configure the stream to go back to the earliest record: config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
Have a bit of code that finds the store and wipes it before you build the stream. Check this blog post by Confluent, in the section Local State Stores, which explains how to find the directory where the local files are stored, so you can wipe it.
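Putting those two pieces together, a hedged sketch (the application id and bootstrap servers are placeholders). Instead of locating and deleting the state directory by hand, KafkaStreams#cleanUp() can wipe this application's local state, but it must be called before start():

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class RebuildOnRestart {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "validation-service");   // placeholder
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
        // Re-read the input topics from the beginning when there are no committed offsets.
        config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        StreamsBuilder builder = new StreamsBuilder();
        // ... add the non-logged store and the rest of the topology here ...

        KafkaStreams streams = new KafkaStreams(builder.build(), config);
        streams.cleanUp();   // wipes this application's local state directory; only allowed before start()
        streams.start();
    }
}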

Spring batch single threaded reader and multi threaded writer

Tried to find if this was asked before but couldn't.
Here is the problem. The following has to be achieved via Spring Batch:
There is one file to be read and processed. The item reader is not thread safe.
The plan is to have multithreaded homogeneous processors and multithreaded homogeneous writers ingest items read by a single-threaded reader.
Kind of like below:
----------> Processor #1 ----------> Writer #1
|
Reader -------> Processor #2 ----------> Writer #2
|
----------> Processor #3 ----------> Writer #3
I tried AsyncItemProcessor and AsyncItemWriter, but holding a debug point on the processor resulted in the reader not executing until the point was released, i.e. single-threaded processing.
A task executor was also tried, like below:
<tasklet task-executor="taskExecutor" throttle-limit="20">
Multiple threads on the reader were launched.
Synchronising the reader also didn't work.
I tried to read about partitioner but it seemed complex.
Is there an annotation to mark the reader as single threaded? Would pushing read data to Global context be a good idea?
Please guide towards a solution.
I don't think anything is built into the Spring Batch API for the pattern you are looking for; some coding on your part would be needed to achieve it.
The method ItemWriter.write already receives a List of processed items (sized by your chunk size), so you can divide that list among as many threads as you like: spawn your own threads and pass a segment of the list to each thread to write (see the sketch after this answer).
The problem is with ItemProcessor.process(), as it processes items one by one, so you are limited to a single item at a time and can't do much threading for a single item.
So the challenge is to write your own reader that hands over a list of items to the processor instead of a single item, so you can process those items in parallel, and the writer will then work on a list of lists.
In all of this setup, remember that the threads you spawn will be outside the read-process-write transaction boundary of Spring Batch, so you will have to take care of that yourself: merging the processed output from all threads, waiting until all threads are complete, and handling any errors. All in all, it's quite risky.
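Purely to illustrate the idea, a sketch of a writer that splits each chunk across its own thread pool. All names are made up, and, as noted above, these threads run outside Spring Batch's transaction boundary, so failures have to be surfaced manually (here by re-throwing from Future.get()).

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.springframework.batch.item.ItemWriter;

// Splits every incoming chunk across a fixed pool of threads and waits for all of them to finish.
public class ParallelDelegatingWriter<T> implements ItemWriter<T> {

    private static final int THREADS = 4;

    private final ItemWriter<T> delegate;
    private final ExecutorService pool = Executors.newFixedThreadPool(THREADS);

    public ParallelDelegatingWriter(ItemWriter<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends T> items) throws Exception {
        int segment = Math.max(1, (int) Math.ceil(items.size() / (double) THREADS));
        List<Future<?>> futures = new ArrayList<>();
        for (int start = 0; start < items.size(); start += segment) {
            List<? extends T> slice = items.subList(start, Math.min(start + segment, items.size()));
            futures.add(pool.submit(() -> {
                delegate.write(slice);   // each thread writes its own segment of the chunk
                return null;
            }));
        }
        for (Future<?> f : futures) {
            f.get();   // wait for every thread and propagate any failure back to the step
        }
    }
}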
Making a item reader to return a list instead single object - Spring batch
Came across this with a similar problem at hand.
Here's how I am doing it at the moment. As @mminella suggested, a synchronized ItemReader with the FlatFileItemReader as delegate (a minimal sketch follows below). This works with decent performance: the code writes about ~4K records per second at the moment, but the speed doesn't depend entirely on the design; other factors contribute as well.
I tried other approaches to increase performance; both more or less failed.
A custom synchronized ItemReader that aggregates items, with FlatFileItemReader as delegate, but I ended up maintaining a lot of state, which caused a performance drop. Maybe the code needed optimization, or maybe plain synchronization is simply faster.
Firing each insert PreparedStatement batch on a different thread, which didn't improve performance much, but I'm still counting on it in case I run into an environment where individual threads per batch would give a significant boost.
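For reference, a minimal sketch of the synchronized-delegate setup described above, using Spring Batch's SynchronizedItemStreamReader around a FlatFileItemReader. The file name is invented, and each item is kept as a raw String line just to stay self-contained; the real reader would map lines to domain objects.

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.core.io.FileSystemResource;

public class ReaderConfig {

    // Non-thread-safe file reader wrapped so that a multi-threaded step can call read() safely.
    public SynchronizedItemStreamReader<String> synchronizedReader() {
        FlatFileItemReader<String> delegate = new FlatFileItemReader<>();
        delegate.setResource(new FileSystemResource("input.csv"));   // hypothetical input file
        delegate.setLineMapper(new PassThroughLineMapper());         // each item is the raw line

        SynchronizedItemStreamReader<String> reader = new SynchronizedItemStreamReader<>();
        reader.setDelegate(delegate);   // all read() calls are serialized on the delegate
        return reader;
    }
}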

Process messages from Azure in LIFO

I am using the Azure REST API to read messages from an Azure Queue using Peek-Lock Message. Is there any way I can read the last message that was posted in the queue rather than reading in the usual queue-based (FIFO) order?
Also, is there a faster way to process messages from Azure other than using the Peek-Lock Message REST API?
Thanks!
Is there any way I can read the last message that was posted in the
queue rather than reading from a queue based mechanism (FIFO)?
Using the REST API, unfortunately there's no way to process the last message first. You would have to implement something on your own. If you know that your queue can't have more than 32 messages at a time, you could possibly get all 32 messages in one go and sort them on the client side based on the message insertion time. Yet another (crazy) idea would be to create a new queue for each message and name the queue using the following pattern: "q"-(DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks). Now list queues and get only the 1st queue. This will give you the message you last inserted.
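If the queue in question is an Azure Storage queue, here is a rough sketch of the "fetch up to 32 messages and sort client-side" idea, using the azure-storage-queue Java SDK instead of raw REST. The SDK calls used here (receiveMessages, getInsertionTime) are my assumption about the v12 client and worth verifying against the SDK version in use.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import com.azure.storage.queue.QueueClient;
import com.azure.storage.queue.QueueClientBuilder;
import com.azure.storage.queue.models.QueueMessageItem;

public class NewestFirst {
    public static void main(String[] args) {
        QueueClient queue = new QueueClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING")) // placeholder
                .queueName("jobs")                                                  // placeholder
                .buildClient();

        // Dequeue up to 32 messages in one call (the service maximum), then order newest-first client-side.
        List<QueueMessageItem> batch = new ArrayList<>();
        queue.receiveMessages(32).forEach(batch::add);
        batch.sort(Comparator.comparing(QueueMessageItem::getInsertionTime).reversed());

        if (!batch.isEmpty()) {
            QueueMessageItem newest = batch.get(0);
            System.out.println("Most recently inserted message id: " + newest.getMessageId());
        }
    }
}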
Also, is there a faster way to process messages from Azure other than
using the Peek-Lock Message REST API?
One possibility could be to fetch more than one message from the queue at a time and process them in parallel on the client side.

MSMQ as a job queue

I am trying to implement a job queue with MSMQ to save myself some time over implementing it in SQL. After reading around, I realized MSMQ might not offer what I am after. Could you please advise me whether my plan is realistic with MSMQ, or recommend an alternative?
I have a number of processes picking up jobs from a queue (I might need to scale out in the future). Once a job is picked up, processing follows; during this time the job is locked to other processes by its status. If needed, the job is chucked back (its status changes again) to the queue for further processing, but physically the job still sits in the queue until completed.
MSMQ doesn't let me keep the message in the queue while working on it; I can only peek or read. Read takes the message out of the queue, and peek doesn't allow changing the message (status).
Thank you
Using MSMQ as a datastore is probably bad as it's not designed for storage at all. Unless the queues are transactional the messages may not even get written to disk.
Certainly updating queue items in-situ is not supported for the reasons you state.
If you don't want a full-blown relational DB you could use an in-memory cache of some kind, like memcached, or a cheap object DB like RavenDB.
Take a look at RabbitMQ, or one of the many other message queues. Most offer this functionality out of the box.
For example, RabbitMQ calls what you are describing Work Queues. Multiple consumers can pull from the same queue without pulling the same item. Furthermore, if you use acknowledgements and the processing fails, the item is not removed from the queue.
.net examples:
https://www.rabbitmq.com/tutorials/tutorial-two-dotnet.html
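The linked tutorial is for .NET; a rough Java equivalent of the same work-queue pattern looks like the sketch below (the queue name and connection details are placeholders). With autoAck set to false, a message that is never acknowledged, for example because the consumer died mid-processing, is redelivered to another consumer rather than lost.

import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                               // placeholder
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        channel.queueDeclare("jobs", true, false, false, null);    // durable queue, placeholder name
        channel.basicQos(1);   // hand each worker only one unacknowledged message at a time

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            long tag = delivery.getEnvelope().getDeliveryTag();
            try {
                String job = new String(delivery.getBody(), StandardCharsets.UTF_8);
                // ... process the job ...
                channel.basicAck(tag, false);                       // done: remove the message from the queue
            } catch (Exception e) {
                channel.basicNack(tag, false, true);                // failed: requeue for another consumer
            }
        };

        channel.basicConsume("jobs", false /* manual ack */, onDeliver, consumerTag -> { });
    }
}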
EDIT: After using MSMQ myself, it would probably work very well for what you are doing, as far as I can tell. The key is to use transactions and multiple queues. For example, each status should have its own queue. It's fairly safe to "move" messages from one queue to another since it occurs within a transaction. This moving of messages is essentially your change of status.
We also use the Message Extension byte array for storing message metadata, like status. This way we don't have to alter the actual message when moving it to another queue.
MSMQ, and queues in general, require a different set of patterns than most programmers are used to. Keep that in mind.
Perhaps, if you can give more information on why you need to peek for messages that are currently in process, there would be a way to handle that scenario with MSMQ. You could always add a database for additional tracking.

Message queues and database inserts

I'm new to message queues and am intrigued by their capabilities and use. I have an idea about how to use one but wonder if it is the best use of this tool. I have an application that picks up and reads spreadsheets, then transforms the data into business objects for database storage. My application needs to read and be able to update several hundred thousand records, but I'm running into performance issues holding onto these objects and bulk inserting into the database.
Would having two different applications (one to read the spreadsheets, one to store the records) connected by a message queue be a proper use of a message queue? Obviously there are some optimizations I need to make in my code, and that is going to be my first step, but I wanted to hear thoughts from those who have used message queues.
It wouldn't be an improper use of a queue, but it's hard to tell whether, in your scenario, adding a message queue will have any effect on the performance problems you mentioned. We would need more information.
Are you adding one message to a queue to tell a process to convert a spreadsheet, and a second message when the data is ready for loading? Or are you thinking of adding one message per data record? (That might get expensive fast, and probably won't improve performance.)
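As an aside on the bulk-insert bottleneck mentioned in the question: before reaching for a queue, batching the inserts through JDBC is usually the first optimization worth trying. A hedged sketch with invented connection URL, table and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {

    // Inserts records in batches of 1000 inside a single transaction.
    public static void insert(List<String[]> rows) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/reports")) { // placeholder URL
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO records (col_a, col_b) VALUES (?, ?)")) {       // hypothetical table
                int count = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                    if (++count % 1000 == 0) {
                        ps.executeBatch();   // send accumulated rows to the database in one round trip
                    }
                }
                ps.executeBatch();           // flush the final partial batch
            }
            con.commit();
        }
    }
}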