Spring Batch reader for a frequently modified source

I'm using Spring Batch and want to write a job with a JPA reader that selects paginated sets of products from the database. A processor then performs some operation on each product (say, product A), but while handling product A it also processes some other products (product B, product C, etc.). Later the reader hands product B to the processor because it is next in line, even though it has already been processed, so processing it again wastes time and resources. How should one tackle this? Is there a modification-aware item reader in Spring Batch? One solution would be to check in the item processor whether the product has already been processed and process it only if it hasn't. However, checking whether a product has been processed is very resource-consuming.

There are two approaches here that I'd consider:
Adjust what you call an "item" - An item is what is returned from the reader. Depending on the design of things, you may want to build a more complex reader that can include the dependent items and therefore only loop through them once. Obviously this is very dependent upon your specific use case.
Use the Process Indicator pattern - This is exactly what the process indicator pattern is for. As you process items, set a flag in the db indicating that they have been processed. Your reader's query is then configured to read only those that have not been processed (filtering out those that were updated during the process phase).
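To make the process indicator idea concrete, here is a minimal in-memory sketch in plain Java. It is not Spring Batch code: `ProcessIndicatorSketch`, `Product`, `readPage` and `process` are hypothetical names, and the map stands in for a product table with a `processed` flag column. The sketch also assumes the reader always re-queries the first page of unprocessed rows, since the result set shrinks as flags are set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for a product table with a "processed" flag column.
class ProcessIndicatorSketch {

    static final class Product {
        final String id;
        final List<String> relatedIds; // products handled as a side effect
        boolean processed;             // the process indicator

        Product(String id, List<String> relatedIds) {
            this.id = id;
            this.relatedIds = relatedIds;
        }
    }

    private final Map<String, Product> table = new LinkedHashMap<>();
    private int processCalls = 0;

    void add(String id, String... relatedIds) {
        table.put(id, new Product(id, Arrays.asList(relatedIds)));
    }

    // Reader: like "SELECT ... WHERE processed = false", limited to pageSize rows.
    List<Product> readPage(int pageSize) {
        List<Product> page = new ArrayList<>();
        for (Product p : table.values()) {
            if (!p.processed) {
                page.add(p);
                if (page.size() == pageSize) break;
            }
        }
        return page;
    }

    // Processor: handles the item plus its related products and flags them all.
    void process(Product p) {
        processCalls++;
        p.processed = true;
        for (String relatedId : p.relatedIds) {
            Product related = table.get(relatedId);
            if (related != null) related.processed = true;
        }
    }

    // Re-reads the first page of unprocessed rows until none are left.
    // Returns how many times process() was actually invoked.
    int run(int pageSize) {
        List<Product> page;
        while (!(page = readPage(pageSize)).isEmpty()) {
            for (Product p : page) {
                // An earlier item in this page may have flagged p in the meantime.
                if (!p.processed) process(p);
            }
        }
        return processCalls;
    }

    public static void main(String[] args) {
        ProcessIndicatorSketch job = new ProcessIndicatorSketch();
        job.add("A", "B", "C");
        job.add("B");
        job.add("C");
        job.add("D");
        System.out.println("process() calls: " + job.run(2));
    }
}
```

In a real job the flag check inside the page loop would only be a cheap field read on an already loaded entity; the expensive "has this been processed?" lookup is replaced by the filtered reader query.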

Related

How to communicate between order placing system and trade matching system

Order Placing System
A user places an order and the corresponding amount is kept on hold for the order and an order is created. This order is then pushed to some queue to be used by trade matching system. User gets back a reference order id for the order placed in return to the API call.
Trade Matching System
The system consumes data from the queue populated by the order placing system and looks for possible matches. If a match can be executed, it executes it and pushes the result to another queue.
User Notification System
The system fetches data from the executed queue and broadcasts it to the user it belongs to. The user can also fetch the status of the order using the reference id that was shared on the first API call.
These systems currently communicate indirectly via a queue. The new requirement is that when a user places an order, the order placing system must return the execution status along with the order id (i.e. whether it got executed and, if so, the rate, fee charged, etc.).
What should the mode of communication between the order placing and trade matching systems be, so that execution details can be returned in the first API call itself?
Challenges
Matching System being single threaded, we cannot merge it with order engine
Polling the execution queue and waiting for the result will probably make our order placing API slow
Right now our order placing system and matching system are separate.
Just looking for possible solution and opinion. Please let me know if something is unclear.

Spring batch entire Job in transaction boundary

I have a use-case for which I could use spring batch job which I could design in following ways.
1) First Way:
Step1 (Chunk oriented step): Read from the file —> filter, validate and transform the read row into DTO (data transfer object), if there are any errors, store errors in DTO itself —> Check if any of the DTOs has errors , if not write to Database. If yes, write to an error file.
However, problem with this way is - I need this entire JOB in transaction boundary. So if there is a failure in any of the chunks then I don’t want to write to DB and want to rollback all successful writes till that point in DB. Above way forces me to write rollback logic for all successful writes if there is a failure in any of the chunks.
2) Second way
Step 1 (Chunk oriented step): Read items from the file —> filter, validate and transform the read row in DTO (data transfer object). This does store the errors in the DTO object itself.
Step 2 (Tasklet): Read entire list (and not chunks) of DTOs created from step 1 —> Check if any of the DTOs has errors populated in it. If yes, then abort the writing to DB and fail the JOB.
In second way, I get all benefits of chunk processing and scaling. At the same time I have created transaction boundary for entire job.
PS: In both ways in their first step there won’t be any step failure, if there is failure; errors are stored in DTO object itself. Thus, DTO object is always created.
Question: since I am new to Spring Batch, is the second way a good pattern to go with? And is there a way to share data between steps so that the entire list of DTOs is available to the second step (in the second way above)?
In my opinion, trying to process the entire file in a single transaction (i.e. a transaction at the job level) is not the way to go. I would proceed in two steps:
Step 1: process the input and write any errors to the error file
Step 2: this step is conditioned on step 1. If no errors have been detected in step 1, then save the data to the db.
This approach does not require writing data to the database and rolling it back if there are errors (as suggested by option 1 in your description). It only writes to the database when everything is ok.
Moreover, this approach does not require holding a list of items in memory as suggested by option 2, which could be inefficient in terms of memory usage and perform poorly if the file is big.
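As a rough illustration of the two-step flow described in this answer, here is a plain Java sketch (no Spring Batch types). The names `ValidateThenLoadSketch`, `validate`, `load` and `runJob` are hypothetical, the `id,name` line format is an assumption, and the lists stand in for the input file, the error file and the database. The input is read twice, once per step, so no list of DTOs has to be carried between steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ValidateThenLoadSketch {

    // Step 1: validate every line and collect error messages.
    // Nothing is written to the database in this step.
    static List<String> validate(List<String> lines) {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String[] fields = lines.get(i).split(",");
            boolean valid = fields.length == 2
                    && !fields[0].trim().isEmpty()
                    && !fields[1].trim().isEmpty();
            if (!valid) {
                errors.add("line " + (i + 1) + ": expected 'id,name'");
            }
        }
        return errors;
    }

    // Step 2: re-read the input and write it to the "database".
    // Only called when step 1 reported no errors.
    static void load(List<String> lines, List<String> db) {
        for (String line : lines) {
            db.add(line.trim());
        }
    }

    // Job flow: step 2 is conditioned on the outcome of step 1.
    static boolean runJob(List<String> lines, List<String> errorFile, List<String> db) {
        errorFile.addAll(validate(lines));
        if (!errorFile.isEmpty()) {
            return false; // fail the job; the database was never touched
        }
        load(lines, db);
        return true;
    }

    public static void main(String[] args) {
        List<String> errorFile = new ArrayList<>();
        List<String> db = new ArrayList<>();
        boolean ok = runJob(Arrays.asList("1,Widget", "2,"), errorFile, db);
        System.out.println("job ok: " + ok + ", errors: " + errorFile + ", db: " + db);
    }
}
```

In Spring Batch terms, each method would roughly correspond to a chunk-oriented step, with the second step reached only through a conditional transition from the first.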

How does Trello store generated actions from updates to other documents (boards, cards) in MongoDB without atomic transactions?

I'm developing a single page web app that will use a NoSQL Document Database (like MongoDB) and I want to generate events when I make a change to my entities.
Since most of these databases support transactions only on a document level (MongoDB just added ACID support), there is no good way to store changes in one document and then store events from those changes to other documents.
Let's say for example that I have a collection 'Events' and a collection 'Cards' like Trello does. When I make a change to the description of a card from the 'Cards' collection, an event 'CardDescriptionChanged' should be generated.
The problem is that if there is a crash or some error between saving the changes to the 'Cards' collection and adding the event in the 'Events' collection this event will not be persisted and I don't want that.
I've done some research on this issue and most people would suggest that one of several approaches can be used:
Do not use MongoDB, use SQL database instead (I don't want that)
Use Event Sourcing. (This introduces complexity, and I want to clear older events at some point, so I don't want to keep all events stored. I know that I can use snapshots and delete events older than the snapshot point, but that solution adds complexity.)
Since errors of this nature probably won't happen too often, just ignore them and risk having events that won't be saved (I don't want that either)
Use an event/command/action processor. Store commands/action like 'ChangeCardDescription' and use a Processor that will process them and update the entities.
I have considered option 4, but a couple of questions arise:
How do I manage concurrency?
I can queue all commands for the same entity (like a card or a board) and make sure that they are processed sequentially, while events for different entities (different cards) can be processed in parallel. Then I can use processed commands as events. One problem here is that changes to an entity may generate several events that may not correspond to a single command. I will have to break down to very fine-grained commands all user actions so I can then translate them to events.
Error reporting and error handling.
If this process is asynchronous, I have to manage error reporting to the client. And also I have to remove or mark commands that failed.
I still have the problem with marking the commands as processed, as there are no transactions. I know I have to make processing of commands idempotent to resolve this problem.
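The per-entity ordering described above (sequential for one card, parallel across cards) could be sketched in plain Java as follows. This is only an illustration under assumptions: `PerEntityCommandQueue` and `submit` are hypothetical names, commands are plain `Runnable`s, and the map of queue tails is never cleaned up, which a real implementation would have to handle.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;

class PerEntityCommandQueue {

    // Last scheduled command per entity id; new commands are chained behind it.
    private final Map<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();
    private final Executor pool;

    PerEntityCommandQueue(Executor pool) {
        this.pool = pool;
    }

    // Commands for the same entity run one after another, in submission order.
    // Commands for different entities can run in parallel on the pool.
    CompletableFuture<Void> submit(String entityId, Runnable command) {
        return tails.compute(entityId, (id, tail) -> {
            CompletableFuture<Void> previous =
                    (tail == null) ? CompletableFuture.<Void>completedFuture(null) : tail;
            // exceptionally(...) keeps the chain alive if an earlier command failed.
            return previous.exceptionally(ex -> null).thenRunAsync(command, pool);
        });
    }

    public static void main(String[] args) {
        PerEntityCommandQueue queue = new PerEntityCommandQueue(ForkJoinPool.commonPool());
        CompletableFuture<Void> a = queue.submit("card-1", () -> System.out.println("card-1: change description"));
        CompletableFuture<Void> b = queue.submit("card-1", () -> System.out.println("card-1: move to list"));
        CompletableFuture<Void> c = queue.submit("card-2", () -> System.out.println("card-2: archive"));
        CompletableFuture.allOf(a, b, c).join();
    }
}
```

The future returned by `submit` completes exceptionally if the command throws, which gives the caller a hook for the error reporting concern mentioned above.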
Since Trello used MongoDB and generates actions ('DeleteCardAction', 'CreateCardAction') with changes to entities (Cards, Boards..) I was wondering how do they solve this problem?
Create a new collection called FutureUpdates. Write planned updates to the FutureUpdates collection with a single document defining the changes you plan to make to cards and the events you plan to generate. This insert will be atomic.
Now take a ChangeStream of the FutureUpdates collection; this will give you the stream of updates you need to make. Take each doc from the change stream and apply the updates. Finally, update the doc in FutureUpdates to mark it as complete. Again, this update will be atomic.
When you apply the updates to Events and Cards make sure to include the objectID of the doc used to create the update in FutureUpdates.
Now if the program crashes after inserting the update in FutureUpdates you can check the Events and Cards collections for the existence of records containing the objectID of the update. If they are not present then you can reapply the missing updates.
If the updates have been applied but the FutureUpdate doc is not marked as complete we can update that during recovery to complete the process.
Effectively you are continuously atomically updating a doc for each change in FutureUpdates to track progress. Once an update is complete you can archive the old docs or just delete them.
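The recovery logic described above can be sketched without a database. The following plain Java model is an assumption-laden illustration, not MongoDB driver code: `FutureUpdatesSketch`, `PlannedUpdate`, `plan`, `apply` and `recover` are hypothetical names, the maps stand in for the FutureUpdates, Cards and Events collections, and the `crashAfterCardWrite` flag simulates a crash between the two writes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class FutureUpdatesSketch {

    // One planned change: the card edit plus the event it should generate.
    static final class PlannedUpdate {
        final String id;          // plays the role of the objectID in FutureUpdates
        final String cardId;
        final String newDescription;
        final String eventType;
        boolean complete;

        PlannedUpdate(String id, String cardId, String newDescription, String eventType) {
            this.id = id;
            this.cardId = cardId;
            this.newDescription = newDescription;
            this.eventType = eventType;
        }
    }

    final Map<String, PlannedUpdate> futureUpdates = new LinkedHashMap<>();
    final Map<String, String> cardDescriptions = new HashMap<>();
    final Map<String, Set<String>> appliedToCard = new HashMap<>(); // cardId -> update ids
    final Map<String, String> events = new LinkedHashMap<>();       // update id -> event type

    // Single insert describing everything that should happen.
    PlannedUpdate plan(String id, String cardId, String description, String eventType) {
        PlannedUpdate update = new PlannedUpdate(id, cardId, description, eventType);
        futureUpdates.put(id, update);
        return update;
    }

    // Idempotent: each write is skipped if it already carries this update's id.
    void apply(PlannedUpdate u, boolean crashAfterCardWrite) {
        Set<String> applied = appliedToCard.computeIfAbsent(u.cardId, k -> new HashSet<>());
        if (applied.add(u.id)) {
            cardDescriptions.put(u.cardId, u.newDescription);
        }
        if (crashAfterCardWrite) {
            return; // simulated crash before the event is stored
        }
        events.putIfAbsent(u.id, u.eventType);
        u.complete = true;
    }

    // On restart, re-apply whatever is not yet marked complete.
    void recover() {
        for (PlannedUpdate u : futureUpdates.values()) {
            if (!u.complete) apply(u, false);
        }
    }

    public static void main(String[] args) {
        FutureUpdatesSketch store = new FutureUpdatesSketch();
        PlannedUpdate u = store.plan("u1", "card-1", "New text", "CardDescriptionChanged");
        store.apply(u, true); // card updated, event missing
        store.recover();      // event added, update marked complete
        System.out.println(store.events + " complete=" + u.complete);
    }
}
```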

CQRS and Passing Data

Suppose I have an aggregate containing some data and when it reaches a certain state, I'd like to take all that state and pass it to some outside service. For argument and simplicity's sake, lets just say it is an aggregate that has a list and when all items in that list are checked off, I'd like to send the entire state to some outside service. Now when I'm handling the command for checking off the last item in the list, I'll know that I'm at the end but it doesn't seem correct to send it to the outside system from the processing of the command. So given this scenario what is the recommended approach if the outside system requires all of the state of the aggregate. Should the outside system build its own copy of the data based on the aggregate events or is there some better approach?
Should the outside system build its own copy of the data based on the aggregate events.
Probably not -- it's almost never a good idea to share the responsibility of rehydrating an aggregate from its history. The service that owns the object should be responsible for rehydration.
First key idea to understand is when in the flow the call to the outside service should happen.
First, the domain model processes the command arguments, computing the update to the event history, including the ChecklistCompleted event.
The application takes that history, and saves it to the book of record
The transaction completes successfully.
At this point, the application knows that the operation was successful, but the caller doesn't. So the usual answer is to be thinking of an asynchronous operation that will do the rest of the work.
Possibility one: the application takes the history that it just saved, and uses that history to schedule a task to rehydrate a read-only copy of the aggregate state, and then send that state to the external service.
Possibility two: you ditch the copy of the history that you have now, and fire off an asynchronous task that has enough information to load its own copy of the history from the book of record.
There are at least three ways that you might do this. First, you could have the command schedule the task as before.
Second, you could have a event handler listening for ChecklistCompleted events in the book of record, and have that handler schedule the task.
Third, you could read the ChecklistCompleted event from the book of record, and publish a representation of that event to a shared bus, and let the handler in the external service call you back for a copy of the state.
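As a sketch of the second variant (a handler reacting to ChecklistCompleted and scheduling an asynchronous task that loads its own copy of the history), here is a small plain Java model. Everything except the ChecklistCompleted event name is an assumption made for illustration: the class and method names, the string-encoded events `ItemAdded:<name>` and `ItemChecked:<name>`, and the `Consumer` standing in for the external service.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

class ChecklistCompletedSketch {

    private static final String ADDED = "ItemAdded:";
    private static final String CHECKED = "ItemChecked:";

    // Book of record: aggregate id -> ordered event history.
    private final Map<String, List<String>> bookOfRecord = new ConcurrentHashMap<>();

    void append(String aggregateId, String event) {
        bookOfRecord.computeIfAbsent(aggregateId, k -> new CopyOnWriteArrayList<>()).add(event);
    }

    // Rehydrates a read-only view (item -> checked) from the stored history.
    // This logic stays in the service that owns the aggregate.
    Map<String, Boolean> rehydrate(String aggregateId) {
        Map<String, Boolean> state = new LinkedHashMap<>();
        List<String> history = bookOfRecord.getOrDefault(aggregateId, Collections.<String>emptyList());
        for (String event : history) {
            if (event.startsWith(ADDED)) {
                state.put(event.substring(ADDED.length()), false);
            } else if (event.startsWith(CHECKED)) {
                state.put(event.substring(CHECKED.length()), true);
            }
        }
        return Collections.unmodifiableMap(state);
    }

    // Handler: invoked after the transaction has committed. It only reacts to
    // ChecklistCompleted and does the outbound call off the command path.
    CompletableFuture<Void> onEvent(String aggregateId, String event,
                                    Consumer<Map<String, Boolean>> externalService) {
        if (!"ChecklistCompleted".equals(event)) {
            return CompletableFuture.completedFuture(null);
        }
        return CompletableFuture.runAsync(() -> externalService.accept(rehydrate(aggregateId)));
    }

    public static void main(String[] args) {
        ChecklistCompletedSketch app = new ChecklistCompletedSketch();
        app.append("list-1", "ItemAdded:milk");
        app.append("list-1", "ItemChecked:milk");
        app.append("list-1", "ChecklistCompleted");
        app.onEvent("list-1", "ChecklistCompleted",
                state -> System.out.println("sent to external service: " + state)).join();
    }
}
```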
I was under the impression that one bounded context should not reach out to get state from another bounded context but rather keep local copies of the data it needed.
From my experience, the key idea is that the services shouldn't block each other -- or more specifically, a call to service B should not block when service A is unavailable. Responding to events is fundamentally non blocking; does it really matter that we respond to an asynchronously delivered event by making an asynchronous blocking call?
What this buys you, however, is independent evolution of the two services - A broadcasts an event, B reacts to the event by calling A and asking for a representation of the aggregate that B understands, A -- being backwards compatible -- delivers the requested representation.
Compare this with requiring a new release of B every time the rehydration logic in A changes.
Udi Dahan raised a challenging idea - the notion that each piece of data belongs to a single technical authority. "Raw business data" should not be replicated between services.
A service is the technical authority for a specific business capability.
Any piece of data or rule must be owned by only one service.
So in Udi's approach, you'd start to investigate why B has any responsibility for data owned by A, and from there determine how to align that responsibility and the data into a single service. (Part of the trick: the physical view of a service can span process boundaries; in other words, a process may be composed from components that belong to more than one service).
Jeppe Cramon's series on microservices is nicely sourced, and touches on many of the points above.
You should never externalise your state. Reporting on that state is a function of the read side, as it produces reports and you'll need that data to call the service. The structure of your state is plastic, and you shouldn't have an external service that relies upon that structure; otherwise you'll have to update both in lockstep, which is a bad thing.
There is a blog that puts forward a strong argument that the process manager is the correct place to put this type of feature (calling an external service), because that's the appropriate place for orchestrating events.

Get list of executions filtered by parameter value

I am using Spring-batch 3.0.4 stable. While submitting a job I add some specific parameters to its execution, say, a tag. Jobs information is persisted in the DB.
Later on I will need to retrieve all the executions marked with a particular tag.
Currently I see 2 options:
Get all job instances with org.springframework.batch.core.explore.JobExplorer#findJobInstancesByJobName. For each instance get all available executions with org.springframework.batch.core.explore.JobExplorer#getJobExecutions. Filter the resulting collection of executions checking its JobParameters.
Write my own JdbcTemplate-based DAO implementation to run the select query.
While the former option seems pretty inefficient, the latter one suggests writing extra code to deal with the Spring-specific database tables structure.
Is there any option I am missing here?
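For reference, the filtering part of the first option amounts to something like the plain Java sketch below. It deliberately avoids Spring Batch types so that it stays self-contained: `Execution` is a hypothetical stand-in for a job execution together with its job parameters, and `idsWithParameter` is an illustrative helper, not a library method.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ExecutionsByTagSketch {

    // Hypothetical stand-in for a job execution and its parameters.
    static final class Execution {
        final long id;
        final Map<String, String> parameters;

        Execution(long id, Map<String, String> parameters) {
            this.id = id;
            this.parameters = parameters;
        }
    }

    // Keeps only the executions whose parameter `key` equals `value`.
    static List<Long> idsWithParameter(Collection<Execution> all, String key, String value) {
        return all.stream()
                .filter(e -> value.equals(e.parameters.get(key)))
                .map(e -> e.id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Execution> all = Arrays.asList(
                new Execution(1L, Collections.singletonMap("tag", "nightly")),
                new Execution(2L, Collections.singletonMap("tag", "adhoc")),
                new Execution(3L, Collections.singletonMap("tag", "nightly")));
        System.out.println(idsWithParameter(all, "tag", "nightly"));
    }
}
```

The cost concern from the question still applies: every execution has to be loaded before it can be filtered, whereas a query against the parameters table would push the filter into the database.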