Is there a spring batch pattern to persist the unaltered values before the ItemProcessor is called? - spring-batch

Desired step flow:
The expectation is that in the scope of a single XA transaction, the step will:
Select the partitioned data
Write the unaltered partitioned data to an audit table (redo log) using the JDBC BatchUpdate
Invoke the ItemProcessor on the unaltered data
Use a composite writer to batch-write additional operations to additional targets
The puzzle I am trying to solve is that the process needs to read the partition once only, and operate strictly within the bounds of that selection.
If there is no "wiretap" configuration to accomplish this, a solution I am considering is to implement the batch write to the redo log in the reader itself, but that just feels ugly, as if it defeats the purpose of the framework. If there is already a Spring Batch pattern for this, it would be much better.
We are using the Spring Batch XML definitions because they give a degree of control over bean scoping that the DSL does not permit.
Since Spring Integration is already part of this project for other functions, another option might be to refactor this as a Spring Integration flow.
(Edit 3: Notes about answers)
Both Artem and Mahmoud's answers are correct. I have fleshed out designs for both scenarios, far beyond what is shown on this question. I wish I could flag them both as correct, so make sure to upvote Artem's answer and give him due credit.
The advantage of Artem's answer is that, with a little refactoring of existing code, I could eliminate Spring Batch from the worker side of the architecture and reduce our code, simplifying some other operations that are not referenced in this question. The disadvantage is that it would require a much larger refactoring and a learning curve that is simply huge.
The biggest advantage to Mahmoud's answer is specific to my case: I can implement it with code paradigms that are already familiar to my team and are already core to the existing product.
Both should be nearly equally efficient. I would actually have to implement both to measure efficiency.
In the context of my project, I have elected to follow Mahmoud's answer because it requires the least changes to the current design and code paradigm.
(Edit: alternative flow build in Spring Integration)
Based on comments from @Artem Bilan, I have started a design to refactor to Spring Integration.
Not visible:
flows and flow elements are created dynamically at runtime by a rules engine, and destroyed when no longer in use
each flow encompasses a chain of readers that may or may not be JDBC
(Edit 2: Refinement of spring batch architecture)
Based on comments from Mahmoud Ben Hassine, there are Listener hooks we can leverage. The refined graphic shows the ItemProcessListener being used for both audit requirements, although this is not necessarily the final code we will use.
Any thoughts or advice is welcome.

Such audit requirements can be implemented using listeners. The ItemProcessListener, and the similar ItemReadListener and ItemWriteListener, are the injection points that allow you to plug custom code in before/after each phase of the chunk-oriented processing model.
So for your use case, I would keep it simple and use an ItemProcessListener#beforeProcess(T item) to persist/log the "unaltered" items.
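For illustration, a minimal sketch of such a listener is shown below. The audit table name and the item-to-parameters mapping are hypothetical placeholders, not taken from the question.

import java.util.function.Function;

import org.springframework.batch.core.ItemProcessListener;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: persist the unaltered item to an audit table before the ItemProcessor runs.
// The table name and the item-to-parameters mapping are hypothetical placeholders.
public class AuditItemProcessListener<I, O> implements ItemProcessListener<I, O> {

    private static final String INSERT_AUDIT =
            "INSERT INTO item_redo_log (payload) VALUES (?)"; // hypothetical table

    private final JdbcTemplate jdbcTemplate;
    private final Function<I, Object[]> auditRowMapper;

    public AuditItemProcessListener(JdbcTemplate jdbcTemplate, Function<I, Object[]> auditRowMapper) {
        this.jdbcTemplate = jdbcTemplate;
        this.auditRowMapper = auditRowMapper;
    }

    @Override
    public void beforeProcess(I item) {
        // The chunk's transaction is active here, so this write participates in
        // the same (XA) transaction as the reader, processor and writers.
        jdbcTemplate.update(INSERT_AUDIT, auditRowMapper.apply(item));
    }

    @Override
    public void afterProcess(I item, O result) {
        // no-op
    }

    @Override
    public void onProcessError(I item, Exception e) {
        // no-op
    }
}

The listener would then be registered on the step, for example via the <listeners> element in the XML step definition.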

You may implement such a scope with a Spring Integration @MessagingGateway, where its gateway method can be marked with @Transactional. As long as an integration flow subscribed to the request channel is built with DirectChannel in between endpoints, the whole process will be performed within that transaction.
See more in docs: https://docs.spring.io/spring-integration/docs/current/reference/html/messaging-endpoints.html#gateway
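As a rough sketch of what that could look like with the Java DSL (the gateway, channel name and placeholder handlers below are hypothetical; in the actual project the flows are built dynamically by a rules engine):

import java.util.List;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.IntegrationComponentScan;
import org.springframework.integration.annotation.MessagingGateway;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.transaction.annotation.Transactional;

// Sketch only: the gateway method opens the transaction; because the flow uses
// DirectChannels, every handler runs on the caller's thread inside that transaction.
@MessagingGateway(defaultRequestChannel = "partitionChannel")
interface PartitionGateway {

    @Transactional
    List<?> process(List<?> partition); // hypothetical payload type
}

@Configuration
@EnableIntegration
@IntegrationComponentScan
class PartitionFlowConfig {

    @Bean
    public IntegrationFlow partitionFlow() {
        return IntegrationFlows.from("partitionChannel")
                .<List<?>>handle((partition, headers) -> {
                    // 1) write the unaltered partition to the redo log here (placeholder)
                    return partition;
                })
                .<List<?>>handle((partition, headers) -> {
                    // 2) apply the business transformation and write to the
                    //    additional targets here (placeholder)
                    return partition;
                })
                .get();
    }
}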

One option to do this in Spring Batch would be to create your own ItemWriter that processes the items before writing them, similar to this...
import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;

public class ItemProcessorAndWriter<I, O> implements ItemWriter<I> {

    private final ItemProcessor<I, O> itemProcessor;
    private final ItemWriter<O> itemWriter;

    public ItemProcessorAndWriter(ItemProcessor<I, O> itemProcessor, ItemWriter<O> itemWriter) {
        this.itemProcessor = itemProcessor;
        this.itemWriter = itemWriter;
    }

    @Override
    public void write(List<? extends I> items) throws Exception {
        // Process each item first, then delegate the whole chunk to the target writer.
        List<O> processedItems = new ArrayList<>(items.size());
        for (I item : items) {
            processedItems.add(itemProcessor.process(item));
        }
        itemWriter.write(processedItems);
    }
}
You could then create a CompositeItemWriter that contains your audit writer first, followed by the ItemProcessorAndWriter that does the item processing and writing.
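A rough wiring sketch for that composite (bean names and generic types are illustrative; in an XML-based configuration the same structure would be expressed as a CompositeItemWriter bean with a delegates list):

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.CompositeItemWriter;

// Sketch only: the audit writer sees the unaltered items first, then the
// ItemProcessorAndWriter above processes them and writes the results.
public class AuditingWriterFactory {

    public static <I, O> CompositeItemWriter<I> auditingWriter(ItemWriter<I> auditWriter,
                                                               ItemProcessor<I, O> processor,
                                                               ItemWriter<O> targetWriter) {
        List<ItemWriter<? super I>> delegates = new ArrayList<>();
        delegates.add(auditWriter);                                          // 1) unaltered items -> redo log
        delegates.add(new ItemProcessorAndWriter<>(processor, targetWriter)); // 2) process + write
        CompositeItemWriter<I> composite = new CompositeItemWriter<>();
        composite.setDelegates(delegates);
        return composite;
    }
}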

Related

hexagonal architecture and transactions concept

I'm trying to get used to hexagonal architecture and can't figure out how to implement common practical problems that I have already solved with other approaches. I think my core problem is understanding the level of responsibility extracted to adapters and ports.
Reading articles on the web, everything looks fine with primitive examples like:
we have RepositoryInterface which can be implemented with mysql/txt/s3/nosql storage
or
we have NotificationSendingInterface and have email/sms/web push realizations
but those are very refined examples that simply separate the interface from the implementation details.
In practice, however, when coding a service in the domain model we usually know the interface+implementation guarantees more deeply.
As an illustrative example, I decided to ask about the storage+transaction pair.
How should the transaction concept for storage be implemented in hexagonal architecture?
Assume we have a simple CRUD service interface inside the domain level:
StorageRepoInterface
save(...)
update(...)
delete(...)
get(...)
and we want some kind of transaction guarantee while working with those methods, e.g. delete+save in one transaction.
How should it be designed and implemented according to the hexagonal concept?
Should it be implemented with some external coordination interface such as TransactionalOperation? If yes, then in general TransactionalOperation must know how to implement the transaction guarantee when working with all implementations of StorageRepoInterface (maybe within an additional transaction-oriented operation interface).
If not, then it seems there should be explicit transaction guarantees from StorageRepoInterface at the domain level (inside the hexagon), with additional methods?
Either way, it does not look as "isolated and interface-based" as stated.
Can someone point me to how to change my mindset for such situations, or to where I can read more?
Thanks in advance.
In Hex Arch, driver ports are the API of the application, the use case boundary. Use cases are transactional. So you have to control the transactionality at the driver port methods. You enclose every method in a transaction.
If you use Spring, you could use declarative transactions (the @Transactional annotation).
Another way is to explicitly open a DB transaction before the execution of the method, and to close (commit / rollback) it after the method.
A useful pattern for applying transactionality is the command bus, wrapping it with a decorator which encloses the command in a transaction.
Transactions are infrastructure, so you should have a driven port and an adapter implementing the port.
The implementation must use the same db context (entity manager) used by persistence adapters (repositories).
Vaughn Vernon talks about this topic in the "Managing transactions" section (pages 432-437) of his book "Implementing DDD".
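A rough Java sketch of that decorator idea (the interfaces below are illustrative, not taken from any particular framework):

// Sketch only: the command bus wraps every handler in a transaction, so the use
// case boundary stays transactional without the domain knowing about the database.
interface Command {}

interface CommandHandler<C extends Command> {
    void handle(C command);
}

// Driven port for transactions; the adapter lives in the infrastructure layer.
interface TransactionManagerPort {
    void begin();
    void commit();
    void rollback();
}

final class TransactionalCommandHandler<C extends Command> implements CommandHandler<C> {

    private final CommandHandler<C> delegate;
    private final TransactionManagerPort transactions;

    TransactionalCommandHandler(CommandHandler<C> delegate, TransactionManagerPort transactions) {
        this.delegate = delegate;
        this.transactions = transactions;
    }

    @Override
    public void handle(C command) {
        transactions.begin();
        try {
            delegate.handle(command); // the use case itself
            transactions.commit();
        } catch (RuntimeException e) {
            transactions.rollback();
            throw e;
        }
    }
}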
Instead of using the command bus pattern, you could simply inject a TransactionPort into your command handler (defined at the domain level).
The TransactionPort would have two methods (start and commit).
The TransactionAdapter would be your custom implementation (defined at infrastructure level).
Then you could do something like:
this.transactionPort.start();
// do your stuff
this.transactionPort.commit();
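A possible adapter sketch for that port, assuming a Spring PlatformTransactionManager in the infrastructure layer (the names are illustrative):

import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.TransactionStatus;
import org.springframework.transaction.support.DefaultTransactionDefinition;

// Domain-level port with the two methods mentioned above.
interface TransactionPort {
    void start();
    void commit();
}

// Infrastructure-level adapter. Note: keeping the status in a field makes this
// adapter stateful and not thread-safe; a real implementation would scope it
// per use case execution.
final class SpringTransactionAdapter implements TransactionPort {

    private final PlatformTransactionManager transactionManager;
    private TransactionStatus currentStatus;

    SpringTransactionAdapter(PlatformTransactionManager transactionManager) {
        this.transactionManager = transactionManager;
    }

    @Override
    public void start() {
        this.currentStatus = transactionManager.getTransaction(new DefaultTransactionDefinition());
    }

    @Override
    public void commit() {
        transactionManager.commit(this.currentStatus);
    }
}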

Composite unique constraint on business fields with Axon

We leverage the AxonIQ Framework in our system. We have faced a problem implementing a composite unique constraint based on aggregate business fields.
Consider the following aggregate:
@Aggregate
public class PersonnelCardAggregate {

    @AggregateIdentifier
    private UUID personnelCardId;

    private String personnelNumber;
    private Boolean archived;
}
We want to avoid personnelNumber duplicates in the scope of NOT-archived (archived == false) records. At the same time personnelNumber duplicates may exist in the scope of archived records.
A Query Side check seems NOT to be an option. Taking into account the eventually consistent nature of our system, more than one creation request with the same personnelNumber may exist at the same time, and the Query Side may lag behind.
What would the solution be?
What you're asking is an issue that can occur as soon as you start implementing an application along the CQRS paradigm and DDD modeling techniques.
The PersonnelCardAggregate in your scenario maintains the consistency boundary of a single "Personnel Card". You are, however, looking to expand this scope to achieve a uniqueness constraint among all Personnel Cards in your system.
I feel that this blog explains the problem of "Set Based Consistency Validation" you are encountering quite nicely.
I will not reiterate his entire blog, but he sums it up as having four options for resolving the problem:
Introduce locking, transactions and database constraints for your Personnel Card
Use a hybrid locking field prior to issuing your command
Rely on the eventually consistent Query Model
Re-examine the domain model
To be fair, option 1 won't do if you're using the event-driven approach to updating your Command and Query Model.
Option 3 has been pushed back by you in the original question.
Option 4 is something I cannot deduce for you given that I am not a domain expert, but I am guessing that the PersonnelCardAggregate does not belong to a larger encapsulating Aggregate Root. Maybe the business constraint you've stated, i.e. the option to reuse personnelNumbers, could be dropped or adjusted? Like I said though, I cannot state this as a factual answer for you, as I am not the domain expert.
That leaves option 2, which in my eyes would be the most pragmatic approach too.
I feel this would require a cache at your command-dispatching side to deal with quick successions of commands and resolve the eventual consistency issue. To catch the case where an update still comes through accidentally, I'd introduce some form of Event Handler that (1) knows the entire set of "PersonnelCards" from a personnelNumber/archived point of view and (2) can react to a faulty introduction by dispatching a compensating action.
You'd thus introduce some business logic on the event handling side of your application, which I'd strongly recommend segregating from the application part which updates your query models (as the use cases are entirely different).
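As one rough illustration of the command-dispatching-side check described above (the command, the lookup and its wiring are hypothetical; this is only one way to approach option 2 with Axon):

import java.util.List;
import java.util.function.BiFunction;

import org.axonframework.commandhandling.CommandMessage;
import org.axonframework.messaging.MessageDispatchInterceptor;

// Hypothetical command carrying the business fields that must stay unique.
class CreatePersonnelCardCommand {
    final String personnelCardId;
    final String personnelNumber;

    CreatePersonnelCardCommand(String personnelCardId, String personnelNumber) {
        this.personnelCardId = personnelCardId;
        this.personnelNumber = personnelNumber;
    }
}

// Hypothetical small lookup on the command side (personnelNumber -> active card),
// kept up to date by an event handler. It is not the query model.
interface ActivePersonnelNumberLookup {
    boolean existsActive(String personnelNumber);
}

// Dispatch interceptor that rejects obvious duplicates before the command reaches the aggregate.
class UniquePersonnelNumberInterceptor implements MessageDispatchInterceptor<CommandMessage<?>> {

    private final ActivePersonnelNumberLookup lookup;

    UniquePersonnelNumberInterceptor(ActivePersonnelNumberLookup lookup) {
        this.lookup = lookup;
    }

    @Override
    public BiFunction<Integer, CommandMessage<?>, CommandMessage<?>> handle(
            List<? extends CommandMessage<?>> messages) {
        return (index, command) -> {
            Object payload = command.getPayload();
            if (payload instanceof CreatePersonnelCardCommand) {
                CreatePersonnelCardCommand create = (CreatePersonnelCardCommand) payload;
                if (lookup.existsActive(create.personnelNumber)) {
                    throw new IllegalStateException(
                            "personnelNumber already in use by a non-archived card: " + create.personnelNumber);
                }
            }
            return command;
        };
    }
}

The interceptor would be registered on the command bus (CommandBus#registerDispatchInterceptor); the compensating event handler would remain the safety net for duplicates that still slip through.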
Concluding though, this is a difficult topic with several ways around it.
It's not so much an Axon specific problem by the way, but more an occurrence of modeling your application through DDD and CQRS.

Any alternative to BPMN and DMN notations for describing business logic?

I am looking for a tool capable of creating complex data-manipulation processes which can be more or less easily modified by people who do not write code.
For example, my task is:
1. fetch data from sourceA
2.1 if data is full - filter it by condition 45
2.2 if data is not full - fetch additional data from source B
3. if result passes validation - return 1, otherwise 0
This should be described in some readable manner; the best option is if one can modify this process in some UI tool.
What are the requirements?
Each process consists of two parts: steps, and a way to arrange them in a sequence.
(1)
The process in each step should be able to
1. emit commands for fetching some data from data sources and inserting it into the process context
2. filter, enrich, and transform the datasets obtained
Thus each step of this process should be described with some more or less simple DSL.
(2)
The selection of which step to go to next, i.e. the sequence of steps, should be described by some visual tool or, again, as in (1), with some simple DSL.
Can you advise something for this (from my point of view, typical) task?
Meanwhile, here are my own ideas.
The first thing that comes to mind is BPMN combined with Drools.
For steps I may use DRL rules: they can only do basic data manipulation themselves, but I can call Java functions from them if I need something complicated.
For the sequence of steps I may use a standard BPMN diagram.
Maybe there is something better?
The combination of BPMN with DMN would indeed allow you to describe, with these visual standards, the execution of the process and the decision logic to be applied, in order to achieve what is described in the "For example" paragraph.
In order to make it fully accessible to the business people, the BPMN task for fetching the data, or performing any interaction with an external system, should be prepared in advance and made available during the composition of the BPMN/DMN diagrams.
As an alternative to the BPMN+DMN combination, you can look into Fuse or Fuse Online. It cannot describe all the semantics of the BPMN+DMN combination, but with Fuse Online, for instance, you can implement the steps you described in the "For example" paragraph in a fully visual way.

Spring Batch: dynamic composite reader/processor/writer

I've seen this (2010) and this (SO, 2012), but still have not got the answer I need...
Is there an option in Spring Batch to have a dynamic composite reader/processor/writer?
The idea is to have the ability to replace a processor at runtime and, in the case of multiple processors (AKA a composite processor), to have the option to add/remove/replace/change the order of processors. As mentioned, the same goes for readers/writers.
I thought of something like reading the processor list from the DB (using a cache?), where the items (bean names) can be changed. Does this make sense?
EDIT - why do I need this?
There are cases where I use processors as "filters", and it may occur that the business (the client) changes the requirements (yes, it is very annoying) and asks to switch among filters (change the priority).
Another use case is having multiple readers to get the data from different data warehouses, and again - the client changes the warehouse from time to time (integration phase), and I do not want my app to be restarted each and every time. There are many other use cases, of course, plus this.
Thanks
I've started working on this project:
https://github.com/OhadR/spring-batch-dynamic-composite
that implements the requirements in the question above. If someone wants to contribute - feel free!
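As a rough sketch of the idea from the question (resolving the delegate processors by bean name on each call, so the list stored in the DB/cache can change at runtime; DelegateNamesProvider is a hypothetical abstraction over that lookup):

import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.BeanFactory;

// Hypothetical abstraction over the DB/cache that holds the ordered list of
// processor bean names for the current configuration.
interface DelegateNamesProvider {
    List<String> currentProcessorBeanNames();
}

// Sketch of a composite processor that resolves its delegates by bean name on each
// call, so the chain can be re-ordered or replaced at runtime without a restart.
public class DynamicCompositeItemProcessor<T> implements ItemProcessor<T, T> {

    private final BeanFactory beanFactory;
    private final DelegateNamesProvider delegateNames;

    public DynamicCompositeItemProcessor(BeanFactory beanFactory, DelegateNamesProvider delegateNames) {
        this.beanFactory = beanFactory;
        this.delegateNames = delegateNames;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T process(T item) throws Exception {
        T current = item;
        for (String beanName : delegateNames.currentProcessorBeanNames()) {
            ItemProcessor<T, T> delegate = beanFactory.getBean(beanName, ItemProcessor.class);
            current = delegate.process(current);
            if (current == null) {
                return null; // a delegate filtered the item out
            }
        }
        return current;
    }
}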

What is the best practice in EF Core for using parallel async calls with an Injected DbContext?

I have a .NET Core 1.1 API with EF Core 1.1 and using Microsoft's vanilla setup of using Dependency Injection to provide the DbContext to my services. (Reference: https://learn.microsoft.com/en-us/aspnet/core/data/ef-mvc/intro#register-the-context-with-dependency-injection)
Now, I am looking into parallelizing database reads as an optimization using WhenAll.
So instead of:
var result1 = await _dbContext.TableModel1.FirstOrDefaultAsync(x => x.SomeId == AnId);
var result2 = await _dbContext.TableModel2.FirstOrDefaultAsync(x => x.SomeOtherProp == AProp);
I use:
var repositoryTask1 = _dbContext.TableModel1.FirstOrDefaultAsync(x => x.SomeId == AnId);
var repositoryTask2 = _dbContext.TableModel2.FirstOrDefaultAsync(x => x.SomeOtherProp == AProp);
(var result1, var result2) = await (repositoryTask1, repositoryTask2).WhenAll();
This is all well and good, until I use the same strategy outside of these DB Repository access classes and call these same methods with WhenAll in my controller across multiple services:
var serviceTask1 = _service1.GetSomethingsFromDb(Id);
var serviceTask2 = _service2.GetSomeMoreThingsFromDb(Id);
(var dataForController1, var dataForController2) = await (serviceTask1, serviceTask2).WhenAll();
Now when I call this from my controller, randomly I will get concurrency errors like:
System.InvalidOperationException: ExecuteReader requires an open and available Connection. The connection's current state is closed.
The reason, I believe, is that sometimes these threads try to access the same tables at the same time. I know that this is by design in EF Core, and if I wanted to I could create a new DbContext every time, but I am trying to see if there is a workaround. That's when I found this good post by Mehdi El Gueddari: http://mehdi.me/ambient-dbcontext-in-ef6/
In which he acknowledges this limitation:
an injected DbContext prevents you from being able to introduce multi-threading or any sort of parallel execution flows in your services.
And offers a custom workaround with DbContextScope.
However, he presents a caveat even with DbContextScope in that it won't work in parallel (what I'm trying to do above):
if you attempt to start multiple parallel tasks within the context of a DbContextScope (e.g. by creating multiple threads or multiple TPL Task), you will get into big trouble. This is because the ambient DbContextScope will flow through all the threads your parallel tasks are using.
His final point here leads me to my question:
In general, parallelizing database access within a single business transaction has little to no benefits and only adds significant complexity. Any parallel operation performed within the context of a business transaction should not access the database.
Should I not be using WhenAll in this case in my controllers and stick with awaiting one call at a time? Or is dependency injection of the DbContext the more fundamental problem here, meaning a new one should instead be created/supplied every time by some kind of factory?
Using any context.XyzAsync() method is only useful if you either await the called method or return control to a calling thread that doesn't have the context in its scope.
A DbContext instance isn't thread-safe: you should never ever use it in parallel threads. Which means, just to be sure, never use it in multiple threads at all, even if they don't run in parallel. Don't try to work around it.
If for some reason you want to run parallel database operations (and think you can avoid deadlocks, concurrency conflicts etc.), make sure each one has its own DbContext instance. Note however, that parallelization is mainly useful for CPU-bound processes, not IO-bound processes like database interaction. Maybe you can benefit from parallel independent read operations but I would certainly never execute parallel write processes. Apart from deadlocks etc. it also makes it much harder to run all operations in one transaction.
In ASP.Net core you'd generally use the context-per-request pattern (ServiceLifetime.Scoped, see here), but even that can't keep you from transferring the context to multiple threads. In the end it's only the programmer who can prevent that.
If you're worried about the performance costs of creating new contexts all the time: don't be. Creating a context is a light-weight operation, because the underlying model (store model, conceptual model + mappings between them) is created once and then stored in the application domain. Also, a new context doesn't create a physical connection to the database. All ASP.Net database operations run through the connection pool that manages a pool of physical connections.
If all this implies that you have to reconfigure your DI to align with best practices, so be it. If your current setup passes contexts to multiple threads there has been a poor design decision in the past. Resist the temptation to postpone inevitable refactoring by work-arounds. The only work-around is to de-parallelize your code, so in the end it may even be slower than if you redesign your DI and code to adhere to context per thread.
It came to the point where really the only way to answer the debate was to do a performance/load test to get comparable, empirical, statistical evidence so I could settle this once and for all.
Here is what I tested:
Cloud Load test with VSTS @ 200 users max for 4 minutes on a Standard Azure webapp.
Test #1: 1 API call with Dependency Injection of the DbContext and async/await for each service.
Results for Test #1:
Test #2: 1 API call with new creation of the DbContext within each service method call and using parallel thread execution with WhenAll.
Results for Test #2:
Conclusion:
For those who doubt the results, I ran these tests several times with varying user loads, and the averages were basically the same every time.
The performance gain from parallel processing is, in my opinion, insignificant, and it does not justify abandoning Dependency Injection, which would create development overhead/maintenance debt, potential for bugs if handled wrong, and a departure from Microsoft's official recommendations.
One more thing to note: as you can see, there were actually a few failed requests with the WhenAll strategy, even when ensuring a new context is created every time. I am not sure of the reason for this, but I would much prefer no 500 errors over a 10 ms performance gain.