Spring Batch multiple datasources and ChainedTransactionManager risks - spring-batch

I am studying the feasibility of a Spring Batch job that uses two datasources: a SQL datasource for the Spring Batch metadata and a MongoDB datasource (used transactionally) for the business data. The transactional aspect raises several questions here.
The topic Spring batch with MongoDB and transactions and its related resources answer a number of my questions.
The answer mentions the use of Spring's JtaTransactionManager to manage distributed transactions on the two datasources.
This technique uses the 2PC protocol and, if I understood correctly, is also the most robust solution: https://www.infoworld.com/article/2077963/distributed-transactions-in-spring--with-and-without-xa.html?page=2
On the other hand, I found some resources about Spring's ChainedTransactionManager. This technique uses the best-effort 1PC protocol. This solution is less robust: if I understand correctly, the system can be left in an inconsistent state when part of the infrastructure fails (a network failure, for example).
The ChainedTransactionManager has the advantage of being easier to implement and offers better performance, but I saw that it has been deprecated: https://github.com/spring-projects/spring-data-commons/issues/2232.
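For reference, a rough sketch of the (now deprecated) configuration being discussed, assuming a DataSourceTransactionManager for the batch metadata and a MongoTransactionManager for the business data are already defined as beans; the class and bean names here are illustrative, not taken from any documentation.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.MongoTransactionManager;
import org.springframework.data.transaction.ChainedTransactionManager;
import org.springframework.jdbc.datasource.DataSourceTransactionManager;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ChainedTxConfig {

    @Bean
    public PlatformTransactionManager chainedTransactionManager(
            DataSourceTransactionManager sqlTxManager,    // Spring Batch metadata (SQL)
            MongoTransactionManager mongoTxManager) {      // business data (MongoDB)
        // Transactions are started in the given order and committed in reverse order,
        // so the resource most likely to fail should come last. The best-effort 1PC
        // risk is exactly the window between the two commits.
        return new ChainedTransactionManager(sqlTxManager, mongoTxManager);
    }
}
```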
What are the concrete risks of using the ChainedTransactionManager in a Spring Batch job? In case of an error, can I have inconsistencies between the Spring batch metadata and the business data in Mongo?
I imagine there are also considerations to take into account with retry or chunk-skip strategies?
Thanks a lot for your help.

In case of an error, can I have inconsistencies between the Spring batch metadata and the business data in Mongo?
Yes, that is the risk you should be aware of.
A common technique to avoid that is to disable state management and use the process indicator pattern. You can find an example here.
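For illustration, a minimal sketch of the process indicator idea with Spring Data MongoDB and a Spring Batch 4-style ItemWriter; the collection name "inputDocuments" and the "processed" flag are assumptions, not from the original answer. The reader is configured to select only documents where "processed" is false, and the writer flips the flag as part of the business write, so a restarted job simply re-selects whatever is still unflagged instead of relying on Spring Batch restart metadata.

```java
import java.util.List;

import org.bson.Document;
import org.springframework.batch.item.ItemWriter;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

public class ProcessIndicatorWriter implements ItemWriter<Document> {

    private final MongoTemplate mongoTemplate;

    public ProcessIndicatorWriter(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    @Override
    public void write(List<? extends Document> items) {
        for (Document item : items) {
            // ... perform the business write for 'item' here ...

            // Then flag the source document so it is not picked up again after a restart.
            mongoTemplate.updateFirst(
                    Query.query(Criteria.where("_id").is(item.get("_id"))),
                    Update.update("processed", true),
                    "inputDocuments");
        }
    }
}
```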

Related

Query vs Transaction

In this picture, we can see that Saga is the pattern that implements transactions and CQRS implements queries. As far as I know, a transaction is a set of queries that follows the ACID properties.
So can I consider CQRS as an advanced version of saga which increases the speed of reads?
Note that this diagram is not meant to explain what Sagas and CQRS are. In fact, looked at this way it is quite confusing. What this diagram is telling you is which patterns you can use to read and write data that spans multiple microservices. It is saying that in order to write data (somehow transactionally) across multiple microservices you can use Sagas, and in order to read data which belongs to multiple microservices you can use CQRS. But that doesn't mean that Sagas and CQRS have anything in common. They are two different patterns that solve completely different problems (reads and writes). To make an analogy, it's like saying that to make pizzas (write) you can use an oven and to view the pizza menu (read) you can use a tablet.
On the specific patterns:
Sagas: you can see them as a process manager or state machine (a minimal sketch follows after the next point). Note that they do not implement transactions in the RDBMS sense. Basically, they allow you to create a process that will take care of telling each microservice to do a write operation, and if one of the operations fails, it will take care of telling the other microservices to roll back (or compensate) the action that they did. So, these "transactions" won't be atomic, because while the process is running some microservices will have already modified the data and others won't. And it is not guaranteed that whatever has succeeded can successfully be rolled back or compensated.
CQRS (Command Query Responsibility Segregation): suggests the separation of Commands (writes) and Queries (reads). The reason for that is what I was saying before: reads and writes are two very different operations, so by separating them you can implement each with the patterns that fit its scenario best. The reason why CQRS is shown in your diagram as a solution for reading data that comes from multiple microservices is that one way of implementing queries is to listen to Domain Events coming from multiple microservices and store the information in a single database, so that when it's time to query the data, you can find it all in a single place. An alternative to this would be data composition, which would mean that when the query arrives, you submit queries to multiple microservices at that moment and compose the response from their individual responses.
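To make the Saga point concrete, here is a minimal, hypothetical orchestration sketch; the services, method names and the order/payment/stock scenario are all invented for illustration and are not from the original answer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CreateOrderSaga {

    // Illustrative service interfaces; in a real system these would be remote calls.
    interface OrderService   { String createOrder(); void cancelOrder(String orderId); }
    interface PaymentService { String charge(String orderId); void refund(String paymentId); }
    interface StockService   { void reserveStock(String orderId); }

    private final OrderService orders;
    private final PaymentService payments;
    private final StockService stock;

    public CreateOrderSaga(OrderService orders, PaymentService payments, StockService stock) {
        this.orders = orders;
        this.payments = payments;
        this.stock = stock;
    }

    public void run() {
        Deque<Runnable> compensations = new ArrayDeque<>();
        try {
            String orderId = orders.createOrder();              // local write in service A
            compensations.push(() -> orders.cancelOrder(orderId));

            String paymentId = payments.charge(orderId);        // local write in service B
            compensations.push(() -> payments.refund(paymentId));

            stock.reserveStock(orderId);                        // local write in service C, may fail
        } catch (RuntimeException failure) {
            // Not atomic: the earlier writes were already visible to other readers;
            // we can only compensate, and a compensation itself may fail and need retries.
            while (!compensations.isEmpty()) {
                compensations.pop().run();
            }
            throw failure;
        }
    }
}
```

The catch block is the key point: compensation is not a rollback, and the intermediate states were already visible while the saga was running.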
So can I consider CQRS as an advanced version of saga which increases the speed of reads?
Personally I would not mix the concepts of CQRS and Sagas. I think this can really confuse you. Consider both patterns as two completely different things and try to understand them both independently.

How to write data to both NoSQL and RDBMS simultaneously and efficiently

Let's assume a setup where a mobile application is communicating with its backend via an API, and data resulting from this communication (e.g. JSON-based transaction writes, among others) is written into and read from a MongoDB instance.
Now since I would like to perform some heavy analytics on data stored in mongo, should I rather:
save data directly to the RDBMS at the same time as I write to Mongo (so the backend service calls Mongo and, after a successful write, also calls the RDBMS)
perform reads from Mongo (at some interval) and load the fresh data into the RDBMS
I am afraid that both of those solutions also require re-engineering the theoretically schema-less Mongo data to stay in constant agreement with the relations and schema in the RDBMS. Does it really require more planning for any document structure change in Mongo? I intuitively say yes, but I am looking for real-world examples. I hope my point is clear enough.
Maybe the CQRS pattern will be a good fit for you.
See: https://martinfowler.com/bliki/CQRS.html
You can use the RDBMS for the write model and Mongo for the read model.
After every write operation to the RDBMS, you should update your read model (the MongoDB document) based on the data from the write model.
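A minimal sketch of that suggestion, assuming Spring's JdbcTemplate for the write model and MongoTemplate for the read model; the order example, table and collection names are illustrative. Note that the two writes are not atomic, so you still need a retry or rebuild-the-read-model strategy if the Mongo update fails.

```java
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

public class OrderService {

    private final JdbcTemplate jdbc;      // write model (RDBMS)
    private final MongoTemplate mongo;    // read model (MongoDB)

    public OrderService(JdbcTemplate jdbc, MongoTemplate mongo) {
        this.jdbc = jdbc;
        this.mongo = mongo;
    }

    @Transactional
    public void renameOrder(long orderId, String newName) {
        // 1. Update the write model inside an RDBMS transaction.
        jdbc.update("UPDATE orders SET name = ? WHERE id = ?", newName, orderId);

        // 2. Project the change into the read model document (upsert).
        mongo.upsert(
                Query.query(Criteria.where("_id").is(orderId)),
                Update.update("name", newName),
                "order_read_model");
    }
}
```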
There are a few constraints that need to be understood before you embark on a solution here. The most relevant of these is latency. How out-of-date can your data be?
You are almost definitely looking at some kind of write-behind solution here, taking data out of MongoDB, and writing it to your data warehouse. The question is, how far behind your MongoDB can your data warehouse be? Many solutions based on an extract-transform-load model (ETL) work on a nightly basis, so as to minimize impact on the online system. Some can do the same on an hourly basis, but will have more potential impact on the live system.
Transaction-by-transaction support is likely not needed for an analysis system. You really want to avoid this if you can, as it puts far more load on both systems than is usually justified.
To answer your second question, yes, once you start depending on a schema, it needs to be stable. It doesn't have to be synced up with your target schema necessarily, but your ETL process will have to be aware of both, and will have to be modified any time either one materially changes. Being "schema-less" doesn't mean there isn't a schema; it just means that the schema is not enforced by the software, but by whatever depends on the system.
I think the option with the least engineering effort is to use a Kafka connector for MongoDB, such that the connector reads the MongoDB changes from the oplog in near-real time and writes the events to Kafka. From Kafka you can then write the data to a relational DB using stream processing.
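As a rough sketch of the consuming side under some assumptions: a plain Kafka consumer (kafka-clients 2.x API) that reads change events from a topic fed by the MongoDB connector and upserts them into a relational table over JDBC. The topic name, table, Postgres URL and the "key = document id, value = payload" message shape are all assumptions; a real pipeline would more likely use Kafka Connect's JDBC sink or Kafka Streams.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MongoChangesToRdbms {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "mongo-to-rdbms");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/analytics")) {

            consumer.subscribe(Collections.singletonList("mongo.app.transactions"));
            PreparedStatement upsert = con.prepareStatement(
                    "INSERT INTO transactions (id, payload) VALUES (?, ?) "
                  + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload");

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    upsert.setString(1, record.key());     // document id
                    upsert.setString(2, record.value());   // change event payload
                    upsert.executeUpdate();
                }
            }
        }
    }
}
```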
Dual write from the UI is not a good option, as it can introduce latency, complexity and operational overhead. What if the write to one DB fails?

Implement Lucene on Existing .NET / SQL Server stack with multiple webservers - store indexes in the database?

This article offered me a huge amount of information:
Implement Lucene on Existing .NET / SQL Server stack with multiple webservers
I'd like to follow on from this by asking about the notion of implementing a Lucene Directory that would persist the indexes to the database (in my case SQL Server); if anyone has a SWAG on the effort involved, that would be helpful.
I can see that the Java realm has this (e.g. Compass), and I'm really hoping the Stack Overflow folks might have considered this too? Any feedback would be appreciated.
My rookie thinking is that persisting indexes to the DB would be a way to solve the 'distribution' problem. So instead of implementing messaging (not possible for my software because of deployment restrictions) or scheduling (would be ok'ish - product folks always get jumpy when making decisions about how 'current' indexed data has to be), an IndexReader reopen() would efficiently update the index snapshot on whichever server node needs it.
Does this work if DB concurrency/load is not the heart of the problem being solved? Our use is focused on facilitating different data analysis on fields, which in turn facilitates different forms of matching.
Our deployment architecture/restrictions do not really allow us to insist on dedicated servers à la Solr, so this notion of distribution has been discounted by us.
How many index changes do you expect? When do you want to read in the index - on application startup? Putting the index into the database and "downloading" it on index creation might consume too many resources.
Not sure about your deployment restrictions, but can you have a shared file space for your machines (e.g. SMB/NFS share or similar, or even a SAN-based solution)?
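If a shared file space is an option, here is a minimal Java Lucene sketch of that idea (Lucene.NET exposes an analogous API); the mount path is an assumption. Each web server opens the same index from the share and refreshes its reader with openIfChanged(), the successor of the reopen() call mentioned in the question. Keep in mind that Lucene over NFS/SMB needs care with lock factories and with deleting files that other nodes still hold open.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SharedIndexSearcherHolder {

    private final Directory directory;
    private volatile DirectoryReader reader;

    public SharedIndexSearcherHolder(String sharedPath) throws Exception {
        this.directory = FSDirectory.open(Paths.get(sharedPath));   // e.g. an SMB/NFS mount
        this.reader = DirectoryReader.open(directory);
    }

    // Call this periodically (or before searches) to pick up index changes cheaply.
    public synchronized void refresh() throws Exception {
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
        if (newReader != null) {            // null means nothing has changed
            DirectoryReader old = reader;
            reader = newReader;
            old.close();
        }
    }

    public IndexSearcher searcher() {
        return new IndexSearcher(reader);
    }
}
```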
I would be a bit afraid of performance issues with the indexes in the DB. Have a look at Elasticsearch. It's the successor of Compass. It requires Java, but has a very neat REST interface for your .NET solution. Elasticsearch supports distribution and replication between several nodes. You can run it on the webserver nodes.
This solution will kill the performance of the index, since it has to be retrieved from the DB.
I would highly recommend moving to a newer/better alternative, that is Solr (using Solr.NET for example) or ElasticSearch (using NEST).
Solr is a high-level interface/manager for Lucene indexes, with simplified configuration, clustering, replication, etc. solved for you. The nice thing is that if you have some experience with Lucene, this will not be such a big step.
ElasticSearch is a different approach but it's not hard to learn.

Replacing HBase caching with a separate Caching product

Can the caching mechanism used in HBase be replaced with a different Caching product such as Memcache, Gemfire, GridGain etc. ?
HBase is very different from something like GridGain. Columnar mini-database (no transactions, etc.) vs. fully transactional in-memory data platform. Different goals, different approaches.
Replacing it is probably the right approach IF you have a proper use case.

Data Synchronization in a Distributed system

We have a REST-based application built on the Restlet framework which supports CRUD operations. It uses a local file to store the data.
Now the requirement is to deploy this application on multiple VMs, and any update operation in one VM needs to be propagated to the other application instances running on the other VMs.
Our idea to solve this was to send multiple POST messages (to all other applications) when an update operation happens in a given VM.
The assumption here is that each application has a list of the URLs of all other applications.
Is there a better way to solve this?
Consistency is a deep topic, and a hard thing to get right. The trouble comes when two nearly-simultaneous changes occur to the same data: conflicting updates can arrive in one order on one server, and in another order on another. This is a problem, since the two servers no longer agree on what the data is, and it isn't clear who is "right".
The short story: get your favorite RDBMS (for example, MySQL is popular) and have your app servers connect to it in what is called the three-tier model. Be sure to perform complex updates in transactions, which will provide an acceptable consistency model.
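To make "perform complex updates in transactions" concrete, here is a minimal JDBC sketch; the accounts table and the transfer scenario are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferDao {

    public void transfer(String url, long fromId, long toId, long amount) throws SQLException {
        try (Connection con = DriverManager.getConnection(url)) {
            con.setAutoCommit(false);               // start a transaction
            try (PreparedStatement debit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, amount);
                debit.setLong(2, fromId);
                debit.executeUpdate();

                credit.setLong(1, amount);
                credit.setLong(2, toId);
                credit.executeUpdate();

                con.commit();                       // both updates become visible together
            } catch (SQLException e) {
                con.rollback();                     // neither update is applied on failure
                throw e;
            }
        }
    }
}
```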
The long story: the three-tier model serves well for small-to-medium scale web sites/services. You will eventually find that the single database becomes the bottleneck. For services whose read traffic is substantially larger than write traffic, a common optimization is to create a single-master, many-slave database replication arrangement, where all writes go to the single master (required for consistency with non-distributed transactions), while the more common reads can go to any of the read slaves.
For services with evenly mixed read/write traffic, you may be better served by dropping some of the conveniences (and accompanying restrictions) that formal SQL provides and instead using one of the various "NoSQL" data stores that have recently emerged. Their relative merits and fitness for various problems is a deep topic in itself.
I can see 7 major options for now. You should find out more details and decide whether the facilities/trade-offs are appropriate for your purpose:
Perform the CRUD operation on a common RDBMS. Simplest and most consistent
Perform the CRUD operations on a common RDBMS that runs as a fast in-memory RDBMS, e.g. TimesTen from Oracle.
Perform the CRUD on a distributed cache, or your own home-cooked distributed hash table, which can guarantee synchronization, e.g. Hazelcast, Ehcache and others (see the sketch after this list).
Use a fast common state server like Redis/memcached, perform your updates on it in a synchronized manner, and write out the successful operations to a DB in a lazy manner if required.
Distribute your REST servers such that the CRUD operations on a single entity are only performed by a single master. Once this is done, the details about the changes can be communicated to everyone else using a reliable message bus or a distributed database (e.g. Postgres) that runs underneath and syncs all of your updates fairly fast.
Target eventual consistency and use a distributed data store like Cassandra, which lets you choose the consistency level you require.
Use a distributed consensus algorithm like Paxos or Raft, or (recommended) an implementation of one such as ZooKeeper or etcd respectively, and take ownership of the item you want to change from each REST server before you perform the CRUD operation. This might be a bit slow, though, and it is similar to what Cassandra can give you.
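A rough sketch of the distributed-cache option above, using Hazelcast's per-key IMap locks; the map name, the value type and the lazy write-out to a DB are illustrative assumptions (Hazelcast 3.x-style imports are shown; newer versions moved IMap to com.hazelcast.map.IMap).

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class SynchronizedEntityStore {

    private final IMap<String, String> entities;

    public SynchronizedEntityStore() {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();  // joins the cluster
        this.entities = hz.getMap("entities");
    }

    public void update(String id, String newValue) {
        entities.lock(id);            // cluster-wide lock on this key
        try {
            entities.put(id, newValue);
            // Optionally enqueue (id, newValue) here for a background writer that
            // persists successful operations to the database lazily.
        } finally {
            entities.unlock(id);
        }
    }
}
```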