Detect Failover of MongoDB-Cluster with Spring-Data-MongoDB - mongodb

Current Situation
we have a MongoDB-Cluster with 1 primary node and 2 secondary nodes
our Spring-Boot application is using the Spring-Data-MongoDB framework to read/write
from/to the cluster
Problem
in some circumstances the MongoDB cluster will change the primary node (for example
during the resizing of the cluster)
this fail-over phase will affect our Spring-Boot application
when some reads or writes are still ongoing and the fail-over happens, we receive an
exception, because the mongoDB-Server is not reachable anymore for our application
we have to deal with this state somehow
Questions
1. What is the best way to handle those fail-over states?
I've come across the following documentation:
retryable writes
retryable reads
Would it be sufficient to set the retryReads and retryWrites flags to true and specify the primary node and the secondary nodes in the connection URL? Or should we catch the connection exception (or alternatively listen for some fail-over event) and handle those cases ourselves?
We also have to deal with the following problem: what happens if only 50 % of some bulk-write data was successfully written to the primary node and the other 50 % was not? How should we handle those cases ideally?
this leads us to the second question ...
2. How to detect the fail-over event in Spring-Boot ?
For our application, a possible solution would be to automatically detect the failover state of the MongoDB cluster and then just trigger a restart of our Spring-Boot application.
Is there a way to listen to a specific MongoDB event via spring-data-mongodb in order to deal with the case that the primary node has changed?
alternatively: is there a specific exception we should catch and handle?
I hope somebody can help us here.
Thank you in advance!
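For the partial bulk-write concern in question 1, one way to reason about it is to make every write idempotent (keyed by a stable ID) and retry the whole batch when a transient error occurs. The sketch below is plain Java with an in-memory map standing in for the collection. `TransientStoreException`, `FlakyStore`, and `RetryingWriter` are hypothetical names for illustration, not MongoDB driver or Spring Data APIs, and a real implementation would probably add a backoff between attempts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a connection error raised during a failover. */
class TransientStoreException extends RuntimeException {
    TransientStoreException(String message) { super(message); }
}

/** In-memory stand-in for a collection that fails once, halfway through a batch. */
class FlakyStore {
    final Map<String, String> documents = new LinkedHashMap<>();
    private boolean failedOnce = false;

    /** Upsert by id: writing the same document twice still leaves one copy. */
    void upsertAll(Map<String, String> batch) {
        int written = 0;
        for (Map.Entry<String, String> entry : batch.entrySet()) {
            if (!failedOnce && written == batch.size() / 2) {
                failedOnce = true; // simulate the primary stepping down mid-batch
                throw new TransientStoreException("primary not reachable");
            }
            documents.put(entry.getKey(), entry.getValue());
            written++;
        }
    }
}

final class RetryingWriter {
    /**
     * Retries the whole batch after a transient failure and returns the number
     * of attempts used. This is only safe because each write is an idempotent upsert.
     */
    static int writeWithRetry(FlakyStore store, Map<String, String> batch, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                store.upsertAll(batch);
                return attempt;
            } catch (TransientStoreException ex) {
                if (attempt >= maxAttempts) {
                    throw ex; // give up and surface the error to the caller
                }
            }
        }
    }
}
```

Because the first half of the batch is simply written again on the second attempt, the half-applied state does not need to be detected or rolled back. This only holds when the writes are upserts or otherwise idempotent.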

Related

Hazelcast IMap Lock not working on kubernetes across different pods

We are using Hazelcast 4 to implement distributed locking across two pods on Kubernetes.
We have developed a distributed application, and two pods of the microservice have been created. Both instances are auto-discovered and form a cluster as members.
We are trying to use the IMap.lock(key) method to achieve distributed locking across the two pods. However, both pods acquire the lock at the same time and therefore execute the business logic concurrently. Also, Hazelcast Management Center shows zero locks for the created IMap.
Can you please help with how to synchronize IMap.lock(key) so that only a single pod holds the lock for a given key at any point in time?
Code Snippet:-
HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
IMap<Object, Object> map = client.getMap("customers");
// acquire the lock before the try block so that finally only releases a lock we hold
map.lock(key);
try {
    // business logic
} finally {
    map.unlock(key);
}
Can you create an MCVE and confirm the version of Hazelcast used, please?
There are tests for locks here that you can perhaps use as a way to simplify to determine where the fault lies.

Kubernetes - How do I prevent duplication of work when there are multiple replicas of a service with a watcher?

I'm trying to build an event exporter as a toy project. It has a watcher that gets informed by the Kubernetes API every time an event occurs, and as a simple case, let's assume that it wants to store the event in a database or something.
Having just one running instance is probably susceptible to failures, so ideally I'd like two. In this situation, the naive implementation would have both instances trying to store the event in the database, so it'd be duplicated.
What strategies are there to de-duplicate? Do I have to do it at the database level (say, by using some sort of eventId or hash of the event content) and accept the extra database load or is there a way to de-duplicate at the instance level, maybe built into the Kubernetes client code? Or do I need to implement some sort of leader election?
I assume this is a pretty common problem. Is there a more general term for this issue that I can search on to learn more?
I looked at the code for GKE event exporter as a reference but I was unable to find any de-duplication, so I assume that it happens on the receiving end.
You should use both leader election and de-duplication at the watcher level. Either one alone won't be enough.
Why is leader election needed?
If high availability is your main concern, you should have leader election between the watcher instances. Only the leader pod writes events to the database. If you don't use leader election, the instances will race with each other to write to the database.
You could check whether the event has already been written to the database before writing it. However, you cannot guarantee that another instance won't write it between your check and your write. In that case, a database-level lock or transaction might help.
Why is de-duplication needed?
Leader election alone will not save you. You also need to implement de-duplication. If your leader pod restarts, it will resync all existing events, so you need a check that decides whether to process each event or not.
Furthermore, if a failover happens, how does the new leader know which events were successfully exported by the previous leader?
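The de-duplication check can be sketched as an atomic "store only if this event ID is new" operation. The example below is a minimal in-memory illustration in plain Java; `DedupingEventStore` and `storeIfNew` are hypothetical names, and the set of seen IDs stands in for something like a unique constraint on an event ID column in the real database.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** In-memory stand-in for a table with a unique constraint on the event id. */
final class DedupingEventStore {
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> stored = new ArrayList<>();

    /**
     * Stores the event only if its id has not been seen before.
     * The check and the write happen under one lock, so two callers
     * cannot both pass the check for the same id.
     */
    synchronized boolean storeIfNew(String eventId, String payload) {
        if (!seenIds.add(eventId)) {
            return false; // duplicate, e.g. replayed after a leader restart
        }
        stored.add(payload);
        return true;
    }

    synchronized int size() {
        return stored.size();
    }
}
```

A new leader that resyncs all existing events can call `storeIfNew` for each of them; events the previous leader already exported are skipped.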

How to identify added and modified children after reconnecting to Zookeeper?

We use Zookeeper to coordinate task execution among our clustered servers. One of our customers has a very unstable network, and our servers keep disconnecting from and reconnecting to Zookeeper.
The problem is that while disconnected, our servers miss the events that occurred and won't handle them even after reconnecting to Zookeeper.
Is there a recommended/standard method to handle such situations using Zookeeper and Apache Curator?
How to identify the current epoch time at Zookeeper ?
My proposal so far is:
We keep track of the last time we were connected to Zookeeper. That's right before we get disconnected.
On re-connecting again, we ask the listener to clearAndRefresh which fires CHILD_ADDED events for all child nodes for monitored path.
On handling these CHILD_ADDED events, we only handle those for paths that were created or modified after the last time we were connected to Zookeeper.
I don't think using timestamps is a good idea. Instead, you can use one of Curator's built-in caches:
TreeCache if you want to watch an entire tree
PathChildrenCache if you want to watch only a sub directory.
It doesn't matter which one you use; both support listening to ChildAdded and DataChanged events, which will do exactly what you need. When you reconnect after having been disconnected, Curator will internally evaluate newly added children and compare the data of existing children to determine changes. No pressure on you. You only need to use the listeners provided.
In terms of accuracy, TreeCache does not guarantee 100% accuracy. So it is better if you can redesign your approach to use PathChildrenCache instead.
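To illustrate the idea of comparing state instead of timestamps, here is a minimal plain-Java sketch that diffs two snapshots of a node's children. It is not Curator's implementation. `ChildDiff` is a hypothetical helper, and each snapshot is assumed to map a child name to its data version (ZooKeeper keeps such a version in the znode's Stat).

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Result of comparing the children seen before a disconnect with those seen after reconnecting. */
final class ChildDiff {
    final Set<String> added = new TreeSet<>();
    final Set<String> modified = new TreeSet<>();
    final Set<String> removed = new TreeSet<>();

    /** Both maps are child name -> data version. */
    static ChildDiff between(Map<String, Integer> before, Map<String, Integer> after) {
        ChildDiff diff = new ChildDiff();
        for (Map.Entry<String, Integer> entry : after.entrySet()) {
            Integer oldVersion = before.get(entry.getKey());
            if (oldVersion == null) {
                diff.added.add(entry.getKey());      // child did not exist before
            } else if (!oldVersion.equals(entry.getValue())) {
                diff.modified.add(entry.getKey());   // data changed while disconnected
            }
        }
        for (String name : before.keySet()) {
            if (!after.containsKey(name)) {
                diff.removed.add(name);              // child was deleted while disconnected
            }
        }
        return diff;
    }
}
```

Comparing versions avoids relying on clocks that may differ between the servers and Zookeeper.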

Synchronising transactions between database and Kafka producer

We have a micro-services architecture, with Kafka used as the communication mechanism between the services. Some of the services have their own databases. Say the user makes a call to Service A, which should result in a record (or set of records) being created in that service’s database. Additionally, this event should be reported to other services, as an item on a Kafka topic. What is the best way of ensuring that the database record(s) are only written if the Kafka topic is successfully updated (essentially creating a distributed transaction around the database update and the Kafka update)?
We are thinking of using spring-kafka (in a Spring Boot WebFlux service), and I can see that it has a KafkaTransactionManager, but from what I understand this is more about Kafka transactions themselves (ensuring consistency across the Kafka producers and consumers), rather than synchronising transactions across two systems (see here: “Kafka doesn't support XA and you have to deal with the possibility that the DB tx might commit while the Kafka tx rolls back.”). Additionally, I think this class relies on Spring’s transaction framework which, at least as far as I currently understand, is thread-bound, and won’t work if using a reactive approach (e.g. WebFlux) where different parts of an operation may execute on different threads. (We are using reactive-pg-client, so are manually handling transactions, rather than using Spring’s framework.)
Some options I can think of:
Don’t write the data to the database: only write it to Kafka. Then use a consumer (in Service A) to update the database. This seems like it might not be the most efficient, and will have problems in that the service which the user called cannot immediately see the database changes it should have just created.
Don’t write directly to Kafka: write to the database only, and use something like Debezium to report the change to Kafka. The problem here is that the changes are based on individual database records, whereas the business significant event to store in Kafka might involve a combination of data from multiple tables.
Write to the database first (if that fails, do nothing and just throw the exception). Then, when writing to Kafka, assume that the write might fail. Use the built-in auto-retry functionality to get it to keep trying for a while. If that eventually completely fails, try to write to a dead letter queue and create some sort of manual mechanism for admins to sort it out. And if writing to the DLQ fails (i.e. Kafka is completely down), just log it some other way (e.g. to the database), and again create some sort of manual mechanism for admins to sort it out.
Anyone got any thoughts or advice on the above, or able to correct any mistakes in my assumptions above?
Thanks in advance!
I'd suggest to use a slightly altered variant of approach 2.
Write into your database only, but in addition to the actual table writes, also write "events" into a special table within that same database; these event records would contain the aggregations you need. In the easiest way, you'd simply insert another entity e.g. mapped by JPA, which contains a JSON property with the aggregate payload. Of course this could be automated by some means of transaction listener / framework component.
Then use Debezium to capture the changes just from that table and stream them into Kafka. That way you have both: eventually consistent state in Kafka (the events in Kafka may trail behind or you might see a few events a second time after a restart, but eventually they'll reflect the database state) without the need for distributed transactions, and the business level event semantics you're after.
(Disclaimer: I'm the lead of Debezium; funnily enough I'm just in the process of writing a blog post discussing this approach in more detail)
Here are the posts
https://debezium.io/blog/2018/09/20/materializing-aggregate-views-with-hibernate-and-debezium/
https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
First of all, I have to say that I’m neither a Kafka nor a Spring expert, but I think this is more of a conceptual challenge when writing to independent resources, and the solution should be adaptable to your technology stack. Furthermore, I should say that this solution tries to solve the problem without an external component like Debezium, because in my opinion each additional component brings challenges in testing, maintaining and running an application, which are often underestimated when choosing such an option. Also, not every database can be used as a Debezium source.
To make sure that we are talking about the same goals, let’s clarify the situation with a simplified airline example, where customers can buy tickets. After a successful order the customer will receive a message (mail, push-notification, …) that is sent by an external messaging system (the system we have to talk with).
In a traditional JMS world with an XA transaction between our database (where we store orders) and the JMS provider, it would look like the following: the client sends the order to our app, where we start a transaction. The app stores the order in its database. Then the message is sent to JMS and you can commit the transaction. Both operations participate in the transaction even though they’re talking to their own resources. As the XA transaction guarantees ACID, we’re fine.
Let’s bring Kafka (or any other resource that is not able to participate in the XA transaction) into the game. As there is no longer a coordinator that syncs both transactions, the main idea of the following is to split processing into two parts with a persistent state.
When you store the order in your database you can also store the message (with aggregated data) that you want to send to Kafka afterwards in the same database (e.g. as JSON in a CLOB column). Same resource, ACID guaranteed, everything fine so far. Now you need a mechanism that polls your “KafkaTasks” table for new tasks that should be sent to a Kafka topic (e.g. with a timer service; maybe the @Scheduled annotation can be used in Spring). After the message has been successfully sent to Kafka you can delete the task entry. This ensures that the message to Kafka is only sent when the order is also successfully stored in the application database. Did we achieve the same guarantees as we have when using an XA transaction? Unfortunately, no, as there is still the chance that writing to Kafka works but the deletion of the task fails. In this case the retry mechanism (you would need one, as mentioned in your question) would reprocess the task and send the message twice. If your business case is happy with this “at-least-once” guarantee, you’re done here with an imho semi-complex solution that could easily be implemented as framework functionality so not everyone has to bother with the details.
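The polling step described above might look roughly like the following sketch. It is plain Java with an in-memory map standing in for the “KafkaTasks” table and a `Consumer<String>` standing in for the Kafka producer; `OutboxPoller` and `pollOnce` are hypothetical names, and no Kafka or Spring API is used.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

final class OutboxPoller {
    /** Stand-in for the "KafkaTasks" table: task id -> JSON payload, in insertion order. */
    final Map<Long, String> tasks = new LinkedHashMap<>();

    /**
     * One polling cycle: send each pending task and delete it only after the
     * send returned normally. Returns the number of tasks sent in this cycle.
     */
    int pollOnce(Consumer<String> sender) {
        int sent = 0;
        Iterator<Map.Entry<Long, String>> pending = tasks.entrySet().iterator();
        while (pending.hasNext()) {
            Map.Entry<Long, String> task = pending.next();
            try {
                sender.accept(task.getValue());
            } catch (RuntimeException sendFailed) {
                break; // keep this and all later tasks for the next cycle, preserving order
            }
            // If the process died right here, the task would be sent again
            // on the next cycle: that is the at-least-once behaviour.
            pending.remove();
            sent++;
        }
        return sent;
    }
}
```

In a Spring application, `pollOnce` could be invoked from a scheduled method; consumers of the topic should tolerate the occasional duplicate.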
If you need “exactly-once” then you cannot store your state in the application database (in this case “deletion of a task” is the “state”) but instead you must store it in Kafka (assuming that you have ACID guarantees between two Kafka topics). An example: Let’s say you have 100 tasks in the table (IDs 1 to 100) and the task job processes the first 10. You write your Kafka messages to their topic and another message with the ID 10 to “your topic”. All in the same Kafka-transaction. In the next cycle you consume your topic (value is 10) and take this value to get the next 10 tasks (and delete the already processed tasks).
If there are easier (in-application) solutions with the same guarantees, I’m looking forward to hearing from you!
Sorry for the long answer but I hope it helps.
All the approaches described above are sound ways to approach the problem and are well-defined patterns. You can explore them through the links provided below.
Pattern: Transactional outbox
Publish an event or message as part of a database transaction by saving it in an OUTBOX in the database.
http://microservices.io/patterns/data/transactional-outbox.html
Pattern: Polling publisher
Publish messages by polling the outbox in the database.
http://microservices.io/patterns/data/polling-publisher.html
Pattern: Transaction log tailing
Publish changes made to the database by tailing the transaction log.
http://microservices.io/patterns/data/transaction-log-tailing.html
Debezium is a valid answer but (as I've experienced) it can require some extra overhead of running an extra pod and making sure that pod doesn't fall over. This could just be me griping about a few back to back instances where pods OOM errored and didn't come back up, networking rule rollouts dropped some messages, WAL access to an aws aurora db started behaving oddly... It seems that everything that could have gone wrong, did. Not saying Debezium is bad, it's fantastically stable, but often for devs running it becomes a networking skill rather than a coding skill.
A KISS solution using normal coding techniques that will work 99.99% of the time (and inform you of the .01%) would be:
Start Transaction
Sync save to DB
-> If it fails, bail out.
Async send message to Kafka.
Block until the topic reports that it has received the message.
-> If it times out or fails, abort the transaction.
-> If it succeeds, commit the transaction.
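The steps above can be sketched in plain Java as follows. The database transaction is simulated with a pending list that is only copied into `committedRows` on commit, and the Kafka send is represented by a function returning a `CompletableFuture`; `SaveThenPublish` and `handle` are hypothetical names, not part of any Kafka or Spring API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

final class SaveThenPublish {
    /** Stand-in for the database: rows become visible only on commit. */
    final List<String> committedRows = new ArrayList<>();

    /**
     * Saves the record inside a simulated transaction, sends the message
     * asynchronously, blocks for the acknowledgement, and commits only if
     * the acknowledgement arrives within the timeout.
     */
    boolean handle(String record,
                   Function<String, CompletableFuture<Void>> asyncSend,
                   long timeoutMillis) {
        List<String> pending = new ArrayList<>(); // start transaction
        pending.add(record);                      // sync save to DB
        try {
            // async send, then block until the topic acknowledges the message
            asyncSend.apply(record).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception sendFailedOrTimedOut) {
            if (sendFailedOrTimedOut instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            return false;                         // abort: pending rows are discarded
        }
        committedRows.addAll(pending);            // commit
        return true;
    }
}
```

The remaining gap is that the commit itself can still fail after the message was acknowledged, which is the small share of cases that has to be detected and reported.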
I'd suggest using a new approach: the 2-phase message. With this approach, much less code is needed, and you don't need Debezium any more.
https://betterprogramming.pub/an-alternative-to-outbox-pattern-7564562843ae
For this new approach, what you need to do is:
When writing your database, write an event record to an auxiliary table.
Submit a 2-phase message to DTM
Write a service to query whether an event is saved in the auxiliary table.
With the help of the DTM SDK, you can accomplish the above 3 steps with 8 lines of Go, much less code than other solutions.
msg := dtmcli.NewMsg(DtmServer, gid).
    Add(busi.Busi+"/TransIn", &TransReq{Amount: 30})
err := msg.DoAndSubmitDB(busi.Busi+"/QueryPrepared", db, func(tx *sql.Tx) error {
    return AdjustBalance(tx, busi.TransOutUID, -req.Amount)
})
app.GET(BusiAPI+"/QueryPrepared", dtmutil.WrapHandler2(func(c *gin.Context) interface{} {
    return MustBarrierFromGin(c).QueryPrepared(db)
}))
Each of your original options has its disadvantage:
The user cannot immediately see the database changes they have just created.
Debezium captures the database log, which may be much larger than the events you want. Also, deploying and maintaining Debezium is not an easy job.
"Built-in auto-retry functionality" is not cheap; it may require a lot of code or maintenance effort.

How to run something on each node in service fabric

In a Service Fabric application, using Actors or Services, what would the design be if you wanted to make sure that your block of code runs on each node?
My first idea would be that it has to be a Service with instance count set to -1, but what about cases where you have set it to 3 instances? How would you design it so that the service ensures it runs some operation on each instance?
My own idea would be to have an Actor with state controlling the operations that need to run, which would iterate over the services using ServiceProxy to call methods on each instance. But that's just a naive idea, and I don't know whether it is possible or whether it is the proper way to do it.
Some background info
Only Stateless services can be given a -1 for instance count. You can't use a ServiceProxy to target a specific instance.
Stateful services are deployed using 1 or more partitions (data shards). Partition count is configured in advance, as part of the service deployment and can't be changed automatically. For instance if your cluster is scaled out, partitions aren't added automatically.
Autonomous workers
Maybe you can invert the control flow by running Stateless services (on all nodes) and have them query a 'repository' for work items. The repository could be a Stateful service, that stores work items in a Queue.
This way, adding more instances (scaling out the cluster) increases throughput without code modification. The stateless service instances become autonomous workers.
(as opposed to an intelligent orchestrator Actor)
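The inverted control flow can be illustrated with a small plain-Java sketch: a shared queue plays the role of the work-item repository, and a fixed number of threads play the role of the stateless service instances. `AutonomousWorkers` and `drain` are hypothetical names, and no Service Fabric API is involved.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class AutonomousWorkers {
    /**
     * Starts workerCount workers that each pull items from the shared
     * repository until it is empty, and returns the items that were processed.
     * The repository must be a thread-safe queue.
     */
    static List<String> drain(Queue<String> repository, int workerCount) {
        List<String> processed = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            pool.execute(() -> {
                String item;
                // poll() hands each item to exactly one worker
                while ((item = repository.poll()) != null) {
                    processed.add(item);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }
}
```

Raising `workerCount` corresponds to scaling out the cluster: throughput increases without any change to the worker code or to the repository.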