What happens to the application when ApplicationIntent=ReadOnly is set and the secondary node goes down? - alwayson

A failover has occurred, and only one node is left, acting as the primary. What happens to the application when ApplicationIntent=ReadOnly is set?
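For reference, the application connects to the availability group listener roughly like this (a pyodbc sketch; the listener, database, and driver names are placeholders):
# Hypothetical read-intent connection to an AlwaysOn availability group
# listener; server and database names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myag-listener,1433;"   # the AG listener, not an individual node
    "Database=MyDatabase;"
    "ApplicationIntent=ReadOnly;"      # request a readable secondary
    "MultiSubnetFailover=Yes;"
    "Trusted_Connection=Yes;"
)
cur = conn.cursor()
cur.execute("SELECT @@SERVERNAME")     # shows which replica actually served us
print(cur.fetchone()[0])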

Related

Mongo cluster, safe reboot for secondary

Let's say we have a Mongo cluster (3 or more nodes). We realized that even a quick restart of a Secondary node affects the Primary. We need to shut down a Secondary for a short time for some reason. What is the best/correct procedure (with concrete command examples, please)? Should we remove the node from the cluster, or is it enough to put it into maintenance mode, like this:
mongocluster:SECONDARY> db.adminCommand({"replSetMaintenance":true})
Does this command affect only the particular Secondary node it was applied on?
Do we also need to switch that node to hidden mode for maintenance?
Do we also need to switch that node to delayed replica mode for maintenance?
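In driver terms, the plan would be something like the following (a PyMongo sketch; the host name is a placeholder). As far as I understand, replSetMaintenance is applied per node: it only affects the member it is run against, which moves into the RECOVERING state:
# Hypothetical sketch: toggle maintenance mode on a single secondary.
# Host name is a placeholder; connect directly to the node in question.
from pymongo import MongoClient

secondary = MongoClient("node2.example.com", 27017, directConnection=True)
secondary.admin.command("replSetMaintenance", True)   # member enters RECOVERING

status = secondary.admin.command("replSetGetStatus")
print(status["myState"])                              # 3 == RECOVERING

# ... do the maintenance work here ...

secondary.admin.command("replSetMaintenance", False)  # member returns to SECONDARY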

Issues with failback for logical replication using Postgres

I am using PostgreSQL 13 to set up logical replication. When the original active node (A) becomes the secondary, the prior secondary (B), which is now the active node, needs to sync its data over to node (A).
To summarize the issue:
Node A is active and fails at some point in time. Node B takes over, is now the active node, and accepts I/O from the application. When Node A recovers from the failure, it is ready to become active again. For this to happen, Node A has to obtain the data that may have been added while it was down. To get this data, Node A creates a subscription to Node B, which is now acting as the publisher. The issue is that this subscription on Node A fails, because Node A already had some data from before it went down, and that data results in conflicts.
So what are my options here?
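One option I am weighing is a full resync of Node A: wipe A's stale copies and let the initial table synchronization repopulate them from B. Roughly (a psycopg2 sketch; host, publication, and table names are placeholders):
# Hypothetical failback sketch; connection details and names are placeholders.
import psycopg2

conn = psycopg2.connect("host=node_a dbname=appdb user=postgres")
conn.autocommit = True   # CREATE SUBSCRIPTION cannot run inside a transaction block
cur = conn.cursor()

# Drop A's stale rows so the initial copy from B cannot conflict.
cur.execute("TRUNCATE TABLE orders, customers;")

# Re-subscribe to B; the default copy_data = true repopulates the tables
# with B's current state, including changes made while A was down.
cur.execute("""
    CREATE SUBSCRIPTION sub_failback
    CONNECTION 'host=node_b dbname=appdb user=replicator'
    PUBLICATION pub_b;
""")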

Mongo behaviour once master is down?

Consider the below diagram of a MongoDB deployment.
I have two scenarios.
Scenario 1:
The router directs the write call to the master. It is written to the master, but the master goes down before the write is replicated to the slaves (I am using synchronous replication mode).
Will the router select one slave as master and also write the above request to both slaves?
Scenario 2:
The router directs the write call to the master. It is written to the master, but then the network link between it and one slave is broken (using synchronous replication mode).
Will the router select another slave (which is connected to all other nodes) as master and also write the above request to that slave?
Let's first use MongoDB terminology: Primary instead of master and Secondary instead of slave.
Scenario 1: Will the router select one slave as master and also write the above request to both slaves?
A secondary can become a primary. If the current primary becomes unavailable, the replica set holds an election to choose which of the secondaries becomes the new primary. See also Replica Set Elections.
In scenario 1, if the primary had accepted write operations that the secondaries had not successfully replicated before the primary stepped down, then a rollback will revert those write operations on the former primary when the node rejoins the replica set. See also Rollbacks During Replica Set Failover.
You can run all voting members with journaling enabled and use writeConcern majority to prevent rollbacks. See also Avoid Replica Set Rollbacks.
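For example, a majority write concern can be requested from the driver (a PyMongo sketch; host names are placeholders; in the shell the equivalent is writeConcern: { w: "majority" }):
# Sketch: request majority acknowledgement so an acknowledged write cannot
# be rolled back after a failover. Host names are placeholders.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://node1,node2,node3/?replicaSet=rs0",
    w="majority",   # wait until a majority of voting members have the write
    journal=True,   # and until it has been journaled
)
client.test.orders.insert_one({"sku": "abc", "qty": 1})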
Scenario 2: Will the router select another slave (which is connected to all other nodes) as master and also write the above request to that slave?
There are two parts here; the first part is the replica set election. In this case, because the primary and one of the secondaries still form a majority, no election will be held. The primary will remain primary and keep replicating to one of the secondaries.
The second part is about replication of data. Secondary members copy the oplog from their sync source and apply these operations in an asynchronous process. A secondary's sync source may automatically change as needed, based on changes in the ping time and the state of other members' replication. See also Replica Set Data Synchronization.
In scenario 2, the secondary may change its sync source to the other secondary.
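You can observe this in replSetGetStatus, which reports each member's current sync source (a PyMongo sketch; host names are placeholders):
# Sketch: print each member's state and sync source after a topology change.
from pymongo import MongoClient

client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    # older MongoDB versions report the sync source as "syncingTo"
    source = member.get("syncSourceHost") or member.get("syncingTo", "")
    print(member["name"], member["stateStr"], source)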
You may also find the following useful:
Replica Set High Availability
Replica Set Deployment Architectures
Replica Set Distributed Across Two or More Data Centers

MongoDB replica set failed

I have a MongoDB replica set consisting of three nodes: one Primary, one Secondary, and one Arbiter.
While I was performing the initial re-sync of the secondary node from the primary, the primary node got terminated. When I checked the logs of the primary node, the exception shown was:
SEVERE: Invalid access at address: 0x7fcde1e00ff0
SEVERE: Got signal: 7 (Bus error)
Since then, the primary node fails to start due to this exception, and the secondary node is stuck in the STARTUP2 state.
I am able to start the primary node on a different port as a standalone node (or in maintenance mode) and read its data. But whenever I run it as part of the replica set, it terminates with the above exception.
Both the primary and the secondary use RAID0 for their storage. The data size is around 550 GB.
I copied the whole data directory of the primary node (currently down) to the secondary node (in the STARTUP2 state) and then restarted the secondary node, but that didn't work either. The secondary node gets elected primary on restart but terminates within a second of the election with the exception below:
SEVERE: Fatal DBException in logOp(): 10334 BSONObj size: 50359410 (0x3006C72) is invalid. Size must be between 0 and 16793600(16MB) First element: 2: ?type=111
SEVERE: terminate() called, printing stack (if implemented for platform):
0x11fd1b1 0x11fc438 0x7ff56dc01846 0x7ff56dc01873 0xe54c9e 0xc4de1b 0xc58f46 0xa0bac1 0xa0c250 0xa0f1bf 0xa0fcc1 0xa1323e 0xa2949a 0xa2af32 0xa2cd36 0xd61654 0xba21a2 0xba3780 0x7724a9 0x11b2fde
How can I recover and restore the replica set in this case?
I also have a backup of this data. Can I drop this replica set and recreate it with the backup data?
There is another replica set in this MongoDB cluster which is working fine.
Your secondary server is ineligible because of replication lag.
Can you post the output of rs.status()?
Your secondary server probably has a "could not find member to sync from" infoMessage.
I've run into something similar before due to bad RAM; the root cause can be anything.
Fix it by copying the primary server's data into another folder on the secondary, starting a new instance on some other port there, and then adding it to the replica set (with the { force: true } option) so the secondary server has somewhere to sync from.
You can also destroy the replica set and create it again, but beware not to lose your replica set's oplog.
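The forced reconfig step might look roughly like this (a PyMongo sketch; host names are placeholders, and it assumes the copied data files are already in place under the new instance's dbpath):
# Sketch: add the freshly seeded instance back via a forced reconfig.
from pymongo import MongoClient

client = MongoClient("node1.example.com", 27017, directConnection=True)
config = client.admin.command("replSetGetConfig")["config"]
config["version"] += 1
config["members"].append({"_id": 3, "host": "node2.example.com:27018"})  # new instance
client.admin.command("replSetReconfig", config, force=True)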

ZooKeeper Failover Strategies

We are a young team building an application using Storm and Kafka.
We have a common ZooKeeper ensemble of 3 nodes which is used by both Storm and Kafka.
I wrote a test case to test ZooKeeper failover:
1) Check that all three nodes are running and confirm one is elected as the leader.
2) Using the ZooKeeper Unix client, create a znode and set a value. Verify the value is reflected on the other nodes.
3) Modify the znode: set the value on one node and verify the other nodes have the change reflected.
4) Kill one of the worker nodes and make sure the master/leader is notified about the crash.
5) Kill the leader node. Verify that one of the other two nodes is elected as the leader.
Do I need to add any more test cases? Any additional ideas/suggestions/pointers?
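For steps 2 and 3, I also scripted the checks (a kazoo sketch; host names are placeholders):
# Sketch of test steps 2-3 using the kazoo client; hosts are placeholders.
from kazoo.client import KazooClient

writer = KazooClient(hosts="zk1:2181")
reader = KazooClient(hosts="zk2:2181")   # a different ensemble member
writer.start()
reader.start()

writer.create("/failover-test", b"v1")            # step 2: create a znode
reader.sync("/failover-test")                     # make this server catch up
assert reader.get("/failover-test")[0] == b"v1"   # value visible on another node

writer.set("/failover-test", b"v2")               # step 3: modify the znode
reader.sync("/failover-test")
assert reader.get("/failover-test")[0] == b"v2"

writer.delete("/failover-test")
writer.stop()
reader.stop()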
From the documentation (Hadoop's ZooKeeper-based NameNode HA):
Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
More on setting up automatic failover is covered in the same documentation.
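To script the "locate the active NameNode" step, one could wrap the hdfs haadmin CLI (a sketch; nn1/nn2 are placeholder service IDs from hdfs-site.xml):
# Sketch: find the active NameNode before triggering the test outage.
import subprocess

for service_id in ("nn1", "nn2"):            # placeholder NameNode service IDs
    state = subprocess.run(
        ["hdfs", "haadmin", "-getServiceState", service_id],
        capture_output=True, text=True,
    ).stdout.strip()
    print(service_id, state)                 # prints "active" or "standby"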