Cluster goes down after PC1 goes down - geode

I have 2 PCs, and I run the following commands in the gfsh terminal on each of them:
start locator --name=locator1 --locators=ipaddress1[10334],ipaddress2[10334]
start server --name=server1 --locators=ipaddress1[10334],ipaddress2[10334]
After they start, I am able to see all 4 members on both terminals when I run list members.
NOW:
Say I run these commands on PC1 first, then on PC2 second (so PC1 is the first one online).
If I shut down PC2 to simulate a PC failure, PC1 is OK; when I list members, it shows 2 (its locator and server).
I bring PC2 back up, run the commands again, and everything is good with 4 members again.
HOWEVER,
if I shut down PC1 (the first PC in the original cluster startup), PC2 drops its connection to everything shortly after (about 5 seconds). The gfsh connection is dropped and I am unable to connect to localhost at all, but the process IDs for the server and locator are still running.
The logs say: Membership service failure: Exiting due to possible network partition event due to loss of 2 cache processes.
When I bring PC1 back online and run the locator and server commands, I can connect again on PC2.
Can anyone help me with this? I am having a really hard time figuring out what is happening here.

Geode members automatically shut themselves down whenever more than half of the membership quorum (measured by member weight) is lost, basically to prevent split-brain situations and data corruption. The asymmetry you are seeing comes from the weighting: by default a locator weighs 3, a cache server 10, and the lead member (the first cache server to join the cluster) 15. Losing PC1 therefore removes 18 of the 31 total weight (about 58%), so PC2's members declare a partition and exit, while losing PC2 only removes 13 (about 42%), which PC1 can tolerate.
You can find more details about this in Network Partitioning.
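If you really only have these two hosts and would rather keep the surviving PC alive at the cost of split-brain protection, one option is to turn the quorum check off. This is only a sketch: enable-network-partition-detection is the standard Geode/GemFire property, but whether disabling it is acceptable depends entirely on your consistency requirements.

# pass the property to every member (both PCs) via gfsh's --J option
start locator --name=locator1 --locators=ipaddress1[10334],ipaddress2[10334] --J=-Dgemfire.enable-network-partition-detection=false
start server --name=server1 --locators=ipaddress1[10334],ipaddress2[10334] --J=-Dgemfire.enable-network-partition-detection=false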
Cheers.

Related

Prevent Data Loss during PostgreSQL Host Shutdown

So I've spent the better part of my day (and several searches before) looking for a workable solution to prevent data loss when the host of a PostgreSQL server installation gets rebooted or shut down. We maintain a number of Azure and on-prem servers, and someone inadvertently shutting down a server without first ensuring Postgres is no longer flushing data to disk happens far more often than it should. Of note, we are a Windows Server shop.
Our current best practice (which works if followed appropriately) is to stop the Postgres service, then watch disk writes to the Postgres data directory in Resource Monitor. Once nothing is writing to that directory, shut down the host. I have to think there's a better way to ensure the host doesn't get shut down in a manner that leads to data corruption, regardless of adherence to the best practice (or, in some cases, because Windows Update mandates a reboot regardless of configured settings telling it not to).
Some things I've considered, but have been unable to find solid answers for:
Create a scheduled task that uses the "On an event" trigger to monitor the System log for event 1074. It would have to be configured to "run whether the user is logged in or not". The script would cancel the shutdown command with shutdown /a, then run a script to elegantly shut down Postgres. I've seen mixed reports on whether the scheduled job would reliably trigger before Task Scheduler is terminated in the shutdown sequence.
Create a shutdown script using Group Policy. My question there is: will it wait for the script to complete before executing the shutdown?
How do you deal with data loss on your Windows-hosted Postgres servers?
First, if you register PostgreSQL as a Windows service, a shutdown of the machine will automatically shut down PostgreSQL first.
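If you do still want an explicit, scripted stop before a reboot (as in the scheduled-task and Group Policy ideas above), a minimal sketch looks like the following; the service name and paths are assumptions and depend on your installer and version:

REM stop the Windows service and wait for it to finish
net stop "postgresql-x64-14"
REM or do a fast (clean) shutdown via pg_ctl against your data directory
"C:\Program Files\PostgreSQL\14\bin\pg_ctl.exe" stop -D "C:\Program Files\PostgreSQL\14\data" -m fast

pg_ctl's fast mode rolls back open transactions and performs a clean, checkpointed shutdown, so once it returns successfully nothing is left writing to the data directory.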
But even without that, a properly configured PostgreSQL server on proper hardware will never suffer data loss (unless you hit a rare PostgreSQL software bug). It is one of the basic requirements for a relational database to survive crashes without data loss.
To enumerate a few things that come to mind:
make sure that the PostgreSQL parameters fsync and synchronous_commit are set to on
make sure that you are using a reliable file system for the data files and the WAL (a Windows network share is not a reliable file system)
make sure you are using storage whose write caches are battery-backed (or disabled)
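To double-check the first point, the two settings can be read back directly; the psql invocation below assumes a local superuser connection and may need adjusting for your environment:

psql -U postgres -c "SELECT name, setting FROM pg_settings WHERE name IN ('fsync', 'synchronous_commit');"

Both should come back as on (the default), since turning either of them off trades crash safety for speed.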

Fabric Network - what happens when a downed peer connects back to the network?

I recently deployed a Fabric network using Docker Compose and was trying to simulate a downed peer. Essentially this is what happens:
4 peers are brought online using docker-compose, running a Fabric network
1 peer, i.e. the 4th peer, goes down (done via the docker stop command)
Invoke transactions are sent to the root peer, and the results are verified by querying the peers (excluding the downed peer) after some time.
The downed peer is brought back up with docker start. Query transactions run fine on the always-on peers but fail on the newly woken-up peer.
Why isn't the 4th peer synchronizing the blockchain once it's up? Is there a step to be taken to ensure it does, or is it discarded as a rogue peer?
This might be due to the expected behavior of PBFT (assuming you are using it): with 4 peers the network tolerates f = 1 fault, so 2f+1 = 3 peers make progress while one may lag behind. As explained on issue 933,
I think what you're seeing is normal PBFT behavior: 2f+1 replicas are making progress, and f replicas are lagging slightly behind, and catch up occasionally.
If you shut down another peer, you should observe that the one you originally shut off and restarted will now participate fully, and the network will continue to make progress. As long as the network is making progress, and the participating nodes share a correct prefix, you're all good. The reason for f replicas lagging behind is that those f may be acting byzantine and progress deliberately slowly. You cannot tell a difference between a slower correct replica, and a deliberately slower byzantine replica. Therefore we cannot wait for the last f stragglers. They will be left behind and sync up occasionally. If it turns out that some other replica is crashed, the network will stop making progress until one correct straggler catches up, and then the network will progress normally.
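If you want to confirm that the restarted peer eventually syncs, one way is to compare block heights across the peers after the invokes. This is only a sketch against the v0.6 REST API; the container names (vp0, vp3) and the REST port are assumptions taken from the common docker-compose examples, so adjust them to your CORE_REST_ADDRESS setting:

# block height as reported by a healthy peer and by the restarted one;
# the restarted peer's height should converge once it catches up
curl -s http://vp0:7050/chain
curl -s http://vp3:7050/chain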
Hyperledger Fabric v0.6 does not support adding peers dynamically; I am not sure about HF v1.0.

Wakanda Server solution.quitServer() sequence of operations

I have already read the thread:
Wakanda Server scripted clean shutdown
This does not address my question.
We are running Wakanda Server 11.197492.
We want an automated, orderly, ensured shut-down of Wakanda Server - no matter which version we are running.
Before we give the "shutdown" command, we will stop inbound traffic for 1 to 2 minutes, to ensure that no httpHandlers are running when we shut down.
We have scripted a single SharedWorker process to look for the "shutdown" command, and execute solution.quitServer().
At this time no other SharedWorker processes are running, and no active threads should be executing. This will likely not always be the case.
When this is executed, is a "solution quit" guaranteed?
Is solution.quitServer() the best way to initiate an automated solution shutdown?
Will there be a better way?
Is there a way to know if any of the Solution's Projects are currently executing threads prior to shutting down?
If more than 1 Project issues a solution.quitServer() call within a few seconds of each other, will that be a problem?
solution.quitServer() is probably not the best way to shut down your server, as it will be deprecated in the next major release.
I would recommend sending a kill signal instead, as discussed in the thread you linked:
Wakanda Server scripted clean shutdown
Some fixes were made in v1.1.0 to safely close Wakanda Server after a kill.
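For reference, a minimal sketch of a signal-based shutdown on a Unix-like host; the process match pattern is an assumption, so adjust it to however your server binary is launched:

# send the default TERM signal to the Wakanda Server process
pkill -f "Wakanda Server"
# if your version needs a hard kill (as suggested above), force it
pkill -9 -f "Wakanda Server"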

MongoDB Ops Manager can't start mongos and shard

I came across a problem where I have an Ops Manager that is supposed to run a MongoDB cluster as an automated cluster.
Suddenly the servers started going down unexpectedly, and there are no errors in any of the log files indicating what the problem is.
The Ops Manager gets stuck on the blue label
We are deploying your changes. This might take a few minutes
And it just never goes away.
Because this environment is based on the automation feature, the MMS automation manages the user on the servers and runs all of the processes as "mongod", which I can't access even as root (administrator).
As far as the Ops Manager goes, it shows that a shard in a replica set is down although it's live, and it thinks that a mongos that is dead is alive.
Has anyone run into this situation before who might be able to help?
Thanks,
Eliran.
Problem found: there was an NTP mismatch between the servers in the cluster. The servers' clocks were not synced, so every time the Ops Manager did something it got responses with wrong timestamps and could not apply its time limits.
After re-configuring all the servers to use the same NTP source, everything went back to how it should be :)
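For anyone hitting the same symptom, it is worth comparing clock offsets on every server in the cluster before digging deeper into Ops Manager. Which command applies depends on whether the host runs ntpd, chrony, or Windows time; all three are shown only as examples:

# ntpd: offset and jitter for each configured time source
ntpq -p
# chrony: offset from the currently selected source
chronyc tracking
# Windows: current sync source and offset
w32tm /query /status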

How do I change a socket connection to just time out vs. being completely down?

I have a bug where a long-running process works fine for the first few days, but then a query to Redis hits the 45-second timeout I have set. That is, if Redis were totally down my program would just crash, but it doesn't; it waits (45 seconds), times out, and tries again for another 45 seconds, over and over.
If I stop the process and restart it, everything is fine again for another few days.
This is running on EC2 with Elastic Load Balancing, with my process on a different box than Redis.
I need to re-create this situation in my local development environment. How can I avoid killing my local Redis and instead put it into a state where reads will time out?
Maybe turn off the port? Although that would probably be interpreted as connection refused/down rather than a timeout.
Perhaps put another non-Redis app on said port and just have it not respond; in other words, accept incoming connections but never reply. You could write a simple app in the language of your choice that accepts TCP connections and then does nothing, and have it listen on the Redis port to test this situation.
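A couple of concrete ways to get a hang (rather than a refusal) locally; the sleep duration and the spare port below are arbitrary examples:

# make the local Redis itself unresponsive for 60 seconds without killing it
redis-cli DEBUG SLEEP 60
# or run a "server" that accepts connections and never replies, and point the
# app at it (netcat flag syntax varies between builds; some need -l -p 6380)
nc -lk 6380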