I am planning the creation of a PostgreSQL HA cluster that spans multiple data centres on different continents and trying to figure out how to tweak the election parameters in etcd and patron so that we are unlikely to failover accidentally to a continent from our app servers unless the databases closer to the application servers are down.
So far in my research I have concluded that I should be able to tweak the election timeout settings in etc so that the variance between the servers is less than the latency of the intercontinental hop. This should help on the etcd side with helping to prevent far-away servers deciding to take over. However.... how do I prevent the same on the Patroni side? Is there a best practice for handicapping servers we want to avoid being put into the master role?
Related
I recently started to learn more about service registries and their usage in distributed architecture.
All the applications providing service registries that I found (etcd, Consul, or Zookeeper) are based on the same model: a master-server/cluster with leader election.
Correct me if I'm wrong but... doesn't this make the architecture less reliable ? In the sense that the master cluster brings a point-of-failure. To circumvent this we could always make a bigger cluster but it's more costly and/or less-performance effective.
My questions here are:
as all these service registries elect a leader — wouldn't it be possible to do the same without specifying the machines that form the master cluster but rather let them discover themselves through broadcasting and elect a leader or a leading group ?
does a service registry without master-server/cluster exists ?
and if not, what are the current limitations that prevent us from doing this ?
All of those services are based on one whitepaper - Google Chubby(https://ai.google/research/pubs/pub27897). The idea is to have fast and consistent configuration storage for distributed systems. To get there you need to eliminate a single point of failure. How you can do that? You introduce multiple machines storing the same data and also replicate the data. But in that case, considering unreliable network between those machines, how do you make sure that the data is consistent among nodes? You choose one of the nodes from the cluster to be Leader(using distributed leader election algorithm), if nodes have inconsistent values between them, it's a leaders job to pick the "correct" one. It looks like we've returned to a "single point of failure" situation, but in reality if the leader fails, the rest of the cluster just votes and promotes a new leader. So Leader role in those systems is NOT to be a Single point of truth, but rather a Single point of decision making
Scenario: you have a Kafka-Cluster in different DCs but they are configured as one cluster. So there is no mirroring through MirrorMaker or something liket hat. The DCs are not very far from eatch other. But they are physically separated.
Now what do you have to do to ensure that the cluster is failsafe on BOTH SIDES if the connection between those two DCs is down? So on BOTH sides the producers and consumer should still work.
I would guess: you need multiple Zookeepers on both sides and multiple Kafka-Nodes.
But is that enough? Does the cluster heal itself after getting reconnected?
Thanks in advance.
I'm assuming your datacenters that "are not very far from eatch other" are basically Availability Zones (AZs).
It's pretty common to spread a cluster over multiples AZs. However it's usually not desired or possible that each "slice" can live on its own.
The immediate issue is Zookeeper which by design prevents split-brain scenarios. So if a ZK cluster is split only one "slice" (at best) will carry on working. So the brokers that are on a side of the non working ZK clusters will be non functional.
Then let's say it was possible to have both sides keep working. What happens when you joins both sides again?
Data is likely to have diverged as clients wrote data to each side separately. Now you could have the same partition with different messages for the same offset and no way to resolve the conflict as both options are "valid".
I hope this shows why this is not a possible solution. In practice, if an AZ goes offline, it is non functional until it is brought back online.
Clients that were connected to the offline AZ should reconnect to the other AZ (using multiple bootstrap servers) and clients that were in the failed AZ should be reprovisioned into another one.
If configured correctly, Kafka can survive an AZ outage (even though in practice, it's best to have 3 AZs) and keep all resources available. Also in this scenario, the cluster will automatically return to a good state when the failed AZ returns.
What is the best practice to get Geo distributed cluster with asynchronous network channels ?
I suspect I would need to have some "load balancer" which should redirect connections "within" it's own DC, do you know anything like this already in place?
Second question, should we use one HA cluster or create dedicated cluster for each of the DC ?
The assumption of the kubernetes development team is that cross-cluster federation will be the best way to handle cross-zone workloads. The tooling for this is easy to imagine, but has not emerged yet. You can (on your own) set up regional or global load-balancers and direct traffic to different clusters based on things like GeoIP.
You should look into Byzantine Clients. My team is currently working on a solution for erasure coded storage in asynchronous network that prevents some problems caused by faulty clients, but it relies on correct clients to establish a consistent state across the servers.
The network consists of a set of servers {P1, ...., Pn} and a set of clients {C1, ..., Cn}, which are all PTIM with running time bounded by a polynomial in a given securty parameter. Servers and clients together are parties. Theres an adversary, which is a PITM with running time boundded by a polynoil. Servers nd clients are controlled by adversary. In this case, theyre calld corruptd, othrwise, theyre called honest. An adversary that contrls up to t servers is called t-limited.
If protecting innocent clients from getting inconsistent values is a priority, then you should go ne, but from the pointview of a client, problems caused by faulty clients don't really hurt the system.
If find the replica set requirement a bit confusing, and I'm probably missing something obvious (like under which condition there are elections).
I understand that in normal operations you need quorum, and a voting takes place and to get a majority you need and odd numbers of machines.
But since we use a replica set for failover, if the master dies, then we are left with an even number of voting members, which based on my limited experience lengthen the time to elect a primary.
Also according to the documentation, the addition of a voting member doesn't start an election, it would seem that starting (booting) you replica set with an even number of nodes would make more sense?
So if we start say with 4 machines in the replica set, and one machine dies, there is a re-election with 3 machines, fast quorum. We add a machine back to get back to our normal operation state, no re-election and we are back to our normal operation conditions.
Can someone shed a light on this?
TL;DR: With single master systems, even partitions make it impossible to determine which remainder still has a majority, taking both systems down.
Let N be a cluster of four machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
Let M be a cluster of three machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
=> Same result at 3/4 of the cost.
Now, let's add an assumption or two:
We're also going to operate some kind of server application that uses the database
The network can be partitioned
Let's say you have two datacenters, one with two database instances and the backend server machines. If the connection to the backup center (which has one MongoDB instance) fails, you're still online.
Now if you added a second MongoDB instance at the backup data center, a network partition would, despite seemingly higher redundancy, yield lower availability since we'd lose the majority in case of a network partition and can't continue to operate.
=> Less availability at higher cost. But that doesn't answer the question yet.
Let's say you're really worried about availability: You have two data centers, with backend servers in both datacenters, anycast IPs, the whole deal. Now the network between the two DCs is partitioned, but some clients connect to DC A while other reach DC B. How do you now determine which datacenter may accept writes? It's not possible - this is why the odd number is necessary.
You don't actually need Anycast IPs, BGP or any fancy stuff for the problem to become real, any writing application (like a worker, a stale request, anything) would require later merging different writes, which is a completely different concurrency scheme.
Use case: 100 Servers in a pool; I want to start a ZooKeeper service on each Server and Server applications (ZooKeeper client) will use the ZooKeeper cluster (read/write). Then there is no single point of failure.
Is this solution possible for this use case? What about the performance?
What if there are 1000 Servers in the pool?
If you are simply trying to avoid a single point of failure, then you only need 3 servers. In a 3 node ensemble, a single failure can be tolerated with the remaining 2 nodes forming the quorum. The more servers you have the worse write performance will be. And 100 servers is the extreme of this, if ZK can even handle it.
However, having that many clients is no problem at all. Zookeeper has active deployments with many more than 1000 clients. If you find that you need more servers to handle your read load, you can always add Observers. I highly recommend you join the list serve. It is an excellent way to quickly have your questions answered, and likely in much more detail than anyone will give you on SO.
Maybe zookeeper is not the right tool?
Hazelcast does what you want, I think. You can hundreds of peers, and if the master is lost a new one is elected from all the peers.
You don't need to use all of hazel cast. You can just use the maps, or just the worker pools, or just the synchronisation primitives, or just the messaging etc.