Kubernetes: Disadvantages of an all-master cluster

Hi!
I was wondering whether it would be possible to replicate a VMware-style architecture in Kubernetes.
What I mean by that:
Instead of always keeping the control plane separate from the worker nodes, I would like to put them all together, so that we end up with a cluster of master nodes on which we can also schedule applications. For now I'm using Kata Containers with containerd, so all applications are deployed in 'mini' VMs and there isn't the 'escape from the container' problem. The management of the cluster would be done through a dedicated interface (eth0, 1Gb). Users would communicate with the apps deployed in the cluster through another interface (eth1, 10Gb). I would use Keepalived and HAProxy to elect my 'Main Master' and load-balance the traffic.
The question might be 'why would you do that?'. Well, to ensure high availability at all times and to reduce the management overhead: instead of having two sets of "entities" to manage (the control plane and the worker nodes), I would have only one. That way there wouldn't be problems such as 'fewer than 50% of my masters are online, so no leader can be elected', where I have to remove master nodes from the cluster until more than 50% of the remaining masters are online. That requires a technical intervention as fast as possible, which might result in human errors, etc.
Another positive point would be scaling: instead of having two parts of the cluster to scale (masters and workers) there would be only one. I would add another master/worker to the cluster and that's it. All the management traffic would be directed to the Main Master, which uses a virtual IP (VIP), and in case of overload the requests would be redirected to another node.
In the end I would have something resembling this:
[Image: VMware-like architecture]
I'm trying to find the disadvantages of this kind of architecture. I know that there would be etcd traffic on each node, but how much impact does it have? I know that resources will be wasted on the control-plane pods on each node, but given that these pods (except etcd) won't do much besides waiting, how much impact would that have? With every node capable of taking the master role, there wouldn't be any downtime. Right now, if my control plane (3 masters) goes down, I have to reboot the masters or find a solution as fast as possible, before there's a problem with one of the apps that run on the worker nodes.
The topology I'm using right now resembles the following :
[Image: basic Kubernetes architecture]
I'm new to Kubernetes, so the question might seem stupid, but I would really like to know the advantages and disadvantages of the two and understand why this wouldn't be a good idea.
Thanks a lot for any help!

There are two reasons for keeping the control plane on its own nodes. The big one is that you only want a small number of etcd nodes, usually 3 or 5, and that is usually the bounding factor on the size of the control plane. You usually want the ability to scale worker nodes independently of that. The second issue is that etcd is very sensitive to IOPS brownouts and can suffer bad cascading failures if the machine runs low on IOPS.
And given that you are running on top of VMware anyway, managing 3 versus 6 VMs is not generally a difference in kind. This looks like a false saving in the long run.


Proper Fault-tolerant/HA setup for KeyDB/Redis in Kubernetes

Sorry for the long post, but I hope it will save us some clarifying questions. I also added some diagrams to break up the wall of text; I hope you'll like those.
We are in the process of moving our current solution to a local Kubernetes infrastructure, and what we are currently investigating is the proper way to set up a KV store (we've been using Redis for this) in K8s.
One of the main use cases for the store is giving processes exclusive ownership of resources via a simple version of a distributed lock pattern, as in the (discouraged) pattern here. (More on why we are not using Redlock below.)
And once again, we are looking for a way to set it up in K8s so that the details of the HA setup are opaque to clients. Ideally, the setup would look like this:
So what is the proper way to set up Redis for this? Here are the options that we considered:
First of all, we discarded Redis Cluster, because we don't need sharding of the keyspace. Our keyspace is rather small.
Next, we discarded the Redis Sentinel setup, because with Sentinel clients are expected to connect to a specific Redis node, so we would have to expose all nodes. We would also have to give each node some identity (distinct ports, etc.), which contradicts the idea of a K8s Service. Even worse, we would have to check that all our (heterogeneous) clients support the Sentinel protocol and properly implement all that fiddling.
Somewhere around here we ran out of options for the first time. We thought about using regular Redis replication, but without Sentinel it's unclear how to set things up for fault tolerance in case of master failure: there seems to be no auto-promotion for replicas, and no (easy) way to tell K8s that the master has changed, except maybe by inventing a custom K8s operator, but we are not that desperate (yet).
So we came to the idea that Redis may not be very cloud-friendly, and started looking for alternatives. That is how we found KeyDB, which has promising additional modes, on top of an impressive performance boost and a 100% compatible API.
So here are the options that we considered with KeyDB:
Active replication with just two nodes. This would look like this:
This setup looks very promising at first: it is simple and clear, and even the official KeyDB docs recommend it as the preferred HA setup, superior to a Sentinel setup.
But there's a caveat. While the docs claim this setup tolerates split-brain (because the nodes catch up with one another after connectivity is re-established), it would ruin our use case, because two clients would be able to lock the same resource id:
And there's no way to tell K8s that one node is OK and the other is unhealthy, because both nodes have lost their replicas.
Well, it's clear that a setup with an even number of nodes cannot be made split-brain tolerant, so the next thing we considered was KeyDB 3-node multi-master, which allows each node to be an (active) replica of multiple masters:
OK, things got more complicated, but it seems that this setup is split-brain proof:
Note that we had to add more stuff here:
health check — to consider a node that lost all its replicas as unhealthy, so K8s load balancer would not route new clients to this node
WAIT 1 command for SET/EXPIRE — to ensure that we are writing to a healthy split (preventing case when client connects to unhealthy node before load balancer learns it's ill).
And this is when a sudden thought struck: what about consistency? Neither of these setups with multiple writable nodes provides any guard against two clients both locking the same key on different nodes!
Redis and KeyDB both use asynchronous replication, so there seems to be no guarantee that an (exclusive) SET that succeeds as a command will not be overwritten by another SET with the same key issued on another master a split second later.
Adding WAITs does not help here, because WAIT only covers spreading information from a master to its replicas, and seems to have no effect on these overlapping waves of overwrites spreading from multiple masters.
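To make the race concrete, here is a toy model of two writable masters that replicate asynchronously. It is only a sketch and not KeyDB code: the `Master` class, its methods, and the "newest timestamp wins" merge rule are assumptions chosen for illustration.

```python
# Toy model of two writable masters with asynchronous replication.
# Hypothetical names throughout; the last-write-wins merge is an assumption.

class Master:
    def __init__(self, name):
        self.name = name
        self.data = {}      # key -> (owner, timestamp)
        self.outbox = []    # local writes not yet shipped to the peer

    def set_nx(self, key, owner, ts):
        """Emulate SET key owner NX: succeed only if the key is absent locally."""
        if key in self.data:
            return False
        self.data[key] = (owner, ts)
        self.outbox.append((key, owner, ts))
        return True

    def replicate_to(self, peer):
        """Ship pending writes; the peer keeps the entry with the newest timestamp."""
        for key, owner, ts in self.outbox:
            current = peer.data.get(key)
            if current is None or ts > current[1]:
                peer.data[key] = (owner, ts)
        self.outbox.clear()


a, b = Master("a"), Master("b")

# Both clients take the "lock" before either write has been replicated.
got_a = a.set_nx("lock:res1", "client-1", ts=1)
got_b = b.set_nx("lock:res1", "client-2", ts=2)

# Replication catches up afterwards, in both directions.
a.replicate_to(b)
b.replicate_to(a)

print(got_a, got_b)             # both acquisitions were reported as successful
print(a.data["lock:res1"][0])   # owner recorded on master a after convergence
print(b.data["lock:res1"][0])   # owner recorded on master b after convergence
```

In this model both clients are told they hold the lock, and after the nodes converge only one of them is still recorded as the owner. A WAIT issued by either client would only confirm that its own write reached a replica; it would not reveal the competing write.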
Okay, this is actually the distributed lock problem, and both Redis and KeyDB give the same answer: use the Redlock algorithm. But it seems to be too complex for us:
It requires client to communicate with multiple nodes explicitly (and we'd like to not do that)
These nodes have to be independent, which is rather bad, because we are using Redis/KeyDB not only for this locking case, and we'd still like to have a reasonably fault-tolerant setup, not 5 separate nodes.
So, what options do we have? Both Redlock explanations start from a single-node version, which is OK if the node never dies and is always available. That is surely not the case, but we are willing to accept the problems explained in the section "Why failover-based implementations are not enough", because we believe failovers would be quite rare, and we think that we fall under this clause:
Sometimes it is perfectly fine that under special circumstances, like during a failure, multiple clients can hold the lock at the same time. If this is the case, you can use your replication based solution.
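For reference, the single-node version mentioned above comes down to a `SET` with `NX` and an expiry to acquire, and a compare-and-delete script to release. A minimal sketch, assuming a redis-py style client object is passed in; the function names `acquire` and `release` and the key naming are hypothetical:

```python
import secrets

# Delete the key only if it still holds our token, so that we never release
# a lock that has expired and been taken over by another client.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""


def acquire(client, resource, ttl_ms=30000):
    """Try to take the lock; return the owner token, or None if it is held."""
    token = secrets.token_hex(16)
    # SET resource token NX PX ttl_ms
    if client.set(resource, token, nx=True, px=ttl_ms):
        return token
    return None


def release(client, resource, token):
    """Release the lock only if we still own it; return True on success."""
    return client.eval(RELEASE_SCRIPT, 1, resource, token) == 1
```

With redis-py this would be used roughly as `token = acquire(redis.Redis(host=...), "lock:res1")`, where the host would be whatever single address the clients are given. Which node sits behind that address after a failover is exactly the open question here.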
So, having said all of this, let me finally get to the question: how do I set up a fault-tolerant "replication-based solution" with KeyDB in Kubernetes that has a single write node most of the time?
If it's a regular 'single master, multiple replicas' setup (without 'auto'), what mechanism would promote a replica when the master fails, and what mechanism would tell Kubernetes that the master node has changed? And how? By re-assigning labels on pods?
Also, what would restore a previously dead master node in such a way that it does not become a master again, but a replica of the substitute master?
Do we need some K8s operator for this? (The ones I found were not smart enough to do this.)
Or, if it's multi-master active replication from KeyDB (as in my last picture above), I'd still need something other than a load-balanced K8s Service to route all clients to a single node at a time, and then again some mechanism to switch this 'actual master' role in case of failure.
And this is where I'd like to ask for your help!
I've found frustratingly little info on the topic, and it does not seem that many people face the problems we do. What are we doing wrong? How do you cope with Redis in the cloud?

Can k8s work with an even number of master nodes

I know it is recommended to have an odd number of master nodes. But will k8s work if we have an even number of nodes? And what are the downsides?
The reason I'm asking is that I'm building an IoT cluster where every node is a master node. All devices are the same, and any device must be able to take over the master role if the current master fails.
Also, the number of devices could be anything, so the system should work with both odd and even numbers of nodes.
https://discuss.kubernetes.io/t/high-availability-host-numbers/13143/2 says that you should avoid ever having more than 7 master nodes, due to the overhead of the membership algorithms. So, depending on how many IoT nodes you have, you should consider a different architecture.
Nodes are supposed to be abstracted away from their purpose, so you shouldn't need your user nodes to be the system nodes, and doing so might introduce tight-coupling problems later on.
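On the odd/even part of the question: the control plane's etcd store needs a majority of its members to be reachable, so the arithmetic can be sketched as below. The function names are only illustrative.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

An even-sized cluster works, but going from 3 to 4 members (or from 5 to 6) raises the quorum without raising the number of failures that can be tolerated, so the extra member adds replication overhead without adding availability.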

schedule kubernetes pods on different physical server

In my cluster there are 30 VMs, which are hosted on 3 different physical servers. I want to deploy the replicas of each workload on different physical servers.
I know I can use podAntiAffinity to place replicas on different VMs, but I can't find any way to guarantee that replicas are spread across different physical servers.
Is there any way to solve this?
I believe you gave the answer ;)
I went to the Kubernetes Patterns book (the PDF is available for free) to see if there was something related to this, and found exactly that:
To express how Pods should be spread to achieve high availability, or be packed and co-located together to improve latency, Pod affinity and antiaffinity can be used.
Node affinity works at node granularity, but Pod affinity is not limited to nodes and
can express rules at multiple topology levels. Using the topologyKey field, and the
matching labels, it is possible to enforce more fine-grained rules, which combine
rules on domains like node, rack, cloud provider zone, and region [...]
I really like the k8s docs as well. They are very complete and full of examples, so maybe you can get some ideas from there. I think the main idea will be to create your own affinity/anti-affinity rule.
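As a rough sketch of such a rule: if each VM (node) carries a label naming the physical server it runs on, that label can be used as the `topologyKey` of a pod anti-affinity term. The label key `example.com/physical-host` and the app label `my-app` are made-up names; the nodes would have to be labelled first, for example with `kubectl label node <vm-name> example.com/physical-host=server-1`. The fragment is built as a Python dict and printed as JSON, which can be merged into a pod template.

```python
import json

# Hypothetical node label that records which physical server hosts the VM.
PHYSICAL_HOST_LABEL = "example.com/physical-host"


def anti_affinity(app_label: str, topology_key: str = PHYSICAL_HOST_LABEL) -> dict:
    """Build an `affinity` fragment that forbids two pods carrying the same
    app label from sharing a value of the given topology key."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": topology_key,
                }
            ]
        }
    }


affinity_fragment = {"affinity": anti_affinity("my-app")}
print(json.dumps(affinity_fragment, indent=2))
```

Note that a required rule like this allows at most one replica per physical server, so with 3 servers a fourth replica would stay pending.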
----------------------------------- EDIT -----------------------------------
There is a new feature within k8s version 1.18 that may be a better solution.
It's called: Pod Topology Spread Constraints:
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
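A minimal sketch of what such a constraint could look like for this case, again built as a Python dict and printed as JSON. It assumes the nodes carry a made-up label `example.com/physical-host` identifying the physical server, and that the workload's pods are labelled `app: my-app`.

```python
import json

# Hypothetical node label that records which physical server hosts the VM.
PHYSICAL_HOST_LABEL = "example.com/physical-host"


def spread_constraint(app_label: str, max_skew: int = 1) -> dict:
    """One topologySpreadConstraints entry that spreads matching pods evenly
    across the distinct values of the physical-host label."""
    return {
        "maxSkew": max_skew,
        "topologyKey": PHYSICAL_HOST_LABEL,
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


pod_spec_fragment = {"topologySpreadConstraints": [spread_constraint("my-app")]}
print(json.dumps(pod_spec_fragment, indent=2))
```

With `maxSkew: 1`, the number of matching pods on any two physical servers may differ by at most one, so more replicas than servers are still allowed, as long as they stay balanced.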

Multiple node pools vs single pool with many machines vs big machines

We're moving all of our infrastructure to Google Kubernetes Engine (GKE) - we currently have 50+ AWS machines with lots of APIs, Services, Webapps, Database servers and more.
As we have already dockerized everything, it's time to start moving everything to GKE.
I have a question that may sound too basic, but I've been searching the Internet for a week and did not find any reasonable post about this.
Straight to the point, which of the following approaches is better and why:
1) Having multiple node pools with multiple machine types and always specifying in which pool each deployment should run; or
2) Having a single pool with lots of machines and letting the Kubernetes scheduler do the job, without worrying about where my deployments will run; or
3) Having BIG machines (in multiple zones to improve the cluster's availability and resilience) and letting Kubernetes deploy everything there.
The following considerations are to be taken merely as hints; I do not pretend to describe best practice.
Each node you add brings some overhead with it, but you gain flexibility and availability, making the failure and maintenance of nodes less impactful on production.
Nodes that are too small would waste a lot of resources, since sometimes it will not be possible to schedule a pod even if the total amount of free RAM or CPU across the nodes would be enough. You can see this issue as similar to memory fragmentation.
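A tiny illustration of that fragmentation effect, with made-up numbers: four nodes with 3 GiB free each have 12 GiB free in total, yet a pod requesting 4 GiB fits on none of them.

```python
def fits_anywhere(free_per_node, request):
    """A pod must fit on a single node; free capacity cannot be pooled."""
    return any(free >= request for free in free_per_node)


free_ram_gib = [3, 3, 3, 3]   # hypothetical free RAM on four small nodes
request_gib = 4               # hypothetical pod memory request

total_is_enough = sum(free_ram_gib) >= request_gib
schedulable = fits_anywhere(free_ram_gib, request_gib)

print(total_is_enough)   # True: 12 GiB free in total
print(schedulable)       # False: no single node has 4 GiB free
```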
I guess that the sizes of the pods and their memory and CPU requests are not similar, but I do not see this as a big issue in principle, nor as a reason to go for 1). I do not see why a big pod should run only on big machines and a small one should be scheduled on small nodes. I would rather use 1) if you need a different memory-GB/CPU-cores ratio to support different workloads.
I would advise you to run some tests in the initial phase to understand the size of the biggest pod and the average size of the workload, in order to choose the machine types properly. Consider that having one pod that exactly fits one node and is assigned to it is not the right way to proceed (virtual machines exist for this kind of scenario), since fragmentation of resources would easily make it impossible to schedule a large pod.
Consider that the size of your pods will likely increase in the future, and that scaling vertically is not always immediate: you need to switch off machines and terminate pods. I would oversize a bit to take this into account, since scaling horizontally is much easier.
As for the machine type, you can decide to go for a machine 5x the size of the biggest pod you have (or 3x? or 10x?). Oversize the number of nodes in the cluster a bit as well, to take into account overheads and fragmentation and to still have free resources.
Remember that you have a hard limit of 100 pods per node and 5000 nodes.
Remember that in GCP the network egress throughput cap depends on the number of vCPUs that a virtual machine instance has. Each vCPU has a 2 Gbps egress cap for peak performance, and each additional vCPU increases the network cap, up to a theoretical maximum of 16 Gbps per virtual machine.
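Taking the figures quoted above at face value (2 Gbps per vCPU, capped at 16 Gbps per VM; they may differ by machine family or have changed since), the cap can be estimated like this. The function name is just for illustration.

```python
def egress_cap_gbps(vcpus: int, per_vcpu_gbps: int = 2, max_gbps: int = 16) -> int:
    """Estimated egress cap for one VM, using the per-vCPU and per-VM
    figures quoted in the text as defaults."""
    return min(vcpus * per_vcpu_gbps, max_gbps)


for vcpus in (1, 2, 4, 8, 16, 32):
    print(vcpus, egress_cap_gbps(vcpus))
```

Under these numbers the cap stops growing at 8 vCPUs, so several mid-sized machines offer more aggregate egress than one very large one with the same total vCPU count.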
Regarding the prices of the virtual machines, notice that there is no difference in price between buying two machines of size x and one of size 2x. Avoid customising the size of machines, because it is rarely convenient; if you feel your workload needs more CPU or memory, go for a HighMem or HighCpu machine type.
P.S. Since you are going to build a pretty big cluster, check the size of the DNS.
I will add any further considerations that come to mind. Consider updating your question in the future with a description of the path you chose and the issues you faced.
1) makes a lot of sense: if you want, you can still let kube deployments treat everything as one large pool (by not adding a nodeSelector or node affinity), but you can also have machines of different sizes, a pool of spot instances, and so on. And, after all, you can have pools that are tainted and thus excluded from normal scheduling and available only to a particular set of workloads. In my opinion it is preferable to gain some proficiency with this approach from the very beginning, although with many provisioners it should be very easy to migrate from 2) to 1) anyway.
2) As explained above, this is effectively a subset of 1), so it is better to build up experience with the 1) approach from day one. But if you make sure your provisioning solution supports an easy extension to the 1) model, you can get away with starting with this simplified approach.
3) Big is nice, but "big" is relative. It depends on your requirements and the amount of your workloads. Remember that while you need to plan for the loss of a whole AZ anyway, it will be much more frequent to lose single nodes (reboots, decommissioning of underlying hardware, updates, etc.), so if you have more hosts, the impact of losing one will be smaller. The bottom line is that you need to find your own balance, one that makes sense for your particular scale. Maybe 50 nodes is too many; would 15 cut it? Who knows but you :)
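As a sketch of option 1), this is roughly how a workload could be pinned to one pool and allowed onto its tainted nodes. GKE labels nodes with `cloud.google.com/gke-nodepool`; the pool name `highmem-pool` and the taint `dedicated=highmem:NoSchedule` are made-up examples. The fragment is built as a Python dict and printed as JSON for merging into a pod template.

```python
import json

# GKE sets this label on every node to the name of its node pool.
NODE_POOL_LABEL = "cloud.google.com/gke-nodepool"


def pin_to_pool(pool_name: str, taint_key: str, taint_value: str) -> dict:
    """Pod spec fragment that selects one node pool and tolerates its taint."""
    return {
        "nodeSelector": {NODE_POOL_LABEL: pool_name},
        "tolerations": [
            {
                "key": taint_key,
                "operator": "Equal",
                "value": taint_value,
                "effect": "NoSchedule",
            }
        ],
    }


# Hypothetical pool name and taint, for illustration only.
fragment = pin_to_pool("highmem-pool", "dedicated", "highmem")
print(json.dumps(fragment, indent=2))
```

Deployments that carry neither the selector nor the toleration keep landing on the untainted pools, which is what makes 2) a special case of 1).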

Single Kubernetes/OpenShift cluster/instance across datacenters?

With the understanding that Ubernetes is designed to fully solve this problem, is it currently possible (not necessarily recommended) to span a single K8/OpenShift cluster across multiple internal corporate datacenters?
Additionally, assume that latency between data centers is relatively low and that infrastructure across the corporate data centers is relatively consistent.
Example: Given 3 corporate DC's, deploy 1..* masters at each datacenter (as a single cluster) and have 1..* nodes at each DC with pods/rc's/services/... being spun up across all 3 DC's.
Has someone implemented something like this as a stopgap solution before Ubernetes drops, and if so, how has it worked, and what should be taken into account when running like this?
is it currently possible (not necessarily recommended) to span a
single K8/OpenShift cluster across multiple internal corporate
datacenters?
Yes, it is currently possible. Nodes are given the address of an apiserver and client credentials and then register themselves into the cluster. Nodes don't know (or care) whether the apiserver is local or remote, and the apiserver allows any node to register as long as it has valid credentials, regardless of where the node sits on the network.
Additionally assuming that latency between data centers is relatively
low and that infrastructure across the corporate data centers is
relatively consistent.
This is important, as many of the settings in Kubernetes assume (either implicitly or explicitly) a high bandwidth, low-latency network between the apiserver and nodes.
Example: Given 3 corporate DC's, deploy 1..* masters at each
datacenter (as a single cluster) and have 1..* nodes at each DC with
pods/rc's/services/... being spun up across all 3 DC's.
The downside of this approach is that if you have one global cluster you have one global point of failure. Even if you have replicated, HA master components, data corruption can still take your entire cluster offline. And a bad config propagated to all pods in a replication controller can take your entire service offline. A bad node image push can take all of your nodes offline. And so on. This is one of the reasons that we encourage folks to use a cluster per failure domain rather than a single global cluster.