Can k8s work with an even number of master nodes

I know it is recommended to have an odd number of master nodes. But will k8s work if we have an even number of nodes? And what are the downsides?
The reason I'm asking is that I'm building an IoT cluster where every node is a master node. All devices are identical, and any device must be able to take over the master role if the current master fails.
Also, the number of devices could be anything, so the system should work with both odd and even numbers of nodes.

https://discuss.kubernetes.io/t/high-availability-host-numbers/13143/2 says that you should avoid ever having more than 7 master nodes because of the overhead of the membership algorithms, so depending on how many IoT devices you have, you should consider a different architecture.
Nodes are supposed to be abstracted away from their purpose, so your user (worker) nodes shouldn't also need to be the system (control-plane) nodes; tying the two together might introduce tight-coupling problems later on.
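For what it's worth, Kubernetes itself will start with an even number of control-plane members, but etcd (which backs the control plane) only commits a write once a strict majority of members has acknowledged it, so an even-sized cluster tolerates no more member failures than the next smaller odd-sized one. A minimal sketch of that arithmetic (plain Python, nothing Kubernetes-specific):

```python
# Sketch: failure tolerance of an etcd/control-plane cluster of a given size.
# Quorum is a strict majority, so an even member count adds no extra failure
# tolerance over the next-lower odd count.

def quorum(members: int) -> int:
    """Smallest strict majority of `members`."""
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    """How many members can fail while a quorum still survives."""
    return members - quorum(members)

for n in range(1, 8):
    print(f"{n} members: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")

# 3 and 4 members both tolerate only 1 failure, 5 and 6 both tolerate 2:
# the extra even member only adds replication and voting overhead.
```

So an even number of IoT devices will still form a cluster; the fourth (or sixth) master simply buys you nothing in availability.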

Related

Kubernetes: Disadvantages of an all-master cluster

Hi!
I was wondering if it would be possible to replicate a VMware-style architecture in Kubernetes.
What I mean by that:
Instead of having the control plane always separated from the worker nodes, I would like to put them all together, so that in the end we would have a cluster of master nodes on which we can also schedule applications. For now I'm using Kata Containers with containerd, so all applications are deployed in 'mini' VMs and there isn't the 'escape from the container' problem. The management of the cluster would be done through a dedicated interface (eth0, 1 Gb). Users would communicate with the apps deployed in the cluster through another interface (eth1, 10 Gb). I would use Keepalived and HAProxy to elect my 'Main Master' and load balance the traffic.
The question might be 'why would you do that?'. Well, to ensure high availability at all times and to reduce the management overhead: instead of having two sets of "entities" to manage (the control plane and the worker nodes), simply reduce them to one. That way there wouldn't be problems such as 'fewer than 50% of my masters are online, so no leader can be elected', where I would have to remove master nodes from the cluster until more than 50% of the remaining masters are online; that requires technical intervention, as fast as possible, and might result in human errors, etc.
Another positive point would be scaling: instead of having two parts of the cluster to scale (masters and workers), there would be only one; I would just add another master/worker to the cluster and that's it. All the management traffic would be directed to the 'Main Master', which holds a Virtual IP (VIP), and in case of overload requests would be redirected to another node.
In the end I would have something resembling this:
Photo - Architecture VMWare-like
I'm trying to find the disadvantages of this kind of architecture. I know that there would be etcd traffic on each node, but how impactful is that? I know that there will be wasted resources for the control-plane pods on each node, but given that these pods (except etcd) won't do much besides waiting, how impactful would that be? With every node capable of taking the master role, there wouldn't be any downtime. Right now, if my control plane (3 masters) goes down, I have to reboot the masters or find a solution as fast as possible before there's a problem with one of the apps that run on the worker nodes.
The topology I'm using right now resembles the following:
Architecture basic Kubernetes
I'm new to Kubernetes, so the question might seem stupid, but I would really like to know the advantages/disadvantages of the two approaches and understand why it wouldn't be a good idea.
Thanks a lot for any help!
There are two reasons for keeping control planes on their own. The big one is that you only want a small number of etcd nodes, usually 3 or 5, and that's usually the bounding factor on the size of the control plane; you usually want the ability to scale worker nodes independently of that. The second issue is that etcd is very sensitive to IOPS brownouts and can suffer bad cascading failures if the machine runs low on IOPS.
And given that you are running things on top of VMware anyway, the overhead of managing 3 vs 6 VMs is not generally a difference in kind. This seems like false savings in the long run.
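On the IOPS point: etcd fsyncs its write-ahead log on every commit, so a node with a slow or contended disk drags the whole control plane down. A rough sketch of how you might spot-check a candidate disk before putting etcd on it (this is just a timing probe under assumed parameters, not etcd's own benchmarking tooling):

```python
# Rough sketch: time small write+fsync cycles, similar in spirit to etcd's
# WAL fsync pattern. The path and sample count are arbitrary assumptions.
import os
import statistics
import time

PATH = "/tmp/fsync-probe.bin"   # place this on the disk etcd would actually use
SAMPLES = 200

durations_ms = []
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
try:
    for _ in range(SAMPLES):
        start = time.perf_counter()
        os.write(fd, b"x" * 4096)   # one small WAL-sized write
        os.fsync(fd)                # force it to stable storage
        durations_ms.append((time.perf_counter() - start) * 1000)
finally:
    os.close(fd)
    os.remove(PATH)

durations_ms.sort()
p99 = durations_ms[int(len(durations_ms) * 0.99) - 1]
print(f"median fsync: {statistics.median(durations_ms):.2f} ms, p99: {p99:.2f} ms")
# etcd's guidance is a p99 WAL fsync latency below roughly 10 ms; a node that
# can't sustain that under load is a poor etcd member.
```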

ActiveMQ Artemis cluster failover questions

I have a question regarding Apache Artemis clustering with message grouping. This is also done in Kubernetes.
The current setup I have is 4 master nodes and 1 slave node. Node 0 is dedicated as LOCAL to handle message grouping and node 1 is the dedicated backup to node 0. Nodes 2-4 are REMOTE master nodes without backup nodes.
I've noticed that clients connected to nodes 2-4 are not failing over to the 3 other available master nodes when the connected Artemis node goes down; they essentially never discover the other nodes. Even after the original node comes back up, the client continues to fail to establish a connection. I've seen from a separate Stack Overflow post that master-to-master failover is not supported. Does this mean that for every master node I need to create a slave node as well to handle failover? Would this cause a two-instance point of failure instead of however many nodes are within the cluster?
On a separate basic test using a cluster of two nodes, one master and one slave, I've observed that when I bring down the master node the clients are connected to, the client doesn't fail over to the slave node. Any ideas why?
As you note in your question, failover is only supported between a live and a backup. Therefore, if you wanted failover for clients which were connected to nodes 2-4 then those nodes would need backups. This is described in more detail in the ActiveMQ Artemis documentation.
It's worth noting that clustering and message grouping, while technically possible, is a somewhat odd pairing. Clustering is a way to improve overall message throughput using horizontal scaling. However, message grouping naturally serializes message consumption for each group (to maintain message order) which then decreases overall message throughput (perhaps severely depending on the use-case). A single ActiveMQ Artemis node can potentially handle millions of messages per second. It may be that you don't need the increased message throughput of a cluster since you're grouping messages.
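To make the serialization point concrete: to keep a group's messages in order, the broker has to pin every message of that group to a single consumer, so a hot group can never be drained faster than one consumer can work, no matter how many nodes or consumers you add. A purely illustrative sketch of that pinning logic (not Artemis internals):

```python
# Illustrative sketch of group-to-consumer pinning (not Artemis code).
# Every message in a group goes to the same consumer, so consumption within
# a group is serial regardless of cluster size.
from collections import defaultdict

consumers = ["consumer-0", "consumer-1", "consumer-2"]
group_assignments = {}           # group id -> pinned consumer
dispatched = defaultdict(list)   # consumer -> messages, in arrival order

def dispatch(group_id: str, message: str) -> None:
    # The first message of a group picks a consumer; later ones must follow it.
    consumer = group_assignments.setdefault(
        group_id, consumers[hash(group_id) % len(consumers)]
    )
    dispatched[consumer].append((group_id, message))

for i in range(6):
    dispatch("orders-42", f"msg-{i}")   # one hot group

# All six messages land on a single consumer; adding consumers doesn't help.
print({consumer: len(msgs) for consumer, msgs in dispatched.items()})
```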
I've often seen users simply assume they need a cluster to deal with their expected load without actually conducting any performance benchmarking. This can potentially lead to higher costs for development, testing, administration, and (especially) hardware, and in some use-cases it can actually yield worse performance. Please ensure you've thoroughly benchmarked your application and broker architecture to confirm the proposed design.

Kubernetes: why would you need more than 2 nodes?

Given a K8s cluster (a managed cluster, for example AKS) with 2 worker nodes, I've read that if one node fails, all the pods will be restarted on the second node.
Why would you need more than 2 worker nodes per cluster in this scenario? You always have the possibility to select the number of nodes you want, and the more you select, the more expensive it gets.
It depends on the solution that you are deploying in the Kubernetes cluster and the kind of high availability that you want to achieve.
If you want to work in an active-standby mode, where, if one node fails, the pods are moved to the other node, two nodes work fine (as long as the single surviving node has the capacity to run all the pods).
Some databases / stateful applications, for instance, need a minimum of three replicas so that you can reconcile a mismatch/conflict in data after a network partition (i.e. you can pick the content held by two of the replicas).
For instance, etcd needs 3 replicas.
If whatever you are building needs only two nodes, then you don't need more than 2. If you are building anything big, where the amount of compute and memory needed is much larger, then instead of opting for expensive nodes with huge CPU and RAM you can join more and more lower-priced nodes to the cluster. This is called horizontal scaling.
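One way to see the trade-off: to survive a single node failure, the remaining nodes together must still fit the whole workload, so the larger the cluster, the less spare headroom each node needs to carry. A back-of-the-envelope sketch (the workload figure is made up):

```python
# Back-of-the-envelope sketch: spare capacity needed to survive one node
# failure, as a function of cluster size. The workload figure is hypothetical.
workload_cores = 32   # total CPU the pods need

for nodes in (2, 3, 5, 10):
    # The whole workload must fit on (nodes - 1) machines after one failure.
    cores_per_node = workload_cores / (nodes - 1)
    idle_fraction = 1 - workload_cores / (nodes * cores_per_node)
    print(f"{nodes:>2} nodes: ~{cores_per_node:.1f} cores each, "
          f"{idle_fraction:.0%} idle headroom in normal operation")

# With 2 nodes each must be able to run everything (50% idle when healthy);
# with 10 smaller nodes the per-node headroom drops to about 10%.
```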

How to avoid loss of a master's internal state during failover to a new master during a network partition

I was trying to implement a simple system with a single master node and multiple backup nodes, to learn about distributed and fault-tolerant architecture.
Currently this is what my system looks like:
1. N identical nodes, with 1 master node running a simple webserver.
2. All nodes communicate with each other using a simple heartbeat protocol, and each maintains a global view of the cluster (count of available nodes, who the master is, each node's uptime and downtime).
3. If any node does not hear from the master for some set time, it raises an alarm. If a consensus is reached that the master is down, a new master is elected.
4. If the network of nodes gets partitioned:
   - If the master is in the minority partition, it stops serving requests and goes down by itself after a set period of time. The minority group cannot elect a master (a minimum number of nodes is required to make that decision).
   - A new master gets elected in the majority partition after a set time of not hearing from the old master.
Now I am stuck with a problem: in step 4 above, there is a time window where the old master is still serving requests while the new master is being elected in the majority partition.
This can cause inconsistent data across the system if some client writes new data to the old master. How do we avoid this issue? I would be glad if someone points me in the right direction.
Rather than accepting writes to the minority master, what you want is to simply reject writes to the old master in that case, and you can do so by attempting to verify its mastership with a majority of the cluster on each write. If the master is on the minority side of a partition, it will no longer be able to contact a majority of the cluster and so will not be able to acknowledge clients’ requests. This brief period of unavailability is preferable to losing acknowledged writes in quorum based systems.
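A minimal sketch of that check, assuming hypothetical peer handles that expose an acknowledges_leader() call: before acknowledging a write, the master confirms its leadership with a strict majority of the cluster (counting itself) and rejects the write if it cannot.

```python
# Minimal sketch: a master only acknowledges a write after re-confirming its
# leadership with a majority of the cluster. The peer transport is hypothetical.

class NotLeaderError(Exception):
    pass

class Master:
    def __init__(self, node_id: str, peers: list, term: int):
        self.node_id = node_id
        self.peers = peers     # hypothetical handles to the other nodes
        self.term = term
        self.log = []

    def _confirmed_by_majority(self) -> bool:
        cluster_size = len(self.peers) + 1       # the peers plus this node
        votes = 1                                # this node counts itself
        for peer in self.peers:
            try:
                # Hypothetical call: the peer answers True if it still sees
                # this node as leader for this term. Unreachable peers don't vote.
                if peer.acknowledges_leader(self.node_id, self.term):
                    votes += 1
            except ConnectionError:
                pass
        return votes >= cluster_size // 2 + 1    # strict majority

    def write(self, entry) -> None:
        # On the minority side of a partition this check fails, so the stale
        # master rejects the write instead of accepting data that will be lost.
        if not self._confirmed_by_majority():
            raise NotLeaderError("lost quorum; refusing to acknowledge write")
        self.log.append(entry)
        # (A real Raft-style system would also replicate the entry to a
        # majority before acknowledging the client.)
```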
You should read the Raft paper. You're slowly moving towards an implementation of the Raft protocol, and it will probably answer many of the questions you might have along the way.

Single Kubernetes/OpenShift cluster/instance across datacenters?

With the understanding that Ubernetes is designed to fully solve this problem, is it currently possible (not necessarily recommended) to span a single K8s/OpenShift cluster across multiple internal corporate datacenters?
Additionally assuming that latency between data centers is relatively low and that infrastructure across the corporate data centers is relatively consistent.
Example: Given 3 corporate DC's, deploy 1..* masters at each datacenter (as a single cluster) and have 1..* nodes at each DC with pods/rc's/services/... being spun up across all 3 DC's.
Has someone implemented something like this as a stopgap solution before Ubernetes drops, and if so, how has it worked, and what considerations should be taken into account when running like this?
is it currently possible (not necessarily recommended) to span a single K8s/OpenShift cluster across multiple internal corporate datacenters?
Yes, it is currently possible. Nodes are given the address of an apiserver and client credentials and then register themselves into the cluster. Nodes don't know (or care) whether the apiserver is local or remote, and the apiserver allows any node to register as long as it has valid credentials, regardless of where the node exists on the network.
Additionally assuming that latency between data centers is relatively low and that infrastructure across the corporate data centers is relatively consistent.
This is important, as many of the settings in Kubernetes assume (either implicitly or explicitly) a high bandwidth, low-latency network between the apiserver and nodes.
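If you do try this, it's worth measuring the round-trip time from each datacenter to the apiserver before committing to the design. A rough probe might look like the following (the hostname is a placeholder; 6443 is the usual secure apiserver port):

```python
# Rough sketch: measure TCP connect round-trip time to the apiserver from a
# candidate node in another datacenter. Host and sample count are placeholders.
import socket
import statistics
import time

APISERVER_HOST = "apiserver.example.internal"   # placeholder
APISERVER_PORT = 6443                            # default secure apiserver port
SAMPLES = 20

rtts_ms = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    with socket.create_connection((APISERVER_HOST, APISERVER_PORT), timeout=2):
        pass                                     # connect only, then close
    rtts_ms.append((time.perf_counter() - start) * 1000)
    time.sleep(0.1)

print(f"median RTT: {statistics.median(rtts_ms):.1f} ms, max: {max(rtts_ms):.1f} ms")
# Consistently high or spiky numbers suggest the low-latency assumption above
# doesn't hold between these datacenters.
```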
Example: Given 3 corporate DC's, deploy 1..* masters at each datacenter (as a single cluster) and have 1..* nodes at each DC with pods/rc's/services/... being spun up across all 3 DC's.
The downside of this approach is that if you have one global cluster you have one global point of failure. Even if you have replicated, HA master components, data corruption can still take your entire cluster offline. And a bad config propagated to all pods in a replication controller can take your entire service offline. A bad node image push can take all of your nodes offline. And so on. This is one of the reasons that we encourage folks to use a cluster per failure domain rather than a single global cluster.