If kube-proxy is down, the pods on a Kubernetes node will not be able to communicate with the external world. Does Kubernetes do anything special to guarantee the reliability of kube-proxy?
Similarly, how does Kubernetes guarantee the reliability of the kubelet?
It guarantees their reliability by:
Having multiple nodes: If one kubelet crashes, only one node goes down. Similarly, every node runs its own kube-proxy instance, so losing one node only means losing the kube-proxy instance on that node. Kubernetes is designed to handle node failures. And if you designed the application running on Kubernetes to be scalable, you will not run it as a single instance but as multiple instances - kube-scheduler will distribute your workload across multiple nodes - which means your application will still be accessible (a minimal example is sketched below).
Supporting a highly-available setup: If you set up your Kubernetes cluster in high-availability mode properly, there won't be one master node but several, which means you can tolerate losing some master nodes. The managed Kubernetes offerings of the cloud providers typically run the control plane in this highly-available fashion.
These are the first two things that come to mind. However, this is a broad question, so I can go into more detail if you elaborate on what you mean by "reliability".
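As a minimal sketch of the multiple-instances point above (the Deployment name "web", the image and the replica count are just placeholders):
# Run the workload as several replicas so the scheduler can spread them
# across nodes; losing one node then only removes a fraction of the pods.
kubectl create deployment web --image=nginx
kubectl scale deployment web --replicas=3
# Verify that the pods landed on different nodes.
kubectl get pods -l app=web -o wide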
Related
I want to understand the possible impact of a master node failure in a k8s cluster that has only one master node with an internal etcd store.
As per my understanding, all deployed workload containers (including stateless workloads and StatefulSets with persistent volume claims) running on worker nodes would keep running until a container needs to be recreated, since they have no direct functional dependency on the master node or the etcd store for their core functions. The unavailability of the master node only affects control-plane operations for the cluster.
Is my understanding correct? If not, could you please explain the impact of the master node failure on my workload running on that cluster?
I understand that the best way to achieve HA for a k8s cluster is to set up a multi-master cluster, possibly with externalized etcd stores to decouple them as well. This question is about understanding the exact impact of a master node failure so I can make an informed call before configuring a multi-master cluster.
etcd operates on a quorum system, so as long as the cluster retains a majority it will continue operating. If the failed node was the current leader, the others trigger an election after the heartbeat timeout.
For kube-apiserver, it's a horizontally scaled service, so losing a node is not interesting - just like any other web app. Some (most) controllers are singletons, but they run on every control plane node and use kube-apiserver for leader election, so as with etcd, if the leader dies then a few seconds later another copy will grab the leader lock and take over.
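On recent clusters you can watch this leader election directly by inspecting the coordination leases the controllers hold (older versions store the lock in annotations on Endpoints/ConfigMaps instead, and exact lease names can vary by distribution):
# The singleton controllers record their current leader in Lease objects.
kubectl -n kube-system get leases
# Typically shows kube-controller-manager and kube-scheduler with a holder identity.
kubectl -n kube-system get lease kube-controller-manager -o yaml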
From what I've read about Kubernetes, if the master(s) die, the workers should still be able to function as normal (https://stackoverflow.com/a/39173007/281469), although no new scheduling will occur.
However, I've found this to not be the case when the master can also schedule worker pods. Take a 2-node cluster, where one node is a master and the other a worker, and the master has the taints removed:
If I shut down the master and docker exec into one of the containers on the worker I can see that:
nc -zv ip-of-pod 80
succeeds, but
nc -zv ip-of-service 80
fails half of the time. The Kubernetes version is v1.15.10, using iptables mode for kube-proxy.
I'm guessing that since the kube-proxy on the worker node can't connect to the apiserver, it will not remove the master node's endpoints from the iptables rules.
Questions:
Is it expected behaviour that kube-proxy won't stop routing to pods on master nodes, or is there something "broken"?
Are any workarounds available for this kind of setup to allow the worker nodes to still function correctly?
I realise the best thing to do is separate the CP nodes but that's not viable for what I'm working on at the moment.
Is it expected behaviour that kube-proxy won't stop routing to pods on master nodes, or is there something "broken"?
Are any workarounds available for this kind of setup to allow the worker nodes to still function correctly?
The cluster master plays the role of decision maker for the various activities on the cluster's nodes: scheduling workloads, managing the workloads' lifecycle, scaling, and so on. Each node is managed by the master components and contains the services necessary to run Pods; these services typically include kube-proxy, a container runtime and the kubelet.
The kube-proxy component enforces network rules on nodes and helps Kubernetes manage connectivity between Pods and Services. It also acts as a load-balancing controller that keeps watching the Kubernetes API server and continually updates the node's iptables subsystem accordingly.
In simple terms, only the master node is aware of everything, and it is in charge of building the list of routing rules based on node additions, deletions, etc. kube-proxy acts as a kind of enforcer: it checks in with the master, syncs that information and enforces the rules on the node.
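If you want to see the rules kube-proxy has programmed (in iptables mode), you can inspect the nat table directly on a node; the chain names below are the ones kube-proxy conventionally creates:
# On a node, list the Service-related NAT chains kube-proxy maintains.
sudo iptables -t nat -L KUBE-SERVICES -n | head
# Rules for individual ClusterIPs and endpoints live in KUBE-SVC-*/KUBE-SEP-* chains.
sudo iptables -t nat -S | grep KUBE-SEP | head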
If the master node (API server) is down, the cluster will not be able to respond to API commands or schedule new workloads. If no other master node is available, there is nobody left to instruct the worker nodes to change their work allocation, so they continue executing the operations that were scheduled earlier, until the master comes back and gives different instructions. In line with this, kube-proxy will also be unable to sync the latest rules from the master, but it does not stop routing: it continues to handle networking and routing using the iptables rules that were in place before the master went down, so network communication to your pods keeps working, provided all the pods on the worker nodes are still up and running.
A single-master architecture is not a preferred deployment architecture for production. Given that resilience and reliability are among the major goals of Kubernetes, the recommended best practice is an HA cluster architecture that avoids a single point of failure.
Once you remove the taints, the Kubernetes scheduler no longer needs any tolerations to schedule pods on your master node. So it is as good as a worker node with control plane components running on it, and you can also run your workload pods on this node (although that is not a recommended practice).
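For reference, removing the control-plane taint looks roughly like this (the taint key depends on your Kubernetes version, and <node-name> is a placeholder):
# Older versions use the node-role.kubernetes.io/master taint, newer ones
# node-role.kubernetes.io/control-plane; the trailing "-" removes the taint.
kubectl taint nodes <node-name> node-role.kubernetes.io/master:NoSchedule-
kubectl taint nodes <node-name> node-role.kubernetes.io/control-plane:NoSchedule-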
Kube-proxy (https://kubernetes.io/docs/concepts/overview/components/#kube-proxy) is the component deployed on all nodes of the cluster, and it handles the networking and routing to your pods. So even if your master node is down, kube-proxy still works fine on the worker node and will route traffic to the pods running there.
If all your pods are running on worker nodes (which are still up and running), then kube-proxy will continue to route traffic to them, including via Services.
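One way to check which pod IPs a Service is actually routing to (and whether a pod on the downed master is still listed) is to look at the Service's endpoints; <service-name> and the label selector are placeholders:
# The Endpoints object lists the pod IP:port pairs the Service load-balances over.
kubectl get endpoints <service-name> -o wide
# Compare with where the pods are actually running.
kubectl get pods -o wide -l <your-label-selector>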
There is nothing inherent in Kubernetes that would cause this. The master node role is essentially just a label for humans, and once you've removed the taints the nodes are just normal nodes. That said, the usual rules about scheduling and resource requests still apply, so if your pods don't all fit, some won't be scheduled. It's possible your Kubernetes deployment system set up more specialized firewall rules or similar around the control plane nodes, but that would depend on that system.
I have played around a little bit with Docker and Kubernetes. I need some advice here: is it a good idea to have one Pod on a VM with all of these deployed as multiple (hybrid) containers?
This is our POC plan:
Customers access the system through an nginx reverse proxy at a public API endpoint, e.g. abc.xyz.com or def.xyz.com
List of containers that we need:
Identity server, connected to SQL Server
Our API server with Hangfire, connected to SQL Server
The API server that connects to the Redis server
Redis, which in turn has 3 agents with Hangfire, load-balanced (scalable in the future)
Set up 1 or 2 VMs?
Combination of Windows and Linux Containers, is that advisable?
How many Pods per VM? How many containers per Pod?
Should we attach volumes for DB?
Thank you for your help
Cluster size can differ depending on the Kubernetes platform you want to use. For managed solutions like GKE/EKS/AKS you don't need to create master nodes yourself, but you have less control over your cluster and may not always be able to use the latest Kubernetes version.
It is safer to have at least 2 worker nodes. (More is better). In case of node failure, pods will be rescheduled on another healthy node.
I'd say Linux containers are more lightweight and have less overhead, but it's up to you to decide what to use.
The number of pods per VM is determined during the scheduling process by kube-scheduler and depends on the pods' requested resources and the amount of resources available on the cluster nodes.
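To get a feel for how much room the scheduler has on a node (and what is already requested), you can describe it; <node-name> is a placeholder:
# Shows Capacity, Allocatable and the resource requests/limits already placed on the node.
kubectl describe node <node-name>
# Look at the "Allocatable" and "Allocated resources" sections of the output.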
All data inside a Pod's running containers is lost after a pod restart/deletion. You can import/restore DB content during pod startup using init containers (or DB replication), or configure volumes to preserve data across pod restarts.
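A minimal sketch of requesting a volume for the database, assuming your cluster has a default StorageClass (the claim name and size are placeholders):
# Create a PersistentVolumeClaim and mount it into the DB pod so data
# survives pod restarts and rescheduling.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
EOF
# The pod spec then references it under volumes/volumeMounts, e.g.:
#   volumes:
#   - name: data
#     persistentVolumeClaim:
#       claimName: db-data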
You can easily decide which container you need to put in the same Pod if you look at your application set from the perspective of scaling, updating and availability.
If you can benefit from scaling and updating application parts independently, and from having several replicas of crucial parts of your application, it's better to put them in separate Deployments. If the application parts must always run on the same node and it's fine to restart them all at once, you can put them in one Pod.
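As a rough illustration of the "separate Deployments" option (deployment names and images are placeholders), each part can then be scaled and rolled out on its own:
# Each component gets its own Deployment so it can scale and update independently.
kubectl create deployment identity-server --image=<your-identity-image>
kubectl create deployment api-server --image=<your-api-image>
kubectl scale deployment api-server --replicas=3
# Containers that must share a node and lifecycle would instead be listed
# together under a single Pod template's "containers:" array.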
Do worker nodes in a multi-master setup talk to the apiserver on the master nodes via the load balancer? It seems the cluster is aware of the active apiserver endpoints via the endpoint reconciler, so I would think the logical and HA way is for the worker nodes to talk directly to the active endpoints they know of. But according to the official documentation/diagram (https://kubernetes.io/docs/admin/high-availability/building/), the worker nodes go through the load balancer. Doesn't this mean that if, for whatever reason, the load balancer becomes unavailable, your worker nodes will also malfunction?
When your kubelet starts, it needs to connect to the apiserver. The location of the apiserver is provided as a configuration option and in most cases will be a non-changing domain name pointing to a load balancer. You cannot rely on a ClusterIP-based Service for core Kubernetes components like the kubelet or kube-proxy, as you would essentially run yourself into a chicken-and-egg situation / introduce additional dependencies.
Any reasonable environment should have a dependable load balancer, and if it is down, odds are that quite a lot of other things are down as well (also keep in mind that in many cases Kubernetes will survive temporary inaccessibility of the control plane).
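On a kubeadm-provisioned node, for example, you can see that the kubelet talks to one fixed apiserver address (usually the load balancer's DNS name in an HA setup); the file path below assumes kubeadm defaults:
# The kubelet's kubeconfig carries the apiserver endpoint it connects to.
grep 'server:' /etc/kubernetes/kubelet.conf
# e.g.  server: https://my-lb.example.com:6443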
I am working on writing some automation to setup a Kubernetes Cluster. The automation deploys the Kubernetes Master and once that is setup, it starts adding Minions in parallel. What is the most efficient way to determine programmatically if a Minion has joined the Kubernetes Cluster?
Currently I am querying the REST endpoint /api/v1/nodes exposed by the Kubernetes API server. My concern is that as the size of the cluster increases, querying the API server to pull details about all the minions may become compute and I/O intensive for the API server. I also did not find paging support in this API.
Thanks,
Sufian
You should look into kube-register (https://github.com/kelseyhightower/kube-register). It uses fleet to register minions as they spin up. You should probably run it as a systemd unit so it starts on boot. Then, for status, let the API server do its thing and keep polling. Most clusters probably wouldn't be larger than 9 main nodes (you can have plenty of worker nodes; I recommend looking at CoreOS's etcd docs regarding clustering) due to etcd's latency constraints in its RAFT quorum, so I wouldn't worry too much about the size of the cluster.
This is a mix between an answer and a comment on the other answer (I cannot comment yet, sorry...).
As far as I know, using the REST endpoint /api/v1/nodes is the best way to check whether nodes are registered. How often do you call that endpoint? I wouldn't expect compute or I/O problems very quickly.
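If you only need to confirm that a specific minion has joined and is Ready, something like the following works (the node name is a placeholder); note also that newer Kubernetes versions do support paginated list requests via the limit/continue query parameters, exposed in kubectl as --chunk-size:
# Check whether a single node has registered and reports Ready.
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# On newer clusters, large node lists can be fetched in pages.
kubectl get nodes --chunk-size=50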
kube-register was a useful tool to register new CoreOS nodes to the Kubernetes cluster, but it is not needed anymore, since the kubelet now registers itself.
I think there is some misunderstanding in the other answer. I think you are talking about 2 different clusters:
the etcd cluster: CoreOS recommends running 3, 5 or 7 etcd instances in a cluster (https://coreos.com/etcd/docs/latest/admin_guide.html#cluster-management). On the remaining nodes you can configure etcd to run as a proxy (https://coreos.com/etcd/docs/latest/proxy.html). This should solve your etcd connection problem.
the kubernetes cluster: here you typically run 1 master and x "worker" nodes, just as you already do.
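To sanity-check the etcd side, etcdctl can report membership and health (the commands below assume the etcd v3 CLI and that it is already pointed at your endpoints and certificates):
# List the members of the etcd cluster and check each endpoint's health.
etcdctl member list
etcdctl endpoint health --cluster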