kubelet only chooses the first api-server given, causing all services to become unavailable - kubernetes

One of the kubelet's start parameters is
-api-servers=[]: List of Kubernetes API servers for publishing events, and reading pods and services. (ip:port), comma separated.
It appears to be designed for API server HA: as long as one of the API servers is alive, everything should keep working.
But I found that the kubelet only chooses the first API server, even if I give it 3 API servers. If the first API server is stopped, all the services become unavailable.
The version I used is:
Kubernetes v1.2.1
Is there any way to avoid this issue? Hopefully I am just using it the wrong way; otherwise I may try to fix it in the kubelet.
Any comments are appreciated.

This is expected.
In short, the current model for HA expects load balancing (e.g., gcplb/elb/nginx/haproxy) in front of the apiserver, so that node components don't have to be aware of multiple apiservers. However, it's recognized that there is a need to pass multiple apiserver endpoints to Kubernetes components, and this is slated to be fixed in Kubernetes v1.4.
See the detailed discussions in https://github.com/kubernetes/kubernetes/issues/18174
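As a rough illustration of the failover that a load balancer performs on behalf of the node components, here is a minimal sketch that walks a list of API server addresses and returns the first one that passes a health check. The function name `pick_api_server`, the injected `is_healthy` callable, and the example addresses are all hypothetical; this is not kubelet code.

```python
from typing import Callable, Iterable, Optional


def pick_api_server(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes the health check, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except OSError:
            # Treat connection errors as "unhealthy" and try the next one.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical addresses; the first API server is simulated as down.
    servers = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
    down = {"10.0.0.1:8080"}
    print(pick_api_server(servers, lambda ep: ep not in down))
```

The health check is passed in as a parameter so the selection logic can be exercised without a real cluster; in practice it would be whatever probe the load balancer is configured with.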

Related

Kubernetes-services load balancing

I have read this question which is very similar to what I am asking, but still wanted to write a new question since the accepted answer there seems very incomplete and also potentially wrong.
Basically, it seems like there is some missing or contradictory information regarding built-in load balancing for regular Kubernetes Services (I am not talking about LoadBalancer services). For example, the official Cilium documentation states that "Kubernetes doesn't come with an implementation of Load Balancing". In addition, I couldn't find any information in the official Kubernetes documentation about load balancing for internal services (there was only a section discussing this under ingresses).
So my question is - how does load balancing or distribution of requests work when we make a request from within a Kubernetes cluster to the internal address of a Kubernetes service?
I know there's a Kubernetes proxy on each node that creates the DNS records for such services, but what about services that span multiple pods and nodes? There's got to be some form of request distribution or load-balancing, or else this just wouldn't work at all, no?
A standard Kubernetes Service provides basic load balancing. Even for a ClusterIP-type Service, the Service has its own cluster-internal IP address and DNS name, and forwards requests to the collection of Pods specified by its selector: field.
In normal use, it is enough to create a multiple-replica Deployment, set a Service to point at its Pods, and send requests only to the Service. All of the replicas will receive requests.
The documentation discusses the implementation of internal load balancing in more detail than an application developer normally needs. Unless your cluster administrator has done extra setup, you'll probably get simple round-robin or random request routing, depending on the kube-proxy mode: with round-robin, the first Pod will receive the first request, the second Pod the second, and so on.
... the official Cilium documentation states ...
This is almost certainly a statement about external load balancing. From a cluster administrator's (not a programmer's) point of view, a "plain" Kubernetes installation doesn't include an external load-balancer implementation, and a LoadBalancer-type Service behaves identically to a NodePort-type Service.
There are obvious deficiencies to round-robin scheduling, most notably if you do wind up having individual network requests that take a long time and a lot of resources to service. As an application developer, the best way to address this is to make these very-long-running requests run asynchronously: return something like an HTTP 201 Created status with a unique per-job URL, and do the actual work in a separate queue-backed worker.
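To make the round-robin behaviour concrete, here is a toy simulation that hands requests to Pods in strict rotation and counts how many each one receives. The Pod names and the `round_robin` helper are hypothetical; a real Service is implemented by kube-proxy rules on each node, not by application code like this.

```python
from collections import Counter
from itertools import cycle
from typing import Dict, List


def round_robin(pods: List[str], request_count: int) -> Dict[str, int]:
    """Assign requests to pods in strict rotation and count them per pod."""
    if not pods:
        raise ValueError("a Service needs at least one ready Pod")
    rotation = cycle(pods)
    counts = Counter(next(rotation) for _ in range(request_count))
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical Pod names behind one Service.
    print(round_robin(["pod-a", "pod-b", "pod-c"], 7))
    # → {'pod-a': 3, 'pod-b': 2, 'pod-c': 2}
```

Each Pod ends up with roughly the same number of requests regardless of how expensive each request is, which is the deficiency described above.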

Which Kubernetes modules communicate directly with etcd?

I was trying to understand how exactly the Kubernetes modules interact with etcd. I understand that the Kubernetes modules are stateless by themselves and keep their state in etcd. But I am confused about how the modules interact with etcd. I see conflicting texts on this: some say all etcd interactions happen through the apiserver, and others say all the modules interact with etcd.
I am looking into the possibility of changing the etcd endpoint and restarting the integration points so that they can work with a new etcd instance.
I do not have time to look into the code to understand this part, so I am hoping that someone here can help me with this.
If a Kubernetes component wants to communicate with etcd, it must know the endpoint of etcd.
If you check the spec config of these components, you will find the correct answer: only the API server talks directly to etcd.
All other Kubernetes components, such as the kubelet, kube-proxy, scheduler, controllers, etc., interact with etcd through the API server. They don't talk to etcd directly.
If you change the etcd endpoint, the same change should be made in the API server configuration.
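The API server receives its etcd endpoints through the `--etcd-servers` flag. As a minimal sketch, assuming you have the API server's argument list in hand (for example, read from wherever your installation keeps it), the update could look like this. The helper name `set_etcd_servers` and the example addresses are hypothetical.

```python
from typing import List


def set_etcd_servers(args: List[str], endpoints: List[str]) -> List[str]:
    """Return a copy of the API server arguments with --etcd-servers replaced."""
    flag = "--etcd-servers="
    new_value = flag + ",".join(endpoints)
    updated = [new_value if a.startswith(flag) else a for a in args]
    if new_value not in updated:
        # The flag was absent, so append it.
        updated.append(new_value)
    return updated


if __name__ == "__main__":
    # Hypothetical argument list and etcd addresses.
    current = ["kube-apiserver", "--etcd-servers=https://10.0.0.10:2379"]
    print(set_etcd_servers(current, ["https://10.0.0.20:2379"]))
```

Only the API server needs this change and a restart; the other components keep talking to the API server as before.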

Ignite ReadinessProbe

Deploying an Ignite cluster within Kubernetes, I came across an issue that prevents cluster members from joining the group. If I use a readinessProbe and a livenessProbe, even with a delay as low as 10 seconds, the nodes never join each other. If I remove those probes, they find each other just fine.
So, my question is: can you use these probes to monitor node health, and if so, what are appropriate settings? On top of that, what would be good, fast health checks for Ignite, anyway?
Update:
After posting on the ignite mailing list, it looks like StatefulSets are the way to go. (Thanks Dmitry!)
I think I'm going to leave in the below logic to self-heal any segmentation issues although hopefully it won't be triggered often.
Original answer:
We are having the same issue and I think we have a workable solution. The Kubernetes discovery SPI lists services as they become ready.
This means that if there are no ready pods at startup time, the Ignite instances all think that they are the first and create their own grid.
The cluster should be able to self-heal if we have a deterministic way to fail pods that aren't part of an 'authoritative' grid.
In order to do this, we keep a reference to the TcpDiscoveryKubernetesIpFinder and use it to periodically check the list of Ignite pods.
If the instance is part of a cluster that doesn't contain the alphabetically first IP in the list, we know we have a segmented topology. Killing the pods that get into that state should cause them to come up again, look at the service list and join the correct topology.
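The check itself is small. The real implementation would live in Java next to the TcpDiscoveryKubernetesIpFinder; the following is only a language-neutral sketch of the decision, with a hypothetical function name and plain lists of address strings standing in for the IP finder's output and the node's current topology.

```python
from typing import Iterable


def is_segmented(service_ips: Iterable[str], cluster_ips: Iterable[str]) -> bool:
    """True if this node's cluster lacks the alphabetically first service IP."""
    listed = sorted(service_ips)
    if not listed:
        # Nothing is listed yet, so there is nothing to compare against.
        return False
    return listed[0] not in set(cluster_ips)


if __name__ == "__main__":
    # Hypothetical addresses: the service lists two pods,
    # but this node's grid only contains one of them.
    print(is_segmented(["10.0.0.5", "10.0.0.2"], ["10.0.0.5"]))
```

The ordering is plain string ordering, so "10.0.0.10" sorts before "10.0.0.2". That is fine for this purpose as long as every node applies the same rule, since the goal is only to agree on one deterministic reference address.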
I am facing the same issue, using Ignite embedded within a Java Spring application.
As you said, the readinessProbe: on the Kubernetes Deployment (spec.template.spec.containers) has the side effect of preventing the Kubernetes Pods from being listed as Endpoints on the related Kubernetes Service.
Without any readinessProbe, it does indeed seem to work better (the Ignite nodes all join the same Ignite cluster).
Yet this has the undesired side effect of exposing the Kubernetes Pods when they are not yet ready, as Spring has not yet fully started ...

What happens when the Kubernetes master fails?

I've been trying to figure out what happens when the Kubernetes master fails in a cluster that only has one master. Do web requests still get routed to pods if this happens, or does the entire system just shut down?
According to the documentation for OpenShift 3, which is built on top of Kubernetes (https://docs.openshift.com/enterprise/3.2/architecture/infrastructure_components/kubernetes_infrastructure.html), if a master fails, nodes continue to function properly, but the system loses its ability to manage pods. Is this the same for vanilla Kubernetes?
In typical setups, the master nodes run both the API and etcd and are either largely or fully responsible for managing the underlying cloud infrastructure. When they are offline or degraded, the API will be offline or degraded.
In the event that they, etcd, or the API are fully offline, the cluster ceases to be a cluster and is instead a bunch of ad-hoc nodes for this period. The cluster will not be able to respond to node failures, create new resources, move pods to new nodes, etc., until both of the following are true:
Enough etcd instances are back online to form a quorum and make progress (for a visual explanation of how this works and what these terms mean, see this page).
At least one API server can service requests
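For the first condition, the arithmetic is simple: a quorum is a strict majority of the etcd members. A minimal sketch, with hypothetical helper names:

```python
def quorum_size(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if enough members are healthy to form a majority."""
    return healthy >= quorum_size(members)


if __name__ == "__main__":
    # Columns: cluster size, quorum size, failures tolerated.
    for n in (1, 3, 5):
        print(n, quorum_size(n), n - quorum_size(n))
```

A single-member cluster tolerates no failures, a three-member cluster tolerates one, and a five-member cluster tolerates two.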
In a partially degraded state, the API server may be able to respond to requests that only read data.
However, in any case, life for applications will continue as normal unless nodes are rebooted or there is a dramatic failure of some sort during this time, because TCP/UDP services, load balancers, DNS, the dashboard, etc. should all continue to function for at least some time. Eventually, these things will all fail on different timescales. In single-master setups or complete API failure, DNS failure will probably happen first as caches expire (on the order of minutes, though the exact timing is configurable; see the coredns cache plugin documentation). This is a good reason to consider a multi-master setup: DNS and service routing can continue to function indefinitely in a degraded state, even if etcd can no longer make progress.
There are actions that you could take as an operator which would accelerate failures, especially in a fully degraded state. For instance, rebooting a node would cause DNS queries, and in fact probably all pod and service networking functionality, to fail until at least one master comes back online. Restarting DNS pods or kube-proxy would also be bad.
If you'd like to test this out yourself, I recommend kubeadm-dind-cluster, kind or, for more exotic setups, kubeadm on VMs or bare metal. Note: kubectl proxy will not work during API failure, as that routes traffic through the master(s).
A Kubernetes cluster without a master is like a company running without a manager.
No one other than the manager (the master node) can instruct the workers (the k8s components); even you, the owner of the cluster, can only instruct the manager.
Everything works as usual until the work is finished or something stops the workers (because the master node died after assigning the work).
As there is no manager to re-assign any work to them, the workers will wait and wait until the manager comes back.
The best practice is to assign multiple managers (masters) to your cluster.
Although your data plane and running applications do not immediately start breaking, there are several scenarios where cluster admins will wish they had a multi-master setup. The key to understanding the impact is understanding which components talk to the master, for what and how, and, more importantly, when they will fail if the master fails.
Although your application pods running on the data plane will not be impacted immediately, imagine a very possible scenario: your traffic suddenly surges and your Horizontal Pod Autoscaler kicks in. The autoscaling would not work, because Metrics Server collects resource metrics from kubelets and exposes them in the Kubernetes API server through the Metrics API for use by the Horizontal Pod Autoscaler and Vertical Pod Autoscaler (but your API server is already dead). If your pod's memory shoots up because of the high load, the pod will eventually be killed by the OOM killer. If any of the pods die, the controller manager and scheduler, which talk to the API server to watch the current state of pods, will fail too. In short, a new pod will not be scheduled and your application may stop responding.
One thing to highlight is that Kubernetes system components communicate only with the API server. They don't talk to each other directly, so I guess their own functionality could fail as well. An unavailable control plane can mean several things: failure of any or all of these components (API server, etcd, kube-scheduler, controller manager) or, worst of all, a crash of the entire node.
If the API server is unavailable, no one can use kubectl, as generally all commands talk to the API server. This means you cannot connect to the cluster or log in to any pods to check anything on the container file system, and you will not be able to see application logs unless you have an additional centralized log management system.
If the etcd database fails or gets corrupted, your entire cluster state data is gone and the admins will want to restore it from backups as early as possible.
In short, a failed single-master control plane may not immediately impact your traffic-serving capability, but it cannot be relied on for serving your traffic.

kube2sky in kubernetes with multiple api servers

In a Kubernetes cluster where everything is highly available, the DNS is a key piece of the system; everything relies on the DNS.
The kube2sky pod has a parameter "-kube_master_url" where, afaik, you can only specify one API server node.
You might have multiple API servers for redundancy behind a service, but if the one that kube2sky is using goes down, the whole DNS system goes down too; hence, the high availability of the cluster is gone.
For other pods, you can use the internal DNS name of the api server service, but in this case, you can't since this is the actual DNS service.
Any idea how to solve this issue?
In its standard configuration, kube2sky doesn't actually rely on having a single apiserver IP address to use. Instead, it uses the virtual IP of the kubernetes service that gets auto-created in every cluster, and which the kube-proxy sets up iptables rules for. It's briefly explained in the docs on github.
Also, it's recommended that replicated masters are put behind a load balancer in such high-availability configurations to avoid problems like this with client tools.