Flapping metrics in DC/OS dashboard after changing master nodes - apache-zookeeper

After changing two of the three master nodes in a DC/OS 1.8 cluster to a newer CoreOS version (one with a kernel that is patched against the Dirty COW vulnerability), the masters stopped working. The dashboard showed an empty data center.
We synchronized /var/lib/dcos from the old master to the two new master nodes. Then the dashboard started working again. The DC/OS dashboard still shows flapping metrics.
We have a mesos.leader and a zookeeper leader.
How can we stabilize the cluster?

Last time this happened to us we had to reinstall the cluster. I just finished stopping our master nodes one at a time to increase the disk size. We are now back in the flapping state. I think a reinstall is in our future. I'm searching for answers now to help avoid that.

Related

HA in k8s cluster

Let's imagine a situation: I have an HA cluster with 3 control plane nodes and a floating IP address as the control plane endpoint. The first node goes down: no problem, the IP switches to another node and the cluster carries on. The second node goes down, and the cluster becomes unavailable. So sad.
Question: is it possible to return the cluster to an available state once the failed nodes are back up?
My previous experience says no.
Thanks
Available cluster after nodes are back up
Yes.
It is possible to recover from 1, 2 or all 3 masters down.
Boot them.
Make sure the etcd cluster comes back up, and fix whatever issue is preventing it (disk full, expired certs, ...).
Then make sure kube-apiserver comes back up, followed by kube-controller-manager and kube-scheduler.
At that point, your kubelets should already be re-registering and workloads starting back up.
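The reason the cluster in the question stops responding after the second failure is that etcd needs a majority of its members to be healthy before it accepts writes. A minimal sketch of that arithmetic follows; the function names are made up for illustration and are not part of any etcd or Kubernetes API.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if enough members are healthy for etcd to accept writes."""
    return healthy >= quorum_size(members)


if __name__ == "__main__":
    # Walk a 3-member control plane from fully healthy down to fully down.
    for healthy in range(3, -1, -1):
        state = "available" if has_quorum(3, healthy) else "unavailable"
        print(f"3 members, {healthy} healthy: {state}")
```

With 3 members the quorum is 2, so losing one node is tolerated and losing two is not. Once enough members are booted again to reach the majority, etcd can serve requests again and the rest of the control plane can follow.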
If you use a managed Kubernetes cluster you don't have to worry about this. If you're running your own masters, you don't even need to worry about the floating IP: you just bring up new masters, join them to the existing master(s), and you're back up to fighting strength.

AWS EKS auto update side effects

I have two EKS clusters running control plane version 1.18 with 10 managed worker node groups. I am waiting on third-party vendors to provide support for some applications running on these clusters. I want to know what sort of outages will occur when AWS automatically updates the control plane to version 1.19. Will any outage occur, or will the worker nodes keep working despite being on an older version? I just do not want to be surprised by an outage when the automatic update occurs.

How to avoid downtime during scheduled maintenance window

I'm experiencing downtime whenever the GKE cluster gets upgraded during the maintenance window. My services (APIs) become unreachable for about 5 minutes.
The cluster Location type is set to "Zonal", and all my pods have 2 replicas. The only affected pods seem to be the ones using nginx ingress controller.
Is there anything I can do to prevent this? I read that using Regional clusters should prevent downtimes in the control plane, but I'm not sure if it's related to my case. Any hints would be appreciated!
You mention "downtime", but is this downtime for you using the control plane (i.e. kubectl stops working), or is it downtime in the sense that the end users of your services stop seeing them work?
A GKE upgrade upgrades two parts of the cluster: the control plane or master nodes, and the worker nodes. These are two separate upgrades although they can happen at the same time depending on your configuration of the cluster.
Regional clusters can help with that. They cost more because you are running more nodes, but the upside is that the cluster is more resilient.
Going back to the earlier point about control plane vs. node upgrades: the control plane upgrade does NOT affect the end-user/customer perspective. The services will remain running.
The node upgrade WILL affect the customer so you should consider various techniques to ensure high availability and resiliency on your services.
A common technique is to increase replicas and also to add pod anti-affinity. This ensures the pods are scheduled on different nodes, so that a node upgrade does not take out the entire service because the cluster happened to schedule all the replicas on the same node.
You mention the nginx ingress controller in your question. If you are using Helm to install it into your cluster, then out of the box it is not set up to use anti-affinity, so it is liable to be taken out of service if all of its replicas get scheduled onto the same node and that node then gets marked for upgrade or similar.
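The anti-affinity technique described above can be sketched as a small helper that builds the relevant part of a Deployment spec. This is only an illustration: the helper name, the `app` label key, and the `my-api` value are assumptions, and your pods must actually carry the label you select on.

```python
import json


def anti_affinity_patch(app_label: str, replicas: int = 2) -> dict:
    """Build a Deployment fragment that keeps replicas of one app on different nodes."""
    return {
        "spec": {
            "replicas": replicas,
            "template": {
                "spec": {
                    "affinity": {
                        "podAntiAffinity": {
                            # Hard rule: never co-locate two pods with this label.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {
                                        "matchLabels": {"app": app_label}
                                    },
                                    # One pod per node (node hostname label).
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    }
                }
            },
        }
    }


if __name__ == "__main__":
    # "my-api" is a placeholder label value, not taken from the question.
    print(json.dumps(anti_affinity_patch("my-api"), indent=2))
```

The hard `requiredDuringSchedulingIgnoredDuringExecution` form leaves a pod Pending when there are fewer eligible nodes than replicas; `preferredDuringSchedulingIgnoredDuringExecution` is the softer alternative if that is a concern.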

How to restart unresponsive kubernetes master in GKE

The kubernetes master in one of my GKE clusters became unresponsive last night following the infrastructure issue in us-central1-a.
Whenever I run "kubectl get pods" in the default namespace I get the following error message:
Error from server: an error on the server has prevented the request from succeeding
If I run "kubectl get pods --namespace=kube-system", I only see the kube-proxy and the fluentd-logging daemon.
I have tried scaling the cluster down to 0 and then scaling it back up. I have also tried downgrading and upgrading the cluster, but that seems to apply only to the nodes (not the master). Is there any GKE/K8S API command to issue a restart to the kubernetes master?
There is not a command that will allow you to restart the Kubernetes master in GKE (since the master is considered a part of the managed service). There is automated infrastructure (and then an oncall engineer from Google) that is responsible for restarting the master if it is unhealthy.
In this particular case, restarting the master had no effect on restoring it to normal behavior because Google Compute Engine Incident #16011 caused an outage on 2016-06-28 for GKE masters running in us-central1-a (even though that isn't indicated on the Google Cloud Status Dashboard). During the incident, many masters were unavailable.
If you had tried to create a GCE cluster using kube-up.sh during that time, you would have similarly seen that it would be unable to create a functional master VM due to the SSD Persistent disk latency issues.
I try to always keep at least one version upgrade available: if you upgrade the master, it restarts and works again within a few minutes. Otherwise you may have to wait around 3 days for the Google team to reboot it. They won't help you by e-mail or phone, and unless you have paid support (switching to which takes a few days), they won't care.

How can I protect my GKE cluster against master node failure?

In GKE every cluster has a single master endpoint, which is managed by Google Container Engine. Is this master node highly available?
I deploy a beautiful cluster of redundant nodes with kubernetes, but what happens if the master node goes down? How can I test this situation?
In Google Container Engine the master is managed for you and kept running by Google. According to the SLA for Google Container Engine the master should be available at least 99.5% of the time.
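To put the 99.5% figure in perspective, here is a small sketch that converts an availability percentage into the downtime it still permits. The 30-day period is an assumption made for the arithmetic, not something taken from the SLA text, and the function name is illustrative.

```python
def allowed_downtime_hours(availability_percent: float, period_days: float = 30) -> float:
    """Hours of downtime that still satisfy the given availability over the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_days * 24 * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Assumed 30-day period; the SLA itself defines the real measurement window.
    hours = allowed_downtime_hours(99.5)
    print(f"99.5% over 30 days allows about {hours:.1f} hours of downtime")
```

Under that assumption, 99.5% availability still leaves room for roughly 3.6 hours per 30 days during which the master may be unreachable.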
In addition to what Robert Bailey said about GKE keeping the master available for you, it's worth noting that Kubernetes / GKE clusters are designed (and tested) to continue operating properly in the presence of failures. If the master is unavailable, you temporarily lose the ability to change what's running in the cluster (i.e. schedule new work or modify existing resources), but everything that's already running will continue working properly.