GKE cluster v1.2.4 - route recreation - kubernetes

I've created a 3-node GKE cluster which is running fine, but I've noticed a few times that my components are unable to reach the API server for 3 or 4 minutes.
I recently had the same problem again on a fresh new cluster, so I decided to look a bit closer. In the Compute Engine Operations section I noticed that, at the same time, the 3 routes had been removed and then recreated 4 minutes later. The task had been scheduled by a #cloudservices.gserviceaccount.com address, so presumably from the cluster itself.
What is causing this behavior, forcing the routes to be deleted and recreated seemingly at random?

The apiserver may become unreachable if it gets temporarily overloaded or if it is upgraded or repaired. This should be unrelated to routes being removed and recreated, although it's possible that the node manager does not behave correctly when it is restarted.
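If you want to confirm which operations touched the routes and which account requested them, the Compute Engine operations log can be queried from the CLI. A minimal sketch, assuming the gcloud CLI; the filter expression is an assumption about how the route operations are named:

# List recent Compute Engine operations that targeted routes,
# newest first, including the requesting account
$ gcloud compute operations list \
    --filter="targetLink~routes" \
    --sort-by=~insertTime \
    --limit=20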

Related

Prevent Kubernetes from scheduling everything to the same node

So I have 4 nodes currently, and Kubernetes, for some reason, decides to always schedule everything to the same node.
I'm not talking about replicas of the same deployment, so topologySpreadConstraints wouldn't apply there. In fact, when I scale up a deployment to several replicas, they get scheduled to different nodes. However, any new deployment and any new volume always go to the same node.
Affinity constraints also work: if I configure a pod to schedule only to a specific node (different from the usual one), it works fine. But anything else goes to the same node. Is this considered normal? The node is at 90% utilization, and even when it crashes completely, Kubernetes happily schedules everything onto it again.
Okay, so this was a very specific issue, and I'm not sure whether I actually resolved it, but it seems to be working now.
This was a cluster deployed on Hetzner, using the Hetzner cloud controller manager. I had been removing and adding nodes to the cluster and, as it turns out, I forgot to add the flag --cloud-provider=external to this node's kubelet.
This issue is pretty well known. It was specifically showing as a "missing prefix" event, so I never thought it was related.
To solve it, adding the flag and restarting the kubelet was not enough for me, so I had to drain the node, remove it from the cluster, rebuild it from scratch, and re-join it with the correct flag (see the commands sketched below). This not only fixed the "missing prefix" issue, but also seems to have solved the scheduling issue, though I'm not sure why.
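For anyone hitting the same thing, this is roughly what the drain/rebuild step looks like. A minimal sketch: the node name is hypothetical, and how the node re-joins (kubeadm, k3s, etc.) depends on how the cluster was built:

# Drain and remove the misconfigured node from the cluster
$ kubectl drain worker-3 --ignore-daemonsets --delete-emptydir-data
$ kubectl delete node worker-3

# On the rebuilt machine, make sure the kubelet is started with
#   --cloud-provider=external
# (e.g. via its systemd unit or config file) BEFORE it joins the cluster.

# After re-joining, verify the node registered and received its pod CIDR
$ kubectl get node worker-3 -o wide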

GCP - Kubernetes Node creation failure

All cluster resources were brought up, but: only 0 nodes out of 4 have registered; this is likely due to Nodes failing to start correctly; check VM Serial logs for errors, try re-creating the cluster or contact support if that doesn't work.
I've tried recreating it several times in different zones, figuring Google was doing maintenance. I've also tried with fewer and smaller nodes, with no luck.
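In case it helps anyone else, the "VM Serial logs" the error message mentions can be fetched without SSH. A minimal sketch using the gcloud CLI; the instance name and zone are hypothetical:

# Find the node VMs that GKE tried to create, then dump their serial console output
$ gcloud compute instances list --filter="name~gke-"
$ gcloud compute instances get-serial-port-output gke-mycluster-default-pool-abcd1234 \
    --zone us-central1-a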

Synchronize and rollback independent deployments in kubernetes

I have a k8s setup that contains 2 deployments, client and server, deployed from different images. Both deployments have replica sets behind them and liveness and readiness probes defined. The client communicates with the server via a k8s Service.
Currently, the deployment scripts for client and server are separate (separate YAML files applied via kustomization). Rollback works correctly for both parts independently, but consider the following scenario:
1. deployment is starting
2. both deployment configurations are applied
3. the k8s master starts replacing the server and client pods
4. the server pods start correctly, so the new replica set has all the new pods up and running
5. the client pods have an issue, so the old replica set is still running
In many cases this isn't a problem, because the client and server work independently, but there are situations where a breaking change to the server API is released and both client and server must be updated. In that case, if either of the two fails, both should be rolled back (it doesn't matter which one fails; both need to be rolled back to stay in sync).
Is there a way to achieve that in k8s? I've spent quite a lot of time searching for a solution, but everything I've found so far describes deploying/rolling back one thing at a time, which doesn't solve the issue above.
The problem you're describing is what Blue/Green deployments address.
Here is a good reference for Blue/Green deployments with k8s.
The basic idea is that you deploy the new version (the Green deployment) while keeping the previous version (the Blue deployment) up and running, and you only allow traffic to the new version once everything has come up fine.
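A minimal sketch of that traffic switch, assuming both Deployments carry a version label (blue/green) that their Services select on; all names and the kustomize overlay are hypothetical:

# Deploy the green client and server next to the running blue ones,
# then wait for BOTH rollouts to finish
$ kubectl apply -k overlays/green
$ kubectl rollout status deployment/server-green
$ kubectl rollout status deployment/client-green

# Only when both succeeded, flip the Services to the green pods;
# if either rollout failed, leave the selectors on blue
$ kubectl patch service server -p '{"spec":{"selector":{"app":"server","version":"green"}}}'
$ kubectl patch service client -p '{"spec":{"selector":{"app":"client","version":"green"}}}'

The switch is a single selector change per Service, so a broken client or server release never receives traffic, and "rolling back" both together is simply a matter of not flipping (or re-flipping) the selectors.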

kubectl get nodes hangs when I delete a node externally

I've been experimenting with Kubernetes/Rancher and encountered some unexpected behavior. Today I'm deliberately putting on my chaos monkey hat and learning how things behave when stuff fails.
Here's what I've done:
1) Using the Rancher UI, stand up a 3-node cluster on DigitalOcean.
Success -- a few minutes later I have a 3-node cluster, visible in Rancher.
2) Using the Rancher UI, delete a node in a 'happy' scenario, pushing the appropriate node-delete button in Rancher.
A few minutes later, I have a 2-node cluster. Great.
3) Using the DigitalOcean admin UI, delete a node in an 'oops' scenario, as if a sysadmin had accidentally deleted a node.
Back on the ranch (sorry), I go to view the state of the cluster in the Rancher UI.
Unfortunately, after three minutes I get a gateway timeout (detailed timeouts visible in the Chrome network inspector).
Here's what kubectl says:
$ kubectl get nodes
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
So, the question is: what happened here?
I was under the impression Kubernetes was 'self-healing', and that even if the node I deleted was the etcd leader, the cluster would eventually recover. It's been around 2 hours -- do I just need to wait longer?
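Not a full answer, but if you can still SSH into one of the surviving nodes you can check whether etcd kept quorum after the deletion. A rough sketch, assuming a Rancher/RKE-style cluster where etcd runs as a Docker container named etcd on the control-plane nodes (adjust for your provisioning method):

# On a surviving node: list etcd members and check cluster health
$ docker exec etcd etcdctl member list
$ docker exec etcd etcdctl endpoint health --cluster

# If the deleted machine is still listed as a member, removing it lets the
# remaining members form a clean cluster again
$ docker exec etcd etcdctl member remove <member-id>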

Pods are deleted regularly

I have a kubernetes cluster, created and managed from stackpoint.io.
The cluster is made of 1 master and 2 nodes, all running CoreOS.
I created 5 deployments with 1 replica each, and 5 services pointing to these deployments.
5 pods are created from these deployments.
My problem is: between 24 and 36 hours after the pods are created, they are automatically deleted and recreated.
Because these apps rely on each other to work, this means about 3-5 minutes of downtime before everything works correctly again.
I assume there is some configuration I don't know about that controls this behavior.
I tested using containers with the latest tag as well as pinned tags. I tested changing the imagePullPolicy from Always to IfNotPresent. I tested using an Ubuntu-based cluster instead of CoreOS. In every configuration the pods are deleted and recreated.
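To narrow down what is actually deleting the pods, the cluster events and the node ages are the first things worth checking. A small sketch; the namespace and pod name are hypothetical:

# Events normally record who deleted or evicted a pod (eviction, node replacement, rollout, ...)
$ kubectl get events --sort-by=.metadata.creationTimestamp -n default

# Owner references and status show whether the pod was evicted or its node went away
$ kubectl describe pod myapp-5d9f7c6b8-abcde -n default

# If the node ages line up with the 24-36 hour pattern, the nodes themselves
# are being recycled and the pods are just collateral
$ kubectl get nodes -o wide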