Master Kubernetes nodes offline in GKE (multiple clusters and projects) - kubernetes

This morning we noticed that all Kubernetes clusters in all projects (2 projects, 2 clusters per project) showed unavailable / ERROR in the Google Cloud Console.
The status dashboard shows no current issues: https://status.cloud.google.com/
It basically looks like the master nodes are down: the API does not respond and the clusters cannot be edited in the UI. Before the weekend everything was up, and since at least yesterday evening they have all shown as red.
The deployed services fortunately still respond, but we cannot manage the clusters in any way.
I reported it here too:
https://issuetracker.google.com/issues/172841082
Did anyone else encounter this, and is there any way to restart the master node or trigger a restart? I cannot edit the cluster, so an upgrade is not possible either.
I read elsewhere that only SRE folks from Google can (re)start them.
It's beyond me how this can happen.
By the way, auto-repair is set to on, and I followed the troubleshooting page; basically every path leads to the same conclusion: master node down, nothing to be done.
Any help would be greatly appreciated, or simply an SRE doing a start-node action ;).
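For what it's worth, the cluster state can also be checked from the command line, which helps when the console UI itself is unresponsive; a minimal sketch, with PROJECT_ID as a placeholder:
gcloud container clusters list --project PROJECT_ID
The STATUS column should show something like ERROR or DEGRADED instead of RUNNING for the affected clusters.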

Thank you #dany L, it was indeed a billing issue.
I'm surprised there is no message about this in the Cloud Console and that one has to go to Billing specifically to find out.
After billing was fixed, it took a few minutes before the clusters became available again; then everything was back to normal.
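If anyone hits something similar, a quick way to check whether billing is the culprit is to query the project's billing state from the CLI; a minimal sketch, assuming the beta billing commands are available (PROJECT_ID is a placeholder):
gcloud beta billing projects describe PROJECT_ID
If the output shows billingEnabled: false, the project has been unlinked from its billing account or the account was suspended, which matches the situation described above.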

Related

How to avoid congestion when using Kubernetes pods as Jenkins slaves

Our use case is pretty simple; however, I haven't found a solution for it yet.
In the organization I'm working at, we decided to move to Kubernetes as our container manager in order to spin up slaves.
Until we moved to this kind of environment, we used to have dedicated slaves for each team. Each got the resources it needed and, on that basis, things worked.
However, when we moved to Kubernetes, issues started to appear because each team shares the same pool of resources, which can lead to congestion or job failures.
The suggested solution was to create a Kubernetes cluster per team; however, this would burn out the teams involved with the maintenance of multiple clusters.
Searching online, I didn't find any available solution, hence I'm asking here: what is the best way to approach this? I understand that we might need to implement a dispatcher, but currently that is not possible in the way the Kubernetes plugin is developed.
Thanks,

Two versions of fluentd fighting over port in my cluster

Somehow, I have two versions of fluentd running in my cluster.
They end up fighting over the same port: each keeps cranking away, trying to start up on that port, and it saturates all the CPU in the cluster.
unexpected error error_class=Errno::EADDRINUSE error="Address already in use - bind(2) for 0.0.0.0:24231
/opt/google-fluentd/embedded/lib/ruby/2.6.0/socket.rb:201:in 'bind'
I've tried deleting the DaemonSets and Deployments, but they just keep coming back. I've also tried SSHing into the machines and killing the process on that port. Nothing seems to work.
Obviously, I only want one version of fluentd to run (and I'm not even sure which one).
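For anyone debugging something similar: the competing agents usually show up as DaemonSets in kube-system, so a quick way to see what is actually deployed is something like the sketch below (the grep pattern is just a guess at the names involved):
kubectl -n kube-system get daemonsets,pods -o wide | grep -i fluentd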
I seem to have fixed it. On the GCP dashboard's cluster edit page, the Kubernetes Engine Monitoring dropdown was blank; it seems not even the dropdown could decide what to display there.
It seems the automated agent, or whatever manages this, seriously messed up and ended up with two versions of the logging and monitoring system running, fighting over a port and crushing the CPU on every machine in the cluster. On top of that, I couldn't delete the DaemonSets, pods, or deployments. It seems Google treats these as special somehow, maybe via some kind of automated agent, I don't know.
From the dropdown, I just selected System and workload logging and monitoring, saved, and it applied the changes.
Everything looks good so far, but this whole event has me worried: I didn't do anything, this just happened.
This is a dev cluster, but if it was a production cluster...
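For reference, the same setting can apparently also be applied from the command line; a hedged sketch, assuming the cluster uses the legacy logging/monitoring service flags (CLUSTER_NAME and ZONE are placeholders):
gcloud container clusters update CLUSTER_NAME --zone ZONE \
  --logging-service=logging.googleapis.com/kubernetes \
  --monitoring-service=monitoring.googleapis.com/kubernetes
Those two values appear to correspond to the "System and workload logging and monitoring" option in the console.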

How to detect GKE autoupgrading a node in Stackdriver logs

We have a GKE cluster with auto-upgrading nodes. We recently noticed a node become unschedulable and eventually get deleted, and we suspect it was being upgraded automatically for us. Is there a way to confirm (or rule out) in Stackdriver that this was indeed what was happening?
You can use the following advanced logs queries with Cloud Logging (previously Stackdriver) to detect upgrades to node pools:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_nodepool"
and master:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_cluster"
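The same queries can also be run from the command line; a minimal sketch using the node-pool filter above (the --freshness window is just an example):
gcloud logging read 'protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal" AND resource.type="gke_nodepool"' --freshness=7d --limit=10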
Additionally, you can control when the updates are applied with Maintenance Windows (as the user aurelius mentioned).
I think your question has already been answered in the comments. Just as an addition: automatic upgrades occur at regular intervals at the discretion of the GKE team. To get more control you can create a Maintenance Window as explained here. This is basically a time frame that you choose in which automatic upgrades should occur.
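As a concrete illustration, a daily maintenance window can be set on an existing cluster; a hedged sketch, assuming the older daily-window form of the flag (CLUSTER_NAME is a placeholder and 03:00 is a UTC start time chosen as an example):
gcloud container clusters update CLUSTER_NAME --maintenance-window=03:00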

(How) do node pool autoupgrades in GKE actually work?

We have a fairly large kubernetes deployment on GKE, and we wanted to make our life a little easier by enabling auto-upgrades. The documentation on the topic tells you how to enable it, but not how it actually works.
We enabled the feature on a test cluster, but no nodes were ever upgraded (although the UI kept nagging us that "upgrades are available").
The docs say it would be updated to the "latest stable" version and that this occurs "at regular intervals at the discretion of the GKE team", neither of which is terribly helpful.
The UI always says: "Next auto-upgrade: Not scheduled"
Has someone used this feature in production and can shed some light on what it'll actually do?
What I did:
I enabled the feature on the nodepools (not the cluster itself); see the sketch after this list
I set up a maintenance window
Cluster version was 1.11.7-gke.3
Nodepools had version 1.11.5-gke.X
The newest available version was 1.11.7-gke.6
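For reference, a minimal sketch of how auto-upgrade can be enabled on an existing node pool (CLUSTER_NAME, POOL_NAME, and ZONE are placeholders):
gcloud container node-pools update POOL_NAME --cluster=CLUSTER_NAME --zone=ZONE --enable-autoupgrade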
What I expected:
The nodepool would be updated to either 1.11.7-gke.3 (the default cluster version) or 1.11.7-gke.6 (the most recent version)
The update would happen in the next maintenance window
The update would otherwise work like a "manual" update
What actually happened:
Nothing
The nodepools remained on 1.11.5-gke.X for more than a week
My question
Is the nodepool version supposed to update?
If so, at what time?
If so, to what version?
I'll finally answer this myself. The auto-upgrade does work, though it took several days to a week until the version was upgraded.
There is no indication of the planned upgrade date, or any feedback other than the version updating.
It will upgrade to the current master version of the cluster.
Addendum: it still doesn't work reliably, and there is still no way to debug it when it doesn't. One piece of information I got was that the mechanism does not work if you initially provided a specific version for the node pool. As it is not possible to deduce the inner workings of the auto-upgrades, we had to resort to checking the status manually again.
I wanted to share two other possibilities as to why a node-pool may not be auto-upgrading or scheduled to upgrade.
One of our projects was having a similar issue where the master version had auto-upgraded to 1.14.10-gke.27, but our node pool remained stuck at 1.14.10-gke.24 for over a month.
Reaching a node quota
The node-pool upgrade might be failing due to a node quota (although I'm not sure the web console would say Next auto-upgrade: Not scheduled in that case). The node upgrades documentation suggests running the following to view any failed upgrade operations:
gcloud container operations list --filter="STATUS=DONE AND TYPE=UPGRADE_NODES AND targetLink:https://container.googleapis.com/v1/projects/[PROJECT_ID]/zones/[ZONE]/clusters/[CLUSTER_NAME]"
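If that list contains a failed UPGRADE_NODES operation, its details (including any quota error message) can be inspected individually; a minimal sketch, where OPERATION_ID and ZONE are placeholders taken from the list output:
gcloud container operations describe OPERATION_ID --zone=ZONE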
Automatic node upgrades are for minor+ versions only
After exhausting my troubleshooting steps, I reached out to GCP Support and opened a case (Case 23113272, for anyone working at Google). They told me the following:
Automatic node upgrade:
The node version does not necessarily upgrade automatically. Let me explain: there are three kinds of upgrades for a node: minor versions (1.X), patch releases (1.X.Y), and security updates and bug fixes (1.X.Y-gke.N). Please take a look at this documentation [2]. The automatic node upgrade works from a minor version onwards, and in your case the upgrade was a security update, which can't upgrade automatically.
I responded and they confirmed that automatic node upgrades will only happen for minor versions and above. I have requested that they submit a request to update their documentation, because (at the time of this response) this is not outlined anywhere in their node auto-upgrade documentation.
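In the meantime, a patch or security release like this apparently has to be rolled out by hand; a hedged sketch of a manual node-pool upgrade (CLUSTER_NAME, POOL_NAME, and ZONE are placeholders; the version is the one from this example):
gcloud container clusters upgrade CLUSTER_NAME --node-pool=POOL_NAME --zone=ZONE --cluster-version=1.14.10-gke.27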
This feature replaces the VMs (Kubernetes nodes) in your node pool running the "old" Kubernetes version with VMs running the "new" version.
The node pool "upgrade" operation is done in a rolling fashion: it's not as if GKE deletes all your VMs and recreates them simultaneously (except when you have only one node in your cluster). By default, the nodes are replaced with newer nodes one by one (although this might change).
GKE internally uses mostly the features of managed instance groups to manage operations on node pools.
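As an aside, those underlying groups are visible with the ordinary Compute Engine tooling; a minimal sketch that lists the managed instance groups backing GKE node pools (the name filter is just a heuristic, since GKE-created groups are normally prefixed with gke-):
gcloud compute instance-groups managed list --filter="name~^gke-"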
You can find documentation on how to schedule node upgrades by specifying certain "maintenance windows" so you are impacted minimally. (This article also gives a bit more insights on how upgrades happen.)
That said, you can disable auto-upgrades and upgrade your cluster manually (although this is not recommended). Some GKE users have thousands of nodes, so for them upgrading VMs one by one is not feasible.
For that GKE offers an option that lets you choose "how many nodes are upgraded at a time":
gcloud container clusters upgrade CLUSTER_NAME \
--concurrent-node-count=CONCURRENT_NODE_COUNT
Documentation of this flag says:
The number of nodes to upgrade concurrently. Valid values are [1, 20]. It is a recommended best practice to set this value to no higher than 3% of your cluster size.

Service Fabric "Waiting for upgrade..." using VSTS

I've configured upgrades on my VSTS release of a Service Fabric app containing 5 services, deploying to a single-node test environment on Azure. Unfortunately, when it gets to the release part, it just hangs, saying "Waiting for upgrade..." over and over again. I left it for 15 hours and it still said the same thing. The initial deployment went ahead without issue.
I've looked at various posts about turning off health check times, but this has not been successful. I've also tried setting the mode to UnmonitoredAuto, but with no success.
I've RDPd onto the environment and checked the processor/memory usage in task manager, and everything is pretty much 0%, and very low memory usage.
Is there anything else I can do to stop the upgrade hanging?
OK, I've managed to fix this. It was happening because there is a PreUpgradeSafetyCheck that runs before an upgrade is rolled out. This is not relevant for a single-node cluster, as downtime is inevitable on a single-node cluster.
The status of an upgrade can be found using Get-ServiceFabricApplicationUpgrade, which shows the status above.
To fix this, there is a setting in the release task, UpgradeReplicaSetCheckTimeoutSec. Setting its value to 0 sorts things out.
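For completeness, the same timeout can apparently also be overridden from PowerShell on an upgrade that is already stuck; a hedged sketch, where the connection endpoint and application name are placeholders:
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westeurope.cloudapp.azure.com:19000"
Update-ServiceFabricApplicationUpgrade -ApplicationName "fabric:/MyApp" -UpgradeReplicaSetCheckTimeoutSec 0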