GKE Node Upgrade "Out of Resources"

I had left my GKE cluster running 3 minor versions behind the latest and decided to finally upgrade. The master upgrade went well, but then my node upgrades kept failing. I used the Cloud Shell console to manually start an upgrade and view the output, which said something along the lines of "Zone X is out of resources, try Y instead." Unfortunately, I can't just spin up a new node pool in a new zone and have my pipeline work, because I am using GitLab's Auto DevOps pipeline and it makes certain assumptions about node pool naming and such that I can't find any way to override. I also don't want to potentially lose the data stored in my persistent volumes if I end up needing to re-create everything in a new node pool.

I just solved this issue but couldn't find any questions posed on this particular problem, so I wanted to post the answer here in case someone else comes looking for it.
My particular problem was that I had a non-autoscaling node pool with a single node. For my purposes, that's enough for the application stack to run smoothly, and I don't want to incur unforeseen charges from additional nodes automatically being added to the pool. However, this meant the upgrade apparently had to share the node's resources with everything else already running on it, and there weren't enough to go around. The solution was simple: add more nodes temporarily.
Because this is specifically GKE, I was able to use a beta feature called "surge upgrade", which allows you to set the maximum number of "surge" nodes to add when performing an upgrade. Once this was enabled, I started the upgrade process again and it temporarily added an extra node, performed the upgrade, and then scaled back down to a single node.
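Enabling it comes down to a single command along these lines (pool, cluster and zone names are placeholders; the feature was under the beta track at the time):
# allow one temporary extra node during upgrades, keep all existing nodes schedulable
gcloud beta container node-pools update my-pool \
  --cluster my-cluster --zone us-central1-a \
  --max-surge-upgrade 1 --max-unavailable-upgrade 0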
If you aren't on GKE, or don't wish to use a beta feature (or can't), then simply resize the node pool that contains the node(s) needing the upgrade. I would add a single extra node unless you are positive you need more.
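On GKE, for example, temporarily resizing the pool is also a one-liner (names are placeholders); scale it back down once the upgrade has finished:
gcloud container clusters resize my-cluster --node-pool my-pool --num-nodes 2 --zone us-central1-a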

Related

How to reduce downtime caused by pulling images in the Kubernetes Recreate deployment strategy

Assume I have a Kubernetes Deployment object with the Recreate strategy. When I update the Deployment with a new container image version, Kubernetes will:
scale down/kill the existing Pods of the Deployment,
create the new Pods,
which will pull the new container images
so the new containers can finally run.
Of course, the Recreate strategy is expected to cause downtime between steps 1 and 4, where no Pod is actually running. However, step 3 can take a lot of time if the container images in question are large, or the container registry connection is slow, or both. In a test setup (Azure Kubernetes Services pulling a Windows container image from Docker Hub), I see it taking 5 minutes and more, which makes for a really long downtime.
So, what is a good option to reduce that downtime? Can I somehow get Kubernetes to pull the new images before killing the Pods in step 1 above? (Note that the solution should work with Windows containers, which are notoriously large, in case that is relevant.)
On the Internet, I have found this Codefresh article using a DaemonSet and Docker in Docker, but I guess Docker in Docker is no longer compatible with containerd.
I've also found this StackOverflow answer that suggests using an Azure Container Registry with Project Teleport, but that is in private preview and doesn't support Windows containers yet. Also, it's specific to Azure Kubernetes Services, and I'm looking for a more general solution.
Surely, this is a common problem that has a "standard" answer?
Update 2021-12-21: Since I've received an answer suggesting it, I'll clarify that I cannot easily change the deployment strategy. The application in question does not support running Pods of different versions at the same time because it uses a database that needs to be migrated to the corresponding application version, with no forward or backward compatibility.
Implement a "blue-green" deployment strategy. For instance, the service might be running and active in the "blue" state. A new deployment is created with a new container image, which deploys the "green" pods with the new container image. When all of the "green" pods are ready, the "switch live" step is run, which switches the active color. Very little downtime.
Obviously, this has tradeoffs. Your cluster will need more memory to run the additional transitional pods. The deployment process will be more complex.
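As a rough sketch of the "switch live" step, assuming a Service named my-app that routes by a color label and a hypothetical my-app-green.yaml manifest (all names are placeholders):
# roll out the new ("green") version alongside the running "blue" one
kubectl apply -f my-app-green.yaml
kubectl rollout status deployment/my-app-green
# once green is fully ready, repoint the Service selector so traffic switches over
kubectl patch service my-app -p '{"spec":{"selector":{"app":"my-app","color":"green"}}}'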
Via https://www.reddit.com/r/kubernetes/comments/oeruh9/can_kubernetes_prepull_and_cache_images/, I've found these ideas:
Implement a DaemonSet whose containers use the images I need and just sleep, so the images stay cached on every node.
Use http://github.com/mattmoor/warm-image, which has no Windows support.
Use https://github.com/ContainerSolutions/ImageWolf, which says, "ImageWolf is currently alpha software and intended as a PoC - please don't run it in production!"
Use https://github.com/uber/kraken, which seems to be a registry, not a pre-pulling solution.
Use https://github.com/dragonflyoss/Dragonfly (now https://github.com/dragonflyoss/Dragonfly2), which also seems to do something completely different.
Use https://github.com/senthilrch/kube-fledged, which looks exactly right and more mature than the others, but has no Windows support.
Use https://github.com/dcherman/image-cache-daemon, which has no Windows support.
Use https://goharbor.io/blog/harbor-2.1/, which also seems to be a registry, not a pre-pulling solution.
Use https://openkruise.io/docs/user-manuals/imagepulljob/, which also looks right, but a) OpenKruise is huge and I'm not sure I want to install this just to preload images, and b) it seems it has no Windows support.
So, it seems I have to implement this on my own, with a DaemonSet. I still hope someone can provide a better answer than this one 🙂.
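For reference, a minimal sketch of such a prepulling DaemonSet (the image name is a placeholder, and the sleep command would need to be swapped for a PowerShell Start-Sleep on Windows nodes):
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
spec:
  selector:
    matchLabels:
      app: image-prepuller
  template:
    metadata:
      labels:
        app: image-prepuller
    spec:
      containers:
      - name: prepull
        # placeholder: the new image you want cached on every node
        image: my-registry/my-app:new-tag
        # keep the pod alive so the image stays in the node's cache
        command: ["sleep", "infinity"]
EOF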

How to upsize volume of Terraformed EKS node

We have been using Terraform for almost a year now to manage all kinds of resources on AWS from bastion hosts to VPCs, RDS and also EKS.
We are sometimes really baffled by the EKS module. It could, however, be due to a lack of understanding (and documentation), so here goes:
Problem: Upsizing Disk (volume)
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "12.2.0"
cluster_name = local.cluster_name
cluster_version = "1.19"
subnets = module.vpc.private_subnets
#...
node_groups = {
first = {
desired_capacity = 1
max_capacity = 5
min_capacity = 1
instance_type = "m5.large"
}
}
I thought the default 20 GB would easily be enough for this (dev) cluster's node, but it's filling up fast, so I now want to change disk_size to, say, 40 GB.
=> I thought I could just add something like disk_size = 40 and be done.
terraform plan tells me it needs to replace the node. This is a 1-node cluster, so that's not good. And even if it weren't, I don't want to have to drain nodes; that's the whole reason we're using managed Kubernetes like EKS.
Expected behaviour: since these are elastic volumes, I should be able to upsize (but not downsize). Why is that not possible? I can definitely do so from the AWS UI.
Sure, with a slightly scary warning:
Are you sure that you want to modify volume vol-xx?
It may take some time for performance changes to take full effect.
You may need to extend the OS file system on the volume to use any newly-allocated space
But I can work with the provided docs on that: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html?icmpid=docs_ec2_console
Any guidelines on how to increase the storage? If I do so in the UI but don't touch Terraform, then my Terraform state will be out of sync.
To my knowledge, there is currently no way to resize an EKS node volume without recreating the node using Terraform.
Fortunately, there is a workaround: as you also found out, you can change the volume size directly via the AWS UI or API. To update your state file afterwards, run terraform apply -refresh-only to pull in the latest data (e.g., the increased volume size). After that, change disk_size in your Terraform configuration so that plan and state stay in sync.
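A sketch of that workflow with the AWS CLI (volume ID and size are placeholders):
# 1. grow the node's EBS volume outside of Terraform
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 40
# 2. pull the new size into the Terraform state without changing any resources
terraform apply -refresh-only
# 3. set disk_size = 40 in the module configuration so plan and state agree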
For the future, you might want to look into moving to ephemeral nodes as (at least my) experience shows that you will have unforeseeable changes to clusters and nodes from time to time. Already planning with replaceable nodes in mind will make these changes substantially easier.
By using the terraform-aws-eks Terraform module you are actually already following the "ephemeral nodes" paradigm, because for both ways of creating instances (self-managed workers or managed node groups) the module creates Auto Scaling Groups that spin up EC2 instances from a Launch Template.
ASGs and Launch Templates are specifically designed so that you no longer care about specific nodes, only about the number of nodes. To update the nodes, you simply replace them with new ones that use the updated Launch Template (with more GBs, an updated AMI, or a new instance type, for example).
This is called a "rolling update". It can be done manually (add a new instance, drain the old node, then delete it), with scripts (see eks-rolling-update on GitHub by HelloFresh), or automatically when you use AWS managed nodes (the ones you are actually using when specifying node_groups; that is why adding more GBs replaces the node automatically when you run apply).
This paradigm is the most common when operating Kubernetes in the cloud (and is also very common in on-premise datacenters that use virtualization).
Option 1) Self Managed Workers
With self-managed workers, when you change a parameter like disk_size or instance_type, the module changes the Launch Template. It updates the $latest version, which is commonly what the ASG points to (although that can be changed). This means that existing instances will not see any change, but new ones will come up with the updated configuration.
If you want to change the existing instances, you actually want to replace them with new ones. That is the ephemeral nodes paradigm.
One by one, you can drain the old instances while increasing desired_capacity on the ASG, or let the cluster autoscaler do the job. Alternatively, you can use an automated script which does this for you for each ASG: https://github.com/hellofresh/eks-rolling-update
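Roughly, the manual version of that rolling replacement looks like this (all names and IDs are placeholders):
# scale the ASG up so a replacement node joins the cluster
aws autoscaling set-desired-capacity --auto-scaling-group-name my-eks-asg --desired-capacity 2
# move workloads off the old node
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data
# remove the old node and let the ASG shrink back to its original size
kubectl delete node ip-10-0-1-23.ec2.internal
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 --should-decrement-desired-capacity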
In the terraform-aws-eks module, you create self-managed workers by using either the worker_groups or the worker_groups_launch_template (recommended) field.
Option 2) Managed Nodes
Managed nodes are an EKS-specific feature. You configure them very similarly, but in reality they are an abstraction: AWS creates and manages the actual underlying ASG.
You can specify a Launch Template for the ASG to use, and which version of it. Some config can be set at the managed node group level (e.g. AMI and instance_types) and some at the Launch Template (if it wasn't specified in the former).
Any change to the node group level config, or to the Launch Template version in use, triggers an automatic rolling update, which replaces all old instances.
You can delay that rolling update by simply not pointing to the $latest version (or by pointing to $default and not updating the $default tag when changing the LT).
In the terraform-aws-eks module, you create managed node groups by using the node_groups field. You can also play with the settings create_launch_template=true and set_instance_types_on_lt=true if you want the module to create the LT for you and to set the instance_type on that LT as described above (alternatively, you can not use an LT at all, or pass a reference to an existing one).
But the behavior is similar to worker groups: in no case will your existing instances be changed in place. You can only change them manually.
However, there is an alternative: The manual way
You can use the EKS module to create the control plane, but then use a regular EC2 instance resource in Terraform (https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance) to create one or multiple instances (using count or for_each).
If you create the instances using the aws_instance resource, then Terraform will patch those instances (update them in place) whenever the change allows it (e.g. increasing the root volume size or changing the instance type), whereas changing the AMI will force a replacement.
The only tricky part is that you need to configure the cloud-init script so that the instance joins the cluster (something the EKS module does automatically when using self-managed or managed node groups).
However, it is very possible, and you can borrow the script from the module and plug it into the aws_instance's user_data field (https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance#user_data).
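The user_data essentially boils down to calling the bootstrap script that ships with the EKS-optimized AMIs (cluster name is a placeholder; extra kubelet arguments left out):
#!/bin/bash
# /etc/eks/bootstrap.sh is included in the EKS-optimized AMIs; it configures
# the kubelet and joins the instance to the cluster
/etc/eks/bootstrap.sh my-cluster-name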
In this case (when talking about disk_size), however, you still need to grow the XFS filesystem manually (either over SSH, or with a hacky exec from Terraform) so that it sees the increased disk space.
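On the node itself that usually comes down to something like the following (device and partition names are assumptions and depend on the instance type and AMI):
# grow the partition to fill the enlarged EBS volume, then grow the XFS filesystem
sudo growpart /dev/nvme0n1 1
sudo xfs_growfs -d /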
Another alternative: Consider Kubernetes storage
That said, there is also another alternative for certain use cases. If you want to increase the disk space of those instances because one of your applications uses a hostPath, it might be the case that you can use a Kubernetes built-in storage solution with the EBS CSI driver instead.
For example, I manage an Elasticsearch cluster in Kubernetes (and deploy it from Terraform with the helm module), and it uses dynamic storage provisioning to request an EBS volume (note that performance is the same, because both the root volume and this other volume are EBS volumes). The EBS CSI driver supports volume expansion, so I can increase this disk just by changing a Terraform variable.
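A minimal sketch of such a setup, assuming the EBS CSI driver is already installed in the cluster (the StorageClass name and volume type are placeholders):
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-expandable
provisioner: ebs.csi.aws.com
# lets you grow a PVC later and have the underlying EBS volume expand with it
allowVolumeExpansion: true
parameters:
  type: gp3
EOF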
To conclude, I would not recommend the aws_instance way unless you understand it and are sure you really want it. It may make sense in certain cases, but it's definitely not common.

How to detect GKE autoupgrading a node in Stackdriver logs

We have a GKE cluster with auto-upgrading nodes. We recently noticed a node become unschedulable and eventually get deleted, and we suspect it was being upgraded automatically for us. Is there a way to confirm (or rule out) in Stackdriver that this was indeed what was happening?
You can use the following advanced logs queries with Cloud Logging (previously Stackdriver) to detect upgrades to node pools:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_nodepool"
and master:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_cluster"
Additionally, you can control when the updates are applied with Maintenance Windows (as the user aurelius mentioned).
I think your question has already been answered in the comments. Just as an addition: automatic upgrades occur at regular intervals at the discretion of the GKE team. To get more control, you can create a maintenance window as explained here. This is basically a time frame that you choose in which automatic upgrades should occur.
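Setting one up is a single command, for example (cluster name is a placeholder; the start time is interpreted as UTC):
gcloud container clusters update my-cluster --maintenance-window 03:00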

(How) do node pool autoupgrades in GKE actually work?

We have a fairly large kubernetes deployment on GKE, and we wanted to make our life a little easier by enabling auto-upgrades. The documentation on the topic tells you how to enable it, but not how it actually works.
We enabled the feature on a test cluster, but no nodes were ever upgraded (although the UI kept nagging us that "upgrades are available").
The docs say it would be updated to the "latest stable" version and that it occurs "at regular intervals at the discretion of the GKE team", neither of which is terribly helpful.
The UI always says: "Next auto-upgrade: Not scheduled"
Has someone used this feature in production and can shed some light on what it'll actually do?
What I did:
I enabled the feature on the nodepools (not the cluster itself)
I set up a maintenance window
Cluster version was 1.11.7-gke.3
Nodepools had version 1.11.5-gke.X
The newest available version was 1.11.7-gke.6
What I expected:
The nodepool would be updated to either 1.11.7-gke.3 (the default cluster version) or 1.11.7-gke.6 (the most recent version)
The update would happen in the next maintenance window
The update would otherwise work like a "manual" update
What actually happened:
Nothing
The nodepools remained on 1.11.5-gke.X for more than a week
My question
Is the nodepool version supposed to update?
If so, at what time?
If so, to what version?
I'll finally answer this myself. The auto-upgrade does work, though it took several days to a week until the version was upgraded.
There is no indication of the planned upgrade date, or any feedback other than the version updating.
It will upgrade to the current master version of the cluster.
Addition: It still doesn't work reliably, and there is still no way to debug it when it doesn't. One piece of information I got was that the mechanism does not work if you initially provided a specific version for the node pool. As it is not possible to deduce the inner workings of the auto-upgrades, we had to resort to manually checking the status again.
I wanted to share two other possibilities as to why a node-pool may not be auto-upgrading or scheduled to upgrade.
One of our projects was having a similar issue where the master version had auto-upgraded to 1.14.10-gke.27 but our node pool stayed stuck at 1.14.10-gke.24 for over a month.
Reaching a node quota
The node-pool upgrade might be failing due to a node quota (although I'm not sure the web console would say Next auto-upgrade: Not scheduled in that case). The node upgrades documentation suggests running the following to view any failed upgrade operations:
gcloud container operations list --filter="STATUS=DONE AND TYPE=UPGRADE_NODES AND targetLink:https://container.googleapis.com/v1/projects/[PROJECT_ID]/zones/[ZONE]/clusters/[CLUSTER_NAME]"
Automatic node upgrades are for minor+ versions only
After exhausting my troubleshooting steps, I reached out to GCP Support and opened a case (Case 23113272 for anyone working at Google). They told me the following:
Automatic node upgrade:
The node version could not necessary upgrade automatically, let me explain, exists three upgrades in a node: Minor versions (1.X), Patch releases (1.X.Y) and Security updates and bug fixes (1.X.Y-gke.N), please take a look at this documentation [2] the automatic node upgrade works from a minor version and in your case the upgrade was a security update that can't upgrade automatically.
I responded, and they confirmed that automatic node upgrades will only happen for minor versions and above. I have requested that they submit a request to update their documentation because (at the time of this response) this is not outlined anywhere in their node auto-upgrade documentation.
This feature replaces the VMs (Kubernetes nodes) in your node pool running the "old" Kubernetes version with VMs running the "new" version.
The node pool "upgrade" operation is done in a rolling fashion: It's not like GKE deletes all your VMs and recreates them simultaneously (except when you have only 1 node in your cluster). By default, the nodes are replaced with newer nodes one-by-one (although this might change).
GKE internally uses mostly the features of managed instance groups to manage operations on node pools.
You can find documentation on how to schedule node upgrades by specifying certain "maintenance windows" so you are impacted minimally. (This article also gives a bit more insights on how upgrades happen.)
That said, you can disable auto-upgrades and upgrade your cluster manually (although this is not recommended). Some GKE users have thousands of nodes, so for them upgrading VMs one by one is not feasible.
For that, GKE offers an option that lets you choose how many nodes are upgraded at a time:
gcloud container clusters upgrade \
--concurrent-node-count=CONCURRENT_NODE_COUNT
Documentation of this flag says:
The number of nodes to upgrade concurrently. Valid values are [1, 20]. It is a recommended best practice to set this value to no higher than 3% of your cluster size.

What is the recommended way to upgrade a kubernetes cluster as new versions are released?

What is the recommended way to upgrade a kubernetes cluster as new versions are released?
I heard here it may be https://github.com/kubernetes/kubernetes/blob/master/cluster/kube-push.sh. If that is the case how does kube-push.sh relate to https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cluster/gce/upgrade.sh?
I've also heard here that we should instead create a new cluster, copy/move the pods, replication controllers, and services from the first cluster to the new one and then turn off the first cluster.
I'm running my cluster on aws if that is relevant.
The second script you reference (gce/upgrade.sh) only works if your cluster is running on GCE. There isn't (yet) an equivalent script for AWS, but you could look at the script and follow the steps (or write them into a script) to get the same behavior.
The main difference between upgrade.sh and kube-push.sh is that the former does a replacement upgrade (remove a node, create a new node to replace it) whereas the latter does an "in place" upgrade.
Removing and replacing the master node only works if the persistent data (etcd database, server certificates, authorized bearer tokens, etc.) reside on a persistent disk separate from the boot disk of the master (this is how it is configured by default in GCE). Removing and replacing nodes should be fine on AWS (but keep in mind that any pods not under a replication controller won't be restarted).
Doing an in-place upgrade doesn't require any special configuration, but that code path isn't as thoroughly tested as the remove and replace option.
You shouldn't need to entirely replace your cluster when upgrading to a new version, unless you are using pre-release versions (e.g. alpha or beta releases) which can sometimes have breaking changes between them.