How to revert pending `kops` changes?

What I have done:
$ kops edit ig nodes ## added some nodes by modifying min/max nodes parameters
$ ... ## did some other modifications
$ kops update cluster ## saw the pending changes before applying them to the cluster
In the kops update cluster preview I saw some undesired changes and wanted to revert all of them and start over.
But given that the pending changes live in a state file in S3, I was not able to revert them easily.
I realized I should have had S3 versioning enabled to revert the changes quickly and easily, but I didn't, and I needed another way of cancelling the pending changes.
Do you know how to achieve that?
PS. I googled for "kops drop pending changes" and similar; I browsed the kops manual page; I tried accessing kops from multiple accounts before I realized that the changes are shared via S3... Nothing helped.
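For future reference, versioning on the kops state-store bucket can be switched on with a single AWS CLI call (the bucket name below is a placeholder):
$ aws s3api put-bucket-versioning \
    --bucket my-kops-state-store \
    --versioning-configuration Status=Enabled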
UPDATE:
See https://kops.sigs.k8s.io/tutorial/working-with-instancegroups/, which says:
To preview the change:
kops update cluster
...
Will modify resources:
*awstasks.LaunchTemplate LaunchTemplate/mycluster.mydomain.com
InstanceType t2.medium -> t2.large
Presuming you're happy with the change, go ahead and apply it: kops update cluster --yes
So - what I want to do is to revert the not-yet-applied changes if I'm not happy with them. Is there any way other than kops edit ig nodes and manually changing t2.large back to t2.medium, repeating that for every change I want to revert?

As the production cluster recommendations state, keeping the cluster configuration in version control is the way to avoid these scenarios.
But given that you are already facing a challenge now ...
It depends on what is introducing the change. If it is a change in your cluster config alone, reverting those changes would be the way to go.
If the changes come from a kOps upgrade, there is no real way of reverting them. kOps should never be downgraded, and the changes it tries to make are there for good reasons.
If you can share what the unwanted changes are, I may be able to determine how to revert them.
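For the instance-group case from the question, a manual revert is just a round trip through the spec stored in S3. A minimal sketch, assuming the unwanted drift lives only in the nodes instance group and using a placeholder file name:
$ kops get ig nodes -o yaml > nodes.yaml
$ vi nodes.yaml              ## put the old values back, e.g. t2.large -> t2.medium
$ kops replace -f nodes.yaml
$ kops update cluster        ## preview again; it should now show no pending changes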

Kubernetes - Upgrading Kubernetes-cluster version through Terraform

I assume there are no stupid questions, so here is one that I could not find a direct answer to.
The situation
I currently have a Kubernetes cluster running 1.15.x on AKS, deployed and managed through Terraform. Azure recently announced that they would retire version 1.15 of Kubernetes on AKS, and I need to upgrade the cluster to 1.16 or later. Now, as I understand the situation, upgrading the cluster directly in Azure would have no consequences for the contents of the cluster, i.e. nodes, pods, secrets and everything else currently on there, but I cannot find any proper answer to what would happen if I upgrade the cluster through Terraform.
Potential problems
So what could go wrong? In my mind, the worst outcome would be that the entire cluster would be destroyed and a new one would be created. No pods, no secrets, nothing. Since there is so little information out there, I am asking here to see if there is anyone with more experience with Terraform and Kubernetes who could potentially help me out.
To summarize:
Terraform versions
Terraform v0.12.17
+ provider.azuread v0.7.0
+ provider.azurerm v1.37.0
+ provider.random v2.2.1
What I'm doing
$ terraform init
// running terraform plan with the new Kubernetes version declared for AKS
$ terraform plan
// The following changes are announced by Terraform:
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
~ update in-place
Terraform will perform the following actions:
#module.mycluster.azurerm_kubernetes_cluster.default will be updated in-place...
...
~ kubernetes_version = "1.15.5" -> "1.16.13"
...
Plan: 0 to add, 1 to change, 0 to destroy.
What I want to happen
Terraform will tell Azure to upgrade the existing AKS service, not destroy it before creating a new one. I assume this will happen, as Terraform announces that it will "update in-place" instead of adding new and/or destroying existing clusters.
I found this question today and thought I'd add my experience as well. I made the following changes:
Changed the kubernetes_version under azurerm_kubernetes_cluster from "1.16.15" -> "1.17.16"
Changed the orchestrator_version under default_node_pool from "1.16.15" -> "1.17.16"
Increased the node_count under default_node_pool from 1 -> 2
A terraform plan showed that it was going to update in-place. I then performed a terraform apply which completed successfully. kubectl get nodes showed that an additional node was created, but both nodes in the pool were still on the old version. After further inspection in Azure Portal it was found that only the k8s cluster version was upgraded and not the version of the node pool.
I then executed terraform plan again, and again it showed that the orchestrator_version under default_node_pool was going to be updated in-place. I then executed terraform apply, which then proceeded to upgrade the version of the node pool. It did that whole thing where it creates an additional node in the pool (with the new version) and sets the status to NodeSchedulable while setting the existing node in the pool to NodeNotSchedulable. The NodeNotSchedulable node is then replaced by a new node with the new k8s version and eventually set to NodeSchedulable. It did this for both nodes.
Afterwards all nodes were upgraded without any noticeable downtime.
I'd say this shows that the Terraform method is non-destructive, even if there have at times been oversights in the upgrade process (which were still non-destructive, as in this example): https://github.com/terraform-providers/terraform-provider-azurerm/issues/5541
If you need higher confidence for this change then you could alternatively consider using the Azure-based upgrade method, refreshing the changes back into your state, and tweaking the code until a plan generation doesn't show anything intolerable. The two azurerm_kubernetes_cluster arguments dealing with version might be all you need to tweak.
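A sketch of that more cautious loop with the Terraform CLI (the plan file name is arbitrary):
$ terraform refresh                    // pull the Azure-side changes back into the state
$ terraform plan -out=aks-upgrade.tfplan
$ terraform show aks-upgrade.tfplan    // confirm only "update in-place" actions are listed
$ terraform apply aks-upgrade.tfplan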

Regarding GKE optimised image hardening

We tried to harden the GKE optimized image (gke-1.15.11) for our cluster. We SSHed into the node instance, made the CIS-proposed changes in the /home/kubernetes/kubelet-config.yaml file, and ran kube-bench to check whether all the conditions passed; around 8 conditions failed, and these were exactly the conditions we had changed in the file. We then made the same argument changes in /etc/default/kubernetes and ran kube-bench again, and the conditions passed. But when we restarted the instance, all the changes we had made in the /etc/default/kubernetes file were gone. Can someone let me know where we are going wrong, or whether there is another path where we have to make the CIS-benchmark-suggested entries?
GKE doesn't support user-provided node images as of April 2020. The recommended option is to create your own DaemonSet that writes to the host filesystem and/or restarts host services to propagate all the required changes.
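A rough sketch of that DaemonSet approach, assuming the goal is to re-assert one kubelet setting after every node boot; the name, image and the concrete setting are placeholders, and the kubelet restart itself is only described in comments because it depends on the tooling available in the image:
$ kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kubelet-hardening
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kubelet-hardening
  template:
    metadata:
      labels:
        app: kubelet-hardening
    spec:
      hostPID: true
      tolerations:
      - operator: Exists                 # run on every node, tainted or not
      containers:
      - name: apply-settings
        image: debian:stable-slim        # placeholder; any image with a shell works
        securityContext:
          privileged: true
        command:
        - /bin/sh
        - -c
        - |
          # Illustrative change only: make sure one CIS-related option is
          # present in the node's kubelet config file.
          grep -q 'protectKernelDefaults: true' /host/kubelet-config.yaml || \
            echo 'protectKernelDefaults: true' >> /host/kubelet-config.yaml
          # Restarting kubelet so it re-reads the file is intentionally left
          # out; with hostPID and a privileged container this is usually done
          # by nsenter-ing into PID 1 and calling systemctl, if the image
          # ships nsenter.
          while true; do sleep 3600; done
        volumeMounts:
        - name: kubelet-dir
          mountPath: /host
      volumes:
      - name: kubelet-dir
        hostPath:
          path: /home/kubernetes
EOF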

GKE Node Upgrade "Out of Resources"

I had left my GKE cluster running 3 minor versions behind the latest and decided to finally upgrade. The master upgrade went well, but then my node upgrades kept failing. I used the Cloud Shell console to manually start an upgrade and view the output, which said something along the lines of "Zone X is out of resources, try Y instead." Unfortunately, I can't just spin up a new node pool in a new zone and have my pipeline work, because I am using GitLab's AutoDevOps pipeline and it makes certain assumptions about node pool naming and such that I can't find any way to override. I also don't want to potentially lose the data stored in my persistent volumes if I end up needing to re-create everything in a new node pool.
I just solved this issue but couldn't find any questions posed on this particular problem, so I wanted to post the answer here in case someone else comes looking for it.
My particular problem was that I had a non-autoscaling node pool with a single node. For my purposes, that's enough for the application stack to run smoothly, and I don't want to incur unforeseen charges from additional nodes automatically being added to the pool. However, this meant that the upgrade apparently had to share resources with everything else running on that node, and there weren't enough to perform the upgrade. The solution was simple: add more nodes temporarily.
Because this is specifically GKE, I was able to use a beta feature called "surge upgrade", which allows you to set the maximum number of "surge" nodes to add when performing an upgrade. Once this was enabled, I started the upgrade process again and it temporarily added an extra node, performed the upgrade, and then scaled back down to a single node.
If you aren't on GKE, or don't wish to (or can't) use a beta feature, then simply resize the node pool containing the node(s) that need upgrading. I would add a single node unless you are positive you need more.
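Roughly, the two options look like this with gcloud (cluster, pool and zone names are placeholders; at the time, the surge flags still required the beta track, i.e. gcloud beta ...):
# Option A: enable surge upgrades on the pool, then retry the node upgrade.
$ gcloud container node-pools update default-pool \
    --cluster=my-cluster --zone=us-central1-a \
    --max-surge-upgrade=1 --max-unavailable-upgrade=0
$ gcloud container clusters upgrade my-cluster \
    --node-pool=default-pool --zone=us-central1-a

# Option B: temporarily add a node, upgrade, then scale back down afterwards.
$ gcloud container clusters resize my-cluster \
    --node-pool=default-pool --num-nodes=2 --zone=us-central1-a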

How to detect GKE autoupgrading a node in Stackdriver logs

We have a GKE cluster with auto-upgrading nodes. We recently noticed a node become unschedulable and eventually get deleted, and we suspect it was being upgraded automatically for us. Is there a way to confirm (or otherwise) in Stackdriver that this was indeed what was happening?
You can use the following advanced logs queries with Cloud Logging (previously Stackdriver) to detect upgrades to node pools:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_nodepool"
and for the master:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_cluster"
Additionally, you can control when the updates are applied with Maintenance Windows (as the user aurelius mentioned).
I think your question has already been answered in the comments. Just as an addition: automatic upgrades occur at regular intervals at the discretion of the GKE team. To get more control you can create a Maintenance Window as explained here. This is basically a time frame that you choose in which automatic upgrades should occur.
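For reference, the same filters can be run from the CLI, and a simple daily maintenance window can be set with one gcloud call (cluster name, zone and start time are placeholders):
# Read the node-pool upgrade events without opening the console.
$ gcloud logging read \
    'protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal" AND resource.type="gke_nodepool"' \
    --limit=10

# Restrict automatic upgrades to a daily window starting at 03:00 UTC.
$ gcloud container clusters update my-cluster --zone=us-central1-a \
    --maintenance-window=03:00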

Is it possible to do a rolling update and retain same version in Kubernetes?

Background:
We're currently using a continuous delivery pipeline and at the end of the pipeline we deploy the generated Docker image to some server(s) together with the latest application configuration (set as environment variables when starting the Docker container). The continuous delivery build number is used as version for the Docker image and it's currently also this version that gets deployed to the server(s).
Sometimes though we need to update the application configuration (environment variables) and reuse an existing Docker image. Today we simply deploy an existing Docker image with an updated configuration.
Now we're thinking of switching to Kubernetes instead of our home-built solution. Thus it would be nice for us if the version number generated by our continuous delivery pipeline is reflected as the pod version in Kubernetes as well (even if we deploy the same version of the Docker image that is currently deployed but with different environment variables).
Question:
I've read the documentation of rolling-update but it doesn't indicate that you can do a rolling-update and only change the environment variables associated with a pod without changing its version.
Is this possible?
Is there a workaround?
Is this something we should avoid altogether and use a different approach that is more "Kubernetes friendly"?
Rolling update just scales down one replication controller and scales up another one. Therefore, it deletes the old pods and makes new pods, at a controlled rate. So, if the new replication controller json file has different env vars and the same image, then the new pods will have that too.
In fact, even if you don't change anything in the json file, except one label value (you have to change some label), then you will get new pods with the same image and env. I guess you could use this to do a rolling restart?
You get to pick what label(s) you want to change when you do a rolling update. There is no formal Kubernetes notion of a "version". You can make a label called "version" if you want, or "contdelivver" or whatever.
I think if I were in your shoes, I would look at two options:
Option 1: put (at least) two labels on the rcs, one for the docker image version (which, IIUC is also a continuous delivery version), and one for the "environment version". This could be a git commit, if you store your environment vars in git, or something more casual. So, your pods could have labels like "imgver=1.3,envver=a34b87", or something like that.
Option 2: store the current best known replication controller, as a json (or yaml) file in version control (git, svn, whatevs). Then use the revision number from version control as a single label (e.g. "version=r346"). This is not the same as your continuous delivery label.
It is a label for the whole configuration of the pod.
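As a hedged sketch of how an env-only rolling update looked with replication controllers (the names, file and label values here are made up): the new rc file keeps the same image and differs only in its name, one selector label, and the env vars, and kubectl rolling-update swaps the pods over at a controlled rate.
# my-app-cd42.json is a copy of the running rc that changes only:
#   - metadata.name        (my-app-cd41 -> my-app-cd42)
#   - one selector label   (e.g. contdelivver: cd41 -> cd42)
#   - the environment variables you want to roll out
$ kubectl rolling-update my-app-cd41 -f my-app-cd42.json

# Afterwards the new pods can be selected by the new label value:
$ kubectl get pods -l contdelivver=cd42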