helm rollback fails to identify the failed deployments when re-triggered - kubernetes

I have a scenario like below,
Have two releases - Release-A and Release-B.
Currently, I am on Release-A and need an upgrade of all the microservices to Release-B.
I tried performing a helm upgrade of the microservice "mymicroservice" with the below command to deliver Release-B.
helm --kubeconfig /home/config upgrade --namespace testing --install --wait mymicroservice mymicroservice-release-b.tgz
Because of some issue, the deployment object failed to install and went into an error state.
Observing this, I perform the below rollback command.
helm --kubeconfig /home/config --namespace testing rollback mymicroservice
Due to some issue (maybe an intermittent system failure or user behavior), Release-A's deployment object also went into a failed/CrashLoopBackOff state. Although the helm rollback itself reports success, the deployment object still never enters the running state.
Once I make the necessary corrections, I retry the rollback. Since Helm already considers the deployment spec up to date, it never attempts to re-apply the deployment objects, even though they are in a failed state.
Is there any option in Helm to handle the above scenarios?
I tried the --force flag, but that approach produces other errors related to replacing the Service object of the microservice:
Rollback "mymicroservice -monitoring" failed: failed to replace object: Service "mymicroservice-monitoring" is invalid: spec.clusterIP: Invalid value: "": field is immutable

Maybe this can help you out:
Always use the helm upgrade --install command. I've seen you're already doing this, so you're doing well. It installs the charts if they're not present and upgrades them if they are.
Use the --atomic flag to roll back changes in the event of a failed operation during helm upgrade.
Also use the --cleanup-on-fail flag: it allows Helm to delete newly created resources when an upgrade or rollback fails.
From doc:
--atomic: if set, upgrade process rolls back changes made in case of failed upgrade. The --wait flag will be set automatically if --atomic is used
--cleanup-on-fail allow deletion of new resources created in this upgrade when upgrade fails
There are cases where an upgrade creates a resource that was not present in the last release. Setting this flag allows Helm to remove those new resources if the release fails. The default is to not remove them (Helm tends to avoid destruction-as-default, and give users explicit control over this)
https://helm.sh/docs/helm/helm_upgrade/
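Putting the suggestions together with the command from the question, a combined invocation might look like this (a sketch only; verify the flags against your Helm version; the kubeconfig path, namespace, and chart archive are taken from the question):
helm --kubeconfig /home/config upgrade --install --atomic --cleanup-on-fail --wait --namespace testing mymicroservice mymicroservice-release-b.tgz
With --atomic, a failed upgrade is rolled back automatically, so a manual helm rollback should rarely be needed afterwards.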

IIRC, helm rollback rolls back to the previous revision, whether it is good or not. So if your previous attempts resulted in a failure and you try to roll back, you will roll back to a broken version.
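If that is what happened, you can check the revision history and roll back to an explicit known-good revision instead of the implicit previous one (a sketch reusing the asker's kubeconfig and namespace; the revision number is a placeholder):
helm --kubeconfig /home/config --namespace testing history mymicroservice
helm --kubeconfig /home/config --namespace testing rollback mymicroservice <known-good-revision>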

Related

UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

Yesterday I stopped a helm upgrade while it was running in a release pipeline in Azure DevOps, and the following deployments failed.
I tried to find the failed chart with the aim of deleting it, but the chart of the microservice ("auth") doesn't appear. I used the command helm list -n [namespace_of_AKS] and it doesn't show up there.
What can I do to solve this problem?
Error in Azure Release Pipeline
2022-03-24T08:01:39.2649230Z Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
2022-03-24T08:01:39.2701686Z ##[error]Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
Helm List
This error can happen for a few reasons, but it most commonly occurs when there is an interruption during the upgrade/install process, as you already mentioned.
To fix it, you may need to first roll back to another revision, then reinstall or run helm upgrade again.
Try the command below to list the releases:
helm ls --namespace <namespace>
You may notice that running that command shows no rows of information.
Next, check the history of the release:
helm history <release> --namespace <namespace>
This usually shows that the original installation was never completed successfully and the release is stuck in a pending state, something like STATUS: pending-upgrade.
To escape from this state, use the rollback command:
helm rollback <release> <revision> --namespace <namespace>
The revision is optional, but you should try to provide it.
You may then issue your original command again to upgrade or reinstall.
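As a concrete walk-through for the "auth" release from the question (hypothetical; the namespace placeholder, chart reference, and revision number must be adapted to your setup):
helm history auth --namespace [namespace_of_AKS]
helm rollback auth <last-good-revision> --namespace [namespace_of_AKS]
helm upgrade --install auth <repo>/<chart> --namespace [namespace_of_AKS]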
helm ls -a -n {namespace} will list all releases within a namespace, regardless of status.
You can also use helm ls -aA instead to list all releases in all namespaces -- in case you actually deployed the release to a different namespace (I've done that before)
Try deleting the latest Helm release secret for the deployment and re-run your helm upgrade command.
kubectl get secret -A | grep <app-name>
kubectl delete secret <secret> -n <namespace>
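For reference, Helm 3 stores its release history in Secrets named sh.helm.release.v1.<release>.v<revision>, so for the "auth" release the commands above might look roughly like this (hypothetical; the revision number and namespace are placeholders):
kubectl get secret -n [namespace_of_AKS] | grep sh.helm.release.v1.auth
kubectl delete secret sh.helm.release.v1.auth.v3 -n [namespace_of_AKS]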

Is it possible to roll back services?

In k8s you can roll back a deployment. Can you also roll back a service?
Rolling back a Service may be helpful if an erroneous update was made to a Service resource.
rollback / rollout undo is not available for the Service resource (see the example after the list below):
kubectl rollout
Manage the rollout of a resource.
Valid resource types include:
* deployments
* daemonsets
* statefulsets
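To illustrate with a hypothetical pair of resources (names are placeholders):
kubectl rollout undo deployment/my-backend   # works: Deployments keep a rollout history
kubectl rollout undo service/my-backend      # fails: Service is not a valid resource type for rollout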
There is no option to roll back the Service, as answered by confused genius; however, I am just adding my 50 cents.
If you are using a Helm chart for deployment, you can roll back all the resources together if your deployment fails.
So while upgrading the Helm release you can use --atomic, which will automatically roll back the resources if your deployment fails.
$ helm upgrade --atomic -f myvalues.yaml -f override.yaml redis ./redis
--atomic: if set, upgrade process rolls back changes made in case of failed upgrade. The --wait flag will be set automatically if --atomic is used
Read more about --atomic in the Helm docs.
But again, there is no built-in support for rolling back a Service the way there is for a Deployment.

Helm Release with existing resources

Previously we only used helm template to generate the manifests and applied them to the cluster. Recently we started planning to use helm install to manage our deployments, but we are running into the following problem:
Our deployment is a simple backend API consisting of an Ingress, a Service, and a Deployment; when there is a new commit, the pipeline is triggered to deploy.
We plan to use the short commit SHA as both the image tag and the Helm release name. Here is the command:
helm upgrade --install releaseName repo/chartName -f value.yaml --set image.tag=SHA
This runs perfectly fine the first time, but when I create another release it fails with the following error message:
rendered manifests contain a resource that already exists. Unable to continue with install: Service "app-svc" in namespace "ns" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "rel-124": current value is "rel-123"
The error message is pretty clear about what the issue is, but I am just wondering what the "correct" way of using Helm is in this case?
It is not practical to uninstall everything for each new release, and I also don't want to keep using the same release.
You are already doing it the "right" way; just don't change the release name. That is the key Helm uses to identify its resources. It seems that you previously used a different release name (rel-123) than you are using now (rel-124).
To fix your immediate problem, you should be able to proceed by updating the value of the meta.helm.sh/release-name annotation on the problematic resource. Something like this should do it:
kubectl annotate --overwrite service app-svc meta.helm.sh/release-name=rel-124
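If Helm still refuses to adopt the resource, its ownership check also looks at the release-namespace annotation and the managed-by label, so something along these lines may be needed as well (an assumption about your setup; the names are taken from the error message above):
kubectl annotate --overwrite service app-svc -n ns meta.helm.sh/release-namespace=ns
kubectl label --overwrite service app-svc -n ns app.kubernetes.io/managed-by=Helm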

Helm incorrectly shows upgrade failed status

When using helm install/upgrade, some percentage of the time I get this failure:
Failed to install app MyApp. Error: UPGRADE FAILED: timed out waiting for the condition
This is because the app sometimes needs a bit more time to be up and running.
When I get this message, the install/upgrade isn't actually stopped; the rollout keeps progressing and succeeds in the end, and my whole cluster is fully functional.
However, Helm still shows this failed status for the release. On one hand it is pretty annoying; on the other hand, it can mess up a correctly installed release.
How can I remove this false error and get into a 'deployed' state (without a new install/upgrade)?
What you might find useful here are the two following options:
--wait: Waits until all Pods are in a ready state, PVCs are bound, Deployments have minimum (Desired minus maxUnavailable) Pods in a ready state, and Services have an IP address (and Ingress if a LoadBalancer) before marking the release as successful. It will wait for as long as the --timeout value. If the timeout is reached, the release will be marked as FAILED. Note: in scenarios where a Deployment has replicas set to 1 and maxUnavailable is not set to 0 as part of the rolling update strategy, --wait will return as ready once it has satisfied the minimum Pods in a ready condition.
--timeout: A value in seconds to wait for Kubernetes commands to complete. This defaults to 5m0s.
Helm install and upgrade commands include two CLI options to assist in checking the deployments: --wait and --timeout. When using --wait, Helm will wait until a minimum expected number of Pods in the deployment are launched before marking the release as successful. Helm will wait as long as what is set with --timeout.
Also, please note that this is not a full list of cli flags. To see a description of all flags, just run helm <command> --help.
If you want to check why your chart might have failed, you can use the helm history command.
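For example, a longer timeout combined with --wait, followed by a history check, might look like this (a sketch; the release name myapp and the chart path are placeholders):
helm upgrade --install myapp ./myapp --wait --timeout 10m0s
helm history myapp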

helmfile sync vs helmfile apply

sync sync all resources from state file (repos, releases and chart deps)
apply apply all resources from state file only when there are changes
sync
The helmfile sync sub-command syncs your cluster state as described in your helmfile ... Under the covers, Helmfile executes helm upgrade --install for each release declared in the manifest, optionally decrypting secrets to be consumed as helm chart values. It also updates specified chart repositories and updates the dependencies of any referenced local charts.
For Helm 2.9+ you can use a username and password to authenticate to a remote repository.
apply
The helmfile apply sub-command begins by executing diff. If diff finds that there are any changes, sync is executed. Adding --interactive instructs Helmfile to request your confirmation before sync.
An expected use-case of apply is to schedule it to run periodically, so that you can auto-fix skews between the desired and the current state of your apps running on Kubernetes clusters.
I went through the Helmfile repo README to figure out the difference between helmfile sync and helmfile apply. It seems that, unlike the apply command, the sync command doesn't do a diff and just helm-upgrades the hell out of all releases 😃. But from the word sync, you'd expect the command to apply only those releases that have been changed. There is also mention of the potential application of helmfile apply to periodically sync releases. Why not use helmfile sync for this purpose? Overall, the difference didn't become crystal clear, and I thought there could probably be more to it. So, I'm asking.
Consider the use case where you have a Jenkins job that gets triggered every 5 minutes and in that job you want to upgrade your helm chart, but only if there are changes.
If you use helmfile sync, which calls helm upgrade --install every five minutes, you will end up incrementing the chart revision every five minutes.
$ helm upgrade --install httpd bitnami/apache > /dev/null
$ helm list
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
httpd 1 Thu Feb 13 11:27:14 2020 DEPLOYED apache-7.3.5 2.4.41 default
$ helm upgrade --install httpd bitnami/apache > /dev/null
$ helm list
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
httpd 2 Thu Feb 13 11:28:39 2020 DEPLOYED apache-7.3.5 2.4.41 default
So each helmfile sync will result in a new revision. Now if you were to run helmfile apply instead, which first checks for diffs and only then (if any are found) calls helmfile sync, which in turn calls helm upgrade --install, this would not happen.
Everything in the other answers is correct. However, there is one additional important thing to understand with sync vs apply:
sync will call helm upgrade on all releases. Thus, a new Helm secret will be created for each release, and every release's revision will be incremented by one. Helm 3 uses a three-way strategic merge algorithm to compute patches, meaning it will revert manual changes made to resources in the live cluster handled by Helm;
apply will first use the helm-diff plugin to determine which releases changed, and only run helm upgrade on the changed ones. However, helm-diff only compares the previous Helm state to the new one; it doesn't take the state of the live cluster into account. This means that if you modified something manually, helmfile apply (or, more precisely, the helm-diff plugin) may not detect it and will leave it as-is.
Thus, if you always modify the live cluster through helm, helmfile apply is the way to go. However, if you can have manual changes and want to ensure the live state is coherent with what is defined in Helm/helmfile, then helmfile sync is necessary.
If you want more details, check out my blog post helmfile: difference between sync and apply (Helm 3).
In simple words, this is how both of them works:
helmfile sync calls helm upgrade --install
helmfile apply calls helm upgrade --install if and only if helm diff returns some changes.
So, generally, helmfile apply would be faster and I suggest using it most of the time.
But take into consideration that if someone has manually deleted any Deployments or ConfigMaps belonging to the chart, helm diff won't see any changes; hence helmfile apply won't do anything and those resources will stay deleted, while helmfile sync will recreate them, restoring the original chart configuration.
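To make the comparison concrete, here is a minimal helmfile.yaml sketch reusing the bitnami/apache release from the earlier example (field values are illustrative only):
repositories:
  - name: bitnami
    url: https://charts.bitnami.com/bitnami
releases:
  - name: httpd
    namespace: default
    chart: bitnami/apache
With that file in place, helmfile sync always runs helm upgrade --install httpd bitnami/apache and bumps the revision, while helmfile apply runs helm diff first and only upgrades when it reports changes.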
We have run into one significant issue that also has implications here.
Sometimes a sync or apply operation can fail due to:
A timeout with wait: true, e.g. the k8s cluster needs to add more nodes and the operation takes longer than expected (but eventually everything is deployed).
A temporary error in a postsync hook.
In these cases a simple retry of the pipeline's deployment job would solve the issue, but a successive helmfile apply will skip both the helm upgrade and the hook execution, even if the release is in a failed status.
So my conclusions are:
apply is usually faster but can lead to situations where manual interventions (outside the CI/CD logic) are required.
sync is more robust (CI/CD friendly) but usually slower.