Helm incorrectly shows upgrade failed status - kubernetes

When using helm install/upgrade, some percentage of the time I get this failure:
Failed to install app MyApp. Error: UPGRADE FAILED: timed out waiting for the condition
This is because the app sometimes needs a bit more time to be up and running.
When I get this message, Helm doesn't stop the install/upgrade; it keeps working on it, and it succeeds in the end, leaving my whole cluster fully functional.
However, Helm still shows this failed status for the release. On one hand it is pretty annoying, and on the other hand it can mess up a correctly installed release.
How can I remove this false error and get the release into a 'deployed' state (without a new install/upgrade)?

What you might find useful here are the following two options:
--wait: Waits until all Pods are in a ready state, PVCs are bound, Deployments have the minimum (Desired minus maxUnavailable) Pods in a ready state and Services have an IP address (and Ingress, if a LoadBalancer) before marking the release as successful. It will wait for as long as the --timeout value. If the timeout is reached, the release will be marked as FAILED. Note: In scenarios where the Deployment has replicas set to 1 and maxUnavailable is not set to 0 as part of the rolling update strategy, --wait will return as ready as soon as it has satisfied the minimum number of Pods in a ready condition.
--timeout: A value in seconds to wait for Kubernetes commands to complete. This defaults to 5m0s.
Helm install and upgrade commands include two CLI options to assist in checking the deployments: --wait and --timeout. When using --wait, Helm will wait until a minimum expected number of Pods in the deployment are launched before marking the release as successful, and it will wait as long as the --timeout value allows.
Also, please note that this is not a full list of CLI flags. To see a description of all flags, just run helm <command> --help.
If you want to check why your chart might have failed, you can use the helm history command.
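For example (the release and chart names here are just placeholders), you could raise the timeout so Helm keeps waiting until the app is actually ready, and then check the recorded status of each revision:
helm upgrade --install --wait --timeout 10m myapp ./myapp-chart   # wait up to 10 minutes instead of the default 5m0s
helm history myapp                                                # lists each revision and whether it ended up deployed or failed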

Related

Dynamic Deployment of pods as per some conditions || Deploying pod only when the process inside other pod has finished [duplicate]

This question already has answers here:
What is the equivalent for depends_on in kubernetes (6 answers)
How can we create service dependencies using kubernetes (3 answers)
Helm install, Kubernetes - how to wait for the pods to be ready? (3 answers)
Closed 1 year ago.
Need some help implementing better Kubernetes resource deployment.
Essentially, we are trying to define every resource in a single values.yaml file. When you install the chart, all resources are created in parallel. Among these I have 2 components, let's say component1 and component2.
For component1, its main function is to install some dars onto the server machine. This takes between 45 minutes and an hour.
Component2 is dependent on some dars that will have been installed onto the server by component1.
The problem is that when you deploy the Helm chart, every pod is created at the same time. Even though the status of a component2 pod will be Running, when you inspect the container logs they tell you that process startup failed, due to some missing classes (which would have been installed by component1).
I am looking for a way to either introduce some delay until component1 is done, or keep destroying and recreating the resources for component2 until component1 is done. The delay would be based on whether all dars are installed on the server machine.
For restarting all resources for component2, I was thinking about creating a third pod, a maintenance pod, which would keep watching both component1 and component2 and keep restarting resource creation for component2 until component1 is done.
Readiness and liveness probes will not work here because even though service startup has failed, the pod status will be Running.
Any tips or suggestions on how to implement this, or a better way to handle it, would greatly help.
You can try adding the flag --wait or --wait-for-jobs according to your use case, as Helm will then wait until a minimum expected number of Pods in the deployment are launched before marking the release as successful. Helm will wait as long as the value set with --timeout. Please refer to the detailed description of the --wait flag at https://helm.sh/docs/helm/helm_upgrade/#options
helm upgrade --install --wait --timeout 20m demo demo
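Note that the timeout has to cover the slowest component. Since component1 takes 45-60 minutes, something along these lines may be closer to what you need (the names are placeholders, and this assumes component1 is packaged as a Job so that --wait-for-jobs applies):
helm upgrade --install --wait --wait-for-jobs --timeout 90m demo ./demo-chart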

helm rollback fails to identify the failed deployments when re-triggered

I have a scenario like below,
Have two releases - Release-A and Release-B.
Currently, I am on Release-A and need an upgrade of all the microservices to Release-B.
I tried performing the helm upgrade of microservice - "mymicroservice" with the below command to deliver Release-B.
helm --kubeconfig /home/config upgrade --namespace testing --install --wait mymicroservice mymicroservice-release-b.tgz
Because of some issue, the Deployment object failed to install and went into an error state.
Observing this, I perform the below rollback command.
helm --kubeconfig /home/config --namespace testing rollback mymicroservice
Due to some issue (maybe an intermittent system failure or user behavior), Release-A's Deployment object also went into a failed/CrashLoopBackOff state. Although this results in helm rollback reporting success, the Deployment object still never enters the running state.
Once I have made the necessary corrections, I retry the rollback. As the deployment spec has already been updated by Helm, it never attempts to re-install the Deployment objects even if they are in a failed state.
Is there any option in Helm to handle the above scenarios?
I tried the --force flag, but there are other errors related to replacing the Service object in the microservice when using the --force approach:
Rollback "mymicroservice -monitoring" failed: failed to replace object: Service "mymicroservice-monitoring" is invalid: spec.clusterIP: Invalid value: "": field is immutable
Maybe this can help you out:
Always use the helm upgrade --install command. I see you're already doing that, so you're doing well. This installs the charts if they're not present and upgrades them if they are present.
Use the --atomic flag to roll back changes in the event of a failed operation during helm upgrade.
And the --cleanup-on-fail flag: it allows Helm to delete newly created resources during a rollback in case the rollback fails.
From doc:
--atomic: if set, upgrade process rolls back changes made in case of failed upgrade. The --wait flag will be set automatically if --atomic is used
--cleanup-on-fail allow deletion of new resources created in this upgrade when upgrade fails
There are cases where an upgrade creates a resource that was not present in the last release. Setting this flag allows Helm to remove those new resources if the release fails. The default is to not remove them (Helm tends to avoid destruction-as-default, and give users explicit control over this)
https://helm.sh/docs/helm/helm_upgrade/
IIRC, helm rollback rolls back to the previous revision, whether it is good or not, so if your previous attempt resulted in a failure and you try to roll back, you will roll back to a broken version.
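As a rough sketch of how these pieces fit together (the revision number below is only an example, not taken from your environment): run the upgrade atomically so a failure rolls itself back, and if the previous revision is itself broken, roll back to a known-good revision explicitly instead of relying on the default target:
helm --kubeconfig /home/config upgrade --namespace testing --install --wait --atomic --cleanup-on-fail mymicroservice mymicroservice-release-b.tgz
helm --kubeconfig /home/config --namespace testing history mymicroservice      # find a revision whose status is "deployed"
helm --kubeconfig /home/config --namespace testing rollback mymicroservice 1   # roll back to that specific revision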

kubectl wait sometimes timed out unexpectedly

I just added kubectl wait --for=condition=ready pod -l app=appname --timeout=30s as the last step of a Bitbucket Pipeline to report any deployment failure if the new pod somehow produces an error.
I've realized that the wait isn't really consistent. Sometimes it times out even when the new pod from the new image doesn't produce any error and turns to the ready state.
I always change deployment.yaml or push a newer image to test this; the result is inconsistent.
BTW, I believe kubectl rollout status isn't suitable, I think because it just returns once the deployment is done without waiting for the pods to be ready.
Note that there is not much difference if I change the timeout from 30s to 5m, since apply or rollout restart is quite instant.
kubectl version: 1.17
AWS EKS: latest 1.16
I'm placing this answer for better visibility; as noted in the comments, this indeed solves some problems with the kubectl wait behavior.
I managed to replicate the issue and got some timeouts when my client version was older than the server version. You have to match your client version with the server's in order for kubectl wait to work properly.
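A quick sanity check before relying on the wait (the label and timeout are the ones from the question):
kubectl version   # compare the Client Version and Server Version lines; they should match
kubectl wait --for=condition=ready pod -l app=appname --timeout=30s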

Is there any kubectl command to poll until all the pod roll to new code?

I am building a deploy pipeline. I need a kubectl command that would tell me that the rollout has completed for all the pods, so that I can deploy to the next stage.
The Deployment documentation suggests kubectl rollout status, which among other things will return a non-zero exit code if the deployment isn't complete. kubectl get deployment will print out similar information (how many replicas are expected, available, and up-to-date), and you can add a -w option to watch it.
For this purpose you can also consider using one of the Kubernetes APIs. You can "get" or "watch" the deployment object, and get back something matching the structure of a Deployment object. Using that you can again monitor the replica count, or the embedded condition list, and decide if it's ready or not. If you're using the "watch" API you'll continue to get updates as the object status changes.
The one trick here is detecting failed deployments. Say you're deploying a pod that depends on a database; usual practice is to configure the pod with the hostname you expect the database to have, and just crash (and get restarted) if it's not there yet. You can briefly wind up in CrashLoopBackOff state when this happens. If your application or deployment is totally wrong, of course, you'll also wind up in CrashLoopBackOff state, and your deployment will stop progressing. There's not an easy way to tell these two cases apart; consider an absolute timeout.
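For a pipeline step, a minimal sketch (the Deployment name is hypothetical) that combines both suggestions, waiting on the rollout but with an absolute timeout, could be:
kubectl rollout status deployment/myapp --timeout=5m   # exits non-zero if the rollout has not completed within 5 minutes, failing the stage instead of hanging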

helm test failure: timed out waiting for the condition

We have a simple release test for a Redis chart. After running helm test myReleaseName --tls --cleanup, we got
RUNNING: myReleaseName-redis
ERROR: timed out waiting for the condition
There are several issues in the GitHub repository at https://github.com/helm/helm/search?q=timed+out+waiting+for+the+condition&type=Issues but I did not find a solution there.
What's going on here?
At first this looks puzzling and shows little information, because --cleanup kills the test pods after running. One can remove it to get more information. I therefore reran the test with
helm test myReleaseName --tls --debug
Then use kubectl get pods to examine the pod used for testing. (It could have a different name.)
NAME                  READY   STATUS             RESTARTS   AGE
myReleaseName-redis   0/1     ImagePullBackOff   0          12h
From here, it is clearer that there is something wrong with the image, and it turned out that the link used to pull the image was not correct. (Use kubectl describe pod <pod-name> and you can find the link you used to pull the image.)
Fix the link, and it worked.
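For reference, a quick way to print exactly which image reference the pod is trying to pull (the pod name is the one from the output above):
kubectl get pod myReleaseName-redis -o jsonpath='{.spec.containers[*].image}'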
For me, Helm couldn't pull the image as it was in a private repo.
kubectl get events helped me get the logs.
9m38s Warning Failed pod/airflow-scheduler-bbd8696bf-5mfg7 Failed to pull image
After authenticating, the helm install command worked.
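If the image lives in a private registry, one common way to authenticate is to create a pull secret and reference it from the chart's imagePullSecrets value, if the chart exposes one (all values below are placeholders):
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=myuser \
  --docker-password=mypassword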
REF: https://github.com/helm/charts/issues/11904
If helm test <ReleaseName> --debug shows that the installation completed successfully but the deployment failed, it may be because the deployment takes more than 300 seconds.
Helm will wait as long as the value set with --timeout. By default, the timeout is set to 5 minutes; sometimes, for various reasons, helm install may take extra time to deploy, so increase the timeout value and validate the installation.
helm install <ReleaseName> <chart> --debug --wait --timeout 30m