I have a Rancher Kubernetes cluster running, and my application, which consists of several pods, is deployed as a Helm chart. When I want to update my application, I update the container image and redeploy the pod. This has worked well for the past 3 years. Suddenly, when I try to redeploy my frontend pod, I get the following error message from the Rancher GUI:
Deployment generation is 35, but latest observed generation is 34
I googled the error and "deployment generation", but it seems uncommon to have this problem. There are almost no results on Google, which makes me wonder... The pod is currently not being deployed.
Does anyone have a hint on why this suddenly happens and how to fix it?
Thanks in advance,
Ben
You just need to pause and then redeploy (resume) the deployment.
It was due to consecutive failures of the deployment trying to reach a running state, caused by internal errors of the app, such as the database being down or some other internal server error.
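A minimal sketch of that workaround, assuming the Deployment is named frontend and lives in the default namespace:

```
# Compare the spec generation with what the controller has observed
kubectl get deployment frontend \
  -o jsonpath='{.metadata.generation} {.status.observedGeneration}'

# Pause and then resume the rollout, as suggested above
kubectl rollout pause deployment/frontend
kubectl rollout resume deployment/frontend

# Watch the deployment roll out again
kubectl rollout status deployment/frontend
```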
I don't know the exact solution to this, but I had to reboot the whole cluster to get rid of this message and regain the ability to deploy. Seems to be something really odd.
I'm trying to keep the execution logs of containers in Kubernetes.
I added successfulJobsHistoryLimit: 5 and failedJobsHistoryLimit: 5 to my CronJob YAML in order to keep the execution history, but when I try to view the logs of the pods I get this error
I assume this is because the pods have been deleted, since when I go to a running pod I can see the logs.
So is there a way of keeping the logs in this part of Kubernetes, or is there something I have to set up in order to have this functionality?
Sorry if the question has been asked before, but I didn't really find anything, and I'm new to Kubernetes.
Thanks for the replies.
Looking at this problem from a bigger-picture perspective, it's generally a good idea to have your logs stored via logging agents or pushed directly into an external service, as per the official documentation.
Taking advantage of the Kubernetes logging architecture explained here, you can also try to fetch the logs directly from the log-rotation files on the node hosting the pods. Please note that this option might depend on the specific Kubernetes implementation, as log files might be deleted when pod eviction is triggered.
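As long as the finished Job pods are still retained by those history limits, their logs can also be read directly with kubectl; a minimal sketch, with the Job and pod names as placeholders:

```
# List the pods created by the CronJob's Jobs (completed pods are kept
# until the history limits evict them)
kubectl get pods -l job-name --sort-by=.metadata.creationTimestamp

# Read the logs of one retained (Completed) pod
kubectl logs <pod-name>

# Or read the logs of a Job by name
kubectl logs job/<job-name>
```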
We have multiple headless services running in our Azure AKS VMAS cluster. Sometimes (randomly), we have observed that CoreDNS fails to resolve the headless services, with the following error logs:
E0909 09:31:22.241120 1 runtime.go:73] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Please note that, while facing the above-mentioned issue, the non-headless services (services which have cluster IPs) get resolved properly without any hassle.
To resolve the issue in the dev/SVT environment, we terminate the coredns pod in the kube-system namespace, and everything starts working fine again for a brief period of time (1/2 days).
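Concretely, the workaround amounts to something like this (in AKS the CoreDNS pods typically carry the k8s-app=kube-dns label; the selector is an assumption, adjust it to your cluster):

```
# Delete the CoreDNS pods; the coredns Deployment in kube-system recreates them
kubectl -n kube-system delete pod -l k8s-app=kube-dns
```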
This deletion operation cannot be performed in the customer deployment scenario.
We raised a ticket with the AKS team, but since coredns is a third-party project, it doesn't come under Azure's support domain.
Has anyone faced this issue with coredns?
What is the permanent solution for this issue?
Maybe it will help someone: https://github.com/coredns/coredns/issues/4022
This is a known defect in CoreDNS. You need to upgrade CoreDNS inside AKS to a newer version with the fix applied (1.7.0).
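Since AKS manages CoreDNS as a cluster add-on, the usual way to pick up the fixed version is to check what you are currently running and then upgrade the cluster; a sketch, with the resource group and cluster name as placeholders:

```
# Check which CoreDNS image the cluster is currently running
kubectl -n kube-system get deployment coredns \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# See which cluster versions are available, then upgrade; the managed
# CoreDNS add-on is upgraded along with the cluster
az aks get-upgrades --resource-group <rg> --name <cluster> --output table
az aks upgrade --resource-group <rg> --name <cluster> --kubernetes-version <version>
```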
Due to a memory leak in one of our services I am planning to add a k8s CronJob to schedule a periodic restart of the leaking service. Right now we do not have the resources to look into the mem leak properly, so we need a temporary solution to quickly minimize the issues caused by the leak. It will be a rolling restart, as outlined here:
How to schedule pods restart
I have already tested this in our test cluster, and it seems to work as expected. The service has 2 replicas in test, and 3 in production.
My plan is to schedule the CronJob to run every 2 hours.
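Roughly, such a CronJob could look like the sketch below (the image, names, and the ServiceAccount with permission to restart the Deployment are illustrative assumptions):

```
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-leaking-service
spec:
  schedule: "0 */2 * * *"        # every 2 hours
  concurrencyPolicy: Forbid      # skip a run if the previous one is still in progress
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: deployment-restarter   # needs RBAC to get/patch the Deployment
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - rollout
                - restart
                - deployment/leaking-service
```

The ServiceAccount would need a Role allowing get and patch on deployments, bound with a RoleBinding.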
I am now wondering: How will the new CronJob behave if it should happen to execute while a service upgrade is already running? We do rolling upgrades to achieve zero downtime, and we sometimes roll out upgrades several times a day. I don't want to limit the people who deploy upgrades by saying "please ensure you never deploy near to 08:00, 10:00, 12:00 etc". That will never work in the long term.
And vice versa, I am also wondering what will happen if an upgrade is started while the CronJob is already running and the pods are restarting.
Does kubernetes have something built-in to handle this kind of conflict?
This answer to the linked question recommends using kubectl rollout restart from a CronJob pod. That command internally works by adding an annotation to the deployment's pod spec; since the pod spec is different, it triggers a new rolling upgrade of the deployment.
Say you're running an ordinary redeployment; that will change the image: setting in the pod spec. At about the same time, the kubectl rollout restart happens, which changes an annotation setting in the pod spec. The Kubernetes API forces these two changes to be serialized, so the final deployment object will always have both changes in it.
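You can see the annotation that kubectl rollout restart adds by inspecting the pod template afterwards (the deployment name and timestamp here are placeholders):

```
kubectl rollout restart deployment/my-service

# The pod template now carries a restartedAt annotation, e.g.
# kubectl.kubernetes.io/restartedAt: "2024-01-01T08:00:00Z"
kubectl get deployment my-service \
  -o jsonpath='{.spec.template.metadata.annotations}'
```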
This question then reduces to "what happens if a deployment changes and needs to trigger a redeployment, while a redeployment is already running?" The Deployment documentation covers this case: it will start deploying new pods on the newest version of the pod spec and treat all older ones as "old", so a pod with the intermediate state might only exist for a couple of minutes before getting replaced.
In short: this should work consistently and you shouldn't need to take any special precautions.
I am new to Kubernetes and have been working with it for the past month.
When setting up the cluster, I sometimes see that Heapster gets stuck in ContainerCreating or Pending status. When this happens, the only way I have found to fix it is to re-install everything from scratch, which has solved the problem; afterwards, Heapster runs without any issue. But I don't think this is the optimal solution every time, so please help me figure out how to solve this issue when it occurs again.
The Heapster image is pulled from GitHub for our use. Right now the cluster is running fine, so I cannot send a screenshot of Heapster failing while stuck in ContainerCreating or Pending status.
Please suggest an alternative way to solve the problem if it occurs again.
Thanks in advance for your time.
A pod stuck in Pending state can mean more than one thing. Next time it happens, you should run 'kubectl get pods' and then 'kubectl describe pod <pod-name>'. However, since it sometimes works, the most likely cause is that the cluster doesn't have enough resources on any of its nodes to schedule the pod. If the cluster is low on remaining resources, you should get an indication of this from 'kubectl top nodes' and 'kubectl describe nodes'. (Or with GKE, if you are on Google Cloud, you often get a low-resource warning in the web UI console.)
(Or, if you are on Azure, be wary of https://github.com/Azure/ACS/issues/29.)
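Putting that together, a quick diagnostic pass could look like this (assuming Heapster runs in kube-system, which is the usual place):

```
# Find the stuck pod and see why it cannot be scheduled or started
kubectl get pods -n kube-system
kubectl describe pod <heapster-pod-name> -n kube-system

# Check whether the nodes are simply out of allocatable resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
```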
We have an odd issue happening with GCE.
We have 2 clusters, dev and prod, each consisting of 2 nodes.
Production nodes are n1-standard-2; dev nodes are n1-standard-1.
Typically the dev cluster is busier, with more pods eating more resources.
We deploy updates mostly with Deployments (a few projects still recreate RCs to update to the latest versions).
Normally, the process is: build the project, build the Docker image, docker push, create a new deployment config, and kubectl apply the new config.
What constantly happens in production is that after applying a new config, one or both nodes restart. The cluster does not seem to be starved of memory/CPU, and we could not find anything in the logs that would explain those restarts.
The same procedure on staging never causes nodes to restart.
What can we do to diagnose the issue? Any specific events or logs we should be looking at?
Many thanks for any pointers.
UPDATE:
This is still happening, and I found the following in Compute Engine - Operations:
repair-1481931126173-543cefa5b6d48-9b052332-dfbf44a1
Operation type: compute.instances.repair.recreateInstance
Status message : Instance Group Manager 'projects/.../zones/europe-west1-c/instanceGroupManagers/gke-...' initiated recreateInstance on instance 'projects/.../zones/europe-west1-c/instances/...'. Reason: instance's intent is RUNNING but instance's health status is TIMEOUT.
We still can't figure out why this is happening and it's having a negative effect on our production environment every time we deploy our code.
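For anyone digging into something similar, checks along these lines can help correlate the Kubernetes view of node health with the instance group manager's repair decisions (the zone and names are placeholders taken from the operation above):

```
# Kubernetes view: node conditions and recent events around the restarts
kubectl describe nodes | grep -A 10 "Conditions:"
kubectl get events --sort-by=.metadata.creationTimestamp

# GCE view: repair operations issued by the instance group manager
gcloud compute operations list \
  --zones=europe-west1-c \
  --filter="operationType=compute.instances.repair.recreateInstance"
```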