What is stored in a kubernetes job and how do I check resource use of old job(s)?

This morning I learned about the (unfortunate) default in Kubernetes of retaining all of a cronjob's previously run job instances in the cluster. Mea culpa for not reading that detail in the documentation. I also noticed that deleting jobs (kubectl delete job <foo>, or kubectl delete job --all) takes quite a long time. Further, even a reasonably provisioned Kubernetes cluster with three large nodes appeared to fail (timeouts of all sorts when trying to use kubectl) with just ~750 such old jobs in the system, plus some other active containers that otherwise had not entailed heavy load. [Correction: ~7k pods associated with those old jobs were also retained :-o] (I did learn about the configuration settings to limit or avoid storing old jobs from cronjobs, so this won't be a problem [for me] in the future.)
So, since I couldn't find documentation for kubernetes about this, my (related) questions are:
what exactly is stored when kubernetes retains old jobs? (Presumably it's the associated pod's logs and some metadata, but this doesn't explain why they seemed to place such a load on the cluster.)
is there a way to see the resources (disk only, I assume, but maybe there is some other resource) that individual or collective old jobs are using?
why does deleting a kubernetes job take on the order of a minute?

I don't know whether k8s reports how much disk space each job is consuming, but here is something you can try.
Try to find the pods associated with the job:
kubectl get pods --selector=job-name=<job name> --output=jsonpath='{.items..metadata.name}'
Once you know the pod then find the docker container associated with it:
kubectl describe pod <pod name>
In the output, look for the Node and Container ID fields. Then log in to that node and go to /var/lib/docker/containers/<container id found above> (this path applies when the node runs Docker); there you can investigate what is wrong.
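The two lookups above can also be done in one query. This is only a minimal sketch: it assumes the pods carry the job-name label used in the selector above, and job_pod_containers is a made-up helper name, not a kubectl feature.

```shell
# List each pod of a job with its node and container IDs, one pod per line.
# job_pod_containers is a hypothetical helper name, not part of kubectl.
job_pod_containers() {
  job="$1"
  kubectl get pods --selector="job-name=${job}" \
    --output=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\t"}{.status.containerStatuses[*].containerID}{"\n"}{end}'
}

# Usage: job_pod_containers <job name>
```

Container IDs are reported with a runtime prefix such as docker://, which is not part of the directory name on the node.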

Related

Kubernetes - keeping the execution logs of a pod

I'm trying to keep the execution logs of containers in Kubernetes.
I added successfulJobsHistoryLimit: 5 and failedJobsHistoryLimit: 5 to my cronjob YAML in order to see the execution history, but when I try to view the logs of the pods I get an error.
I assume this is because the pods have been deleted, since I can see the logs when I go to a running pod.
So is there a way of keeping the logs in this part of Kubernetes, or is there something that I have to set up in order to have this functionality?
Sorry if this question has been asked before, but I didn't really find anything and I'm new to Kubernetes.
Thanks for the replies.
Looking at this problem in the bigger picture, it's generally a good idea to have your logs collected by logging agents or pushed directly to an external service, as per the official documentation.
Taking advantage of the Kubernetes logging architecture, you can also try to fetch the logs directly from the rotated log files on the node hosting the pods. Please note that this option might depend on the specific Kubernetes implementation, as log files might be deleted when pod eviction is triggered.
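As a rough illustration of fetching logs from the node, the sketch below lists the container log files that belong to one pod. It assumes the common kubelet layout in which /var/log/containers holds files named <pod>_<namespace>_<container>-<container id>.log; the function name and the optional directory argument are hypothetical, and the layout may differ on your distribution.

```shell
# Print the log files on this node that belong to one pod.
# Assumes files are named <pod>_<namespace>_<container>-<id>.log (layout may vary).
pod_log_files() {
  pod="$1"
  namespace="$2"
  log_dir="${3:-/var/log/containers}"   # overridable, e.g. for testing
  for f in "${log_dir}/${pod}_${namespace}_"*.log; do
    # An unmatched glob stays literal, so skip anything that does not exist.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Usage (on the node): pod_log_files <pod name> <namespace>
```

If the files are already gone, the function prints nothing, which is the case the warning above describes.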

Kubernetes Deployment Terminate Oldest Pod

I'm using Azure Kubernetes Service to run a Go application that pulls from RabbitMQ, runs some processing, and returns it. The pods scale to handle an increase of jobs. Pretty run-of-the-mill stuff.
The HPA is set up like this:
NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
production   Deployment/production   79%/80%   2         10        10         4d11h
staging      Deployment/staging      2%/80%    1         2         1          4d11h
What happens is that as the HPA scales up and down, there are always 2 pods that stay running. We've found that after running for long enough, the Go app on those pods will time out. Sometimes that's days, sometimes it's weeks. Yes, we could probably dig into the code and figure this out, but it's kind of a low priority for that team.
Another solution I've thought of would be to have the HPA remove the oldest pods first. This would mean that the oldest pod would never be more than a few hours old. A first-in, first-out model.
However, I don't see any clear way to do that. It's entirely possible that there isn't one, but it seems like something that could work.
Am I missing something? Is there a way to make this work?
In my opinion (as I also mentioned in a comment), the simplest way (not sure about elegance) is to have a cronjob that periodically cleans up timed-out pods.
One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format. CronJobs are useful for creating periodic and recurring tasks, like running backups or sending emails. CronJobs can also schedule individual tasks for a specific time, such as scheduling a Job for when your cluster is likely to be idle.
Cronjobs examples and howto:
How To Use Kubernetes’ Job and CronJob
Kubernetes: Delete pods older than X days
https://github.com/dignajar/clean-pods <-- real example
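For orientation, here is a minimal sketch of the kind of script such a cronjob could run. It is not taken from the linked projects: the function names, the label selector, and the age threshold are all assumptions, and when run inside the cluster it would need a service account that is allowed to list and delete pods.

```shell
# Print names of pods created before a cutoff time.
# Reads "name creationTimestamp" lines on stdin. Kubernetes timestamps are
# UTC in the form 2024-01-31T12:00:00Z, so string order equals time order.
pods_created_before() {
  cutoff="$1"
  awk -v cutoff="$cutoff" '($2 "") < (cutoff "") { print $1 }'
}

# Delete pods matching a label selector that are older than max_age seconds.
# delete_old_pods and its arguments are hypothetical; `date -d` needs GNU date.
delete_old_pods() {
  selector="$1"
  max_age="$2"
  cutoff="$(date -u -d "@$(( $(date +%s) - max_age ))" +%Y-%m-%dT%H:%M:%SZ)"
  kubectl get pods --selector="$selector" \
    --output=jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' \
    | pods_created_before "$cutoff" \
    | while read -r name; do
        kubectl delete pod "$name"
      done
}

# Usage: delete_old_pods app=production 86400   # pods older than one day
```

Because the pods belong to a Deployment, each deleted pod is replaced by a fresh one, so the replica count recovers on its own.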

How can a file inside a pod be copied to the outside?

I have an audit pod, which has logic to generate a report file. Currently, this file is present in the pod itself. I have only one pod having only one replica.
I know I can run kubectl cp to copy those files out of my pod. However, that command has to be executed on the Kubernetes node, and due to many restrictions the copy has to be done from within the pod itself.
I cannot use a Persistent Volume due to restrictions. I checked the Kubernetes API, but couldn't find anything by which I can do a copy.
Is there another way to copy that file out of the pod?
This is a community wiki answer posted to sum up the whole scenario and for better visibility. Feel free to edit and expand on it.
Taking into consideration all the mentioned restrictions:
not supposed to use the Kubernetes volumes
no cloud storage
pod names not accessible to your user
no sidecar containers
the only workaround for your use case is the one you currently use:
the dynamic PV with the annotation "helm.sh/resource-policy": keep
use PVCs and explicitly tell users not to delete the namespace
If anyone has a better idea, feel free to contribute.

Kubernetes rolling deploy: terminate a pod only when there are no containers running

I am trying to deploy updates to pods. However I want the current pods to terminate only when all the containers inside the pod have terminated and their process is complete.
The new pods can keep waiting to start until all containers in the old pods have completed. We have a mechanism to stop old pods from picking up new tasks, and therefore they should eventually terminate.
It's okay if twice as many pods exist at some point in time. I tried finding a solution for this in the Kubernetes docs but wasn't successful. Pointers on how, or whether, this is possible would be helpful.
Well, I guess you may have to create a duplicate deployment with the new image and change the selector in the service to point at the new deployment. This prevents external traffic from reaching the pre-existing pods, while new calls go to the new pods. Later you can check the load with something like:
kubectl top pods --containers
and if the load appears to be static and low, you can then delete the old deployment and its pods.
With this approach, however, the service selector has to be updated every time. To keep track of things, you can append the git commit hash to the service selector so that it is unique each time.
Rolling back to a previous version from inside the Kubernetes cluster will be difficult, so it is preferable to trigger the wanted build again.
I hope this makes some sense!
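The selector switch described above could look roughly like the following. This is a sketch under assumptions: the label key version, the helper names, and the idea that the new deployment's pods carry the commit hash as that label's value are all illustrative, not something the cluster provides by default.

```shell
# Build a patch that points a Service at pods labelled with a given version.
# The label key "version" is an assumption made for this example.
selector_patch() {
  version="$1"
  printf '{"spec":{"selector":{"version":"%s"}}}' "$version"
}

# Point the Service at the deployment built from one commit.
# switch_service is a hypothetical helper name.
switch_service() {
  service="$1"
  version="$2"
  kubectl patch service "$service" --patch "$(selector_patch "$version")"
}

# Usage: switch_service <service name> <git commit hash>
```

The patch only sets the one selector key, so any other keys already in the service selector are left as they are.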

Accidentally deleted Kubernetes namespace

I have a Kubernetes cluster on Google Cloud. I accidentally deleted a namespace which had a few pods running in it. Luckily, the pods are still running, but the namespace is in the Terminating state.
Is there a way to restore it back to active state? If not, what would the fate of my pods running in this namespace be?
Thanks
A few interesting articles about backing up and restoring Kubernetes cluster using various tools:
https://medium.com/@pmvk/kubernetes-backups-and-recovery-efc33180e89d
https://blog.kubernauts.io/backup-and-restore-of-kubernetes-applications-using-heptios-velero-with-restic-and-rook-ceph-as-2e8df15b1487
https://www.digitalocean.com/community/tutorials/how-to-back-up-and-restore-a-kubernetes-cluster-on-digitalocean-using-heptio-ark
https://www.revolgy.com/blog/kubernetes-in-production-snapshotting-cluster-state
I guess they will be more useful in the future than in your current situation. If you don't have any backup, unfortunately there isn't much you can do.
Please note that all of those articles use namespace deletion to simulate a disaster scenario, so you can imagine the consequences of such an operation. The results may not be visible immediately, and you may see your pods running for some time, but eventually namespace deletion removes all cluster resources in the given namespace, including LoadBalancers and PersistentVolumes. It may take some time. Some resources may not be deleted while they are still used by another resource (e.g. a PersistentVolume used by a running Pod).
You can try running this script to dump all your resources that are still available to YAML files; however, some modification may be needed, as you will not be able to list objects belonging to the deleted namespace anymore. You may need to add the --all-namespaces flag to list them.
You may also try to dump manually any resource which is still available. If you can still see some resources such as Pods, Deployments etc. and can run kubectl get on them, you may try to save their definitions to a YAML file:
kubectl get deployment nginx-deployment -o yaml > deployment_backup.yaml
Once you have your resources backed up you should be able to recreate your cluster more easily.
Back up most resource configuration regularly:
kubectl get all --all-namespaces -o yaml > all-deploy-resources.yaml
but this does not include all resources.
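Since kubectl get all covers only a subset of resource types, one way to widen the dump is to loop over every type the API server reports as listable. A minimal sketch, with dump_namespace and the output directory as hypothetical names:

```shell
# Dump every listable namespaced resource type in one namespace to YAML files.
# dump_namespace and the output directory argument are hypothetical.
dump_namespace() {
  namespace="$1"
  out_dir="$2"
  mkdir -p "$out_dir"
  kubectl api-resources --verbs=list --namespaced -o name | while read -r kind; do
    # Some types may refuse listing; report the failure and keep going.
    kubectl get "$kind" --namespace "$namespace" -o yaml > "${out_dir}/${kind}.yaml" \
      || echo "could not dump ${kind}" >&2
  done
}

# Usage: dump_namespace <namespace> ./backup
```

This writes one file per resource type, including types with no objects in the namespace, and it saves definitions only, not the data stored in volumes.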
Another way is ark/velero:
https://github.com/vmware-tanzu/velero (Backup and migrate Kubernetes applications and their persistent volumes https://velero.io)