Pod is still running after I delete the parent job - kubernetes

I created a job in my Kubernetes cluster, but the job takes a long time to finish, so I decided to cancel it and deleted the job. I noticed the associated pod is NOT automatically deleted. Is this the expected behavior? Why is it not consistent with deployment deletion? Is there a way to have the pod deleted automatically?

If you're deleting a deployment, chances are you don't want any of the underlying pods, so deletion most likely cascades to the pods by default. Besides, once the deployment is gone, the desired state of those pods is unknown.
On the other hand, if you're deleting a pod directly, Kubernetes doesn't know what kind of replication controller may be attached to it or what that controller will do next, so it signals a shutdown to the container so that it can clean up gracefully. There may be processes still using the pod, such as in-flight web requests, and it would be unfortunate to kill them if they only need a second to complete. This is what happens when you scale up your pods or roll out a new deployment and don't want users to experience any downtime. It is in fact one of the benefits of Kubernetes over a traditional application server, which requires you to shut the system down to upgrade (or to play with load balancers to redirect traffic), which can negatively affect users.
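If the goal is simply to make sure the pods go away together with the job, cascading deletion can be requested explicitly. A minimal sketch, assuming a job named my-job (the name is a placeholder; the job-name label is the one the Job controller normally puts on its pods):

    # Delete the job and wait for its pods to be removed as well
    # (foreground cascading; older kubectl versions use --cascade=true).
    kubectl delete job my-job --cascade=foreground

    # If pods were left behind (e.g. the job was deleted with --cascade=orphan),
    # they can be cleaned up via the job-name label the job controller sets:
    kubectl delete pods -l job-name=my-job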

Related

Is it okay to change the pod eviction timeout? (k8s, OpenShift)

I want to know about the pod eviction timeout. I've already read the k8s and OpenShift manuals and some blog posts,
but I couldn't find an article on the impact of reducing pod-eviction-timeout (default: 5m).
I assume there is a reason the default value is 5 minutes, but I can't find it.
Can you tell me how it will affect the k8s cluster if I change this setting?
(e.g. change pod-eviction-timeout to 2 minutes or less)
For reference: we have an OpenShift (OKD) cluster and it runs many services.
Whether the 5m timeout is a valid choice or not depends on your services and your infrastructure.
There are multiple reasons for a pod to be evicted, such as node pressure, scheduling decisions driven by resource limits, priorityClasses, taints/tolerations, etc. Basically, pods are evicted on some kind of failure or on some kind of scheduling event, which can also be initiated by a user.
If you reduce the timeout, Kubernetes will not wait as long before forcefully killing the processes during an eviction. That can lead to unwanted behaviour with stateful services, because a pod may not have enough time to shut down gracefully, and the attached volume may not be available in time when the pod is scheduled on another node again. With stateless services everything is easier, so there won't be such problems.
In short: if you are running stateless services, this should not lead to any problems. If you have stateful services, it may cause problems, but that cannot be answered in general. You have to test it and see what happens, because you (and your team) know your services best.
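On clusters where taint-based evictions are enabled (the default on recent Kubernetes/OKD versions), the same behaviour can also be tuned per pod instead of cluster-wide, via tolerationSeconds. A rough sketch, with placeholder pod name and image:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app                           # placeholder name
    spec:
      containers:
      - name: app
        image: registry.example.com/app:latest    # placeholder image
      tolerations:
      # Evict this pod 120s after its node becomes NotReady or unreachable,
      # instead of waiting for the cluster-wide default of ~5 minutes.
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 120
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 120

Trying such a change on a non-critical service first, as suggested above, is probably the safest way to see how your stateful workloads react.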

How to tell Kubernetes to not reschedule a pod unless it dies?

Kubernetes tends to assume apps are small/lightweight/stateless microservices which can be stopped on one node and restarted on another node with no downtime.
We have a slow-starting (20 min) legacy (stateful) application which, once running as a set of pods, should not be rescheduled without due cause. The reason is that all user sessions would be killed and users would have to log in again. There is NO way to serialize the sessions and externalize them. We want 3 instances of the pod.
Can we tell k8s not to move a pod unless absolutely necessary (i.e. it dies)?
Additional information:
The app is a tomcat/java monolith
Assume for the sake of argument we would like to run it in Kubernetes
We do have a liveness test endpoint available
There is no benefit in telling k8s to use only one pod. That is not the "spirit" of k8s. In that case, it might be better to use a dedicated machine for your app.
But you can assign a pod to a special node - see Assigning Pods to Nodes. That should only be necessary when there are special hardware requirements (e.g. an AI microservice needs a GPU, which is only on node xy).
k8s doesn't restart your pods for fun. It will restart a pod when there is a reason (the node died, the app died, ...), and I have never noticed a "random reschedule" in a cluster. It is hard to say, without any further information (like the deployment, logs, cluster), what exactly happened in your case.
And regarding your comment: there are different types of recreation; one of them starts a fresh instance and kills the old one only once the startup was successful. See: Kubernetes deployment strategies.
All points together:
Don't force your app onto a specific node - k8s will select the node "smartly".
There are normally no planned reschedules in k8s.
k8s will recreate pods only if there is a reason. Maybe your app didn't answer on the liveness endpoint? Or someone/something deleted your pod?
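Since a liveness endpoint is available, one practical thing to get right for a 20-minute startup is making sure the probes themselves don't trigger restarts while the app is still booting. A minimal sketch, assuming a /health endpoint on port 8080 (names, image and path are placeholders):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: legacy-monolith                        # placeholder
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: legacy-monolith
      template:
        metadata:
          labels:
            app: legacy-monolith
        spec:
          containers:
          - name: tomcat
            image: registry.example.com/legacy:1.0 # placeholder
            ports:
            - containerPort: 8080
            # Give the app its full startup window before liveness checks begin,
            # so Kubernetes doesn't kill it while it is still booting.
            startupProbe:
              httpGet:
                path: /health                      # assumed health-check path
                port: 8080
              periodSeconds: 30
              failureThreshold: 50                 # 50 * 30s = up to 25 min to start
            livenessProbe:
              httpGet:
                path: /health
                port: 8080
              periodSeconds: 10
              failureThreshold: 3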

Is it ever safe to force-delete a kubernetes pod?

https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods
This section of the Kubernetes documentation points out that "Force deletions can be potentially dangerous for some pods", but doesn't really go into detail on the dangers.
I understand that force-deleting a pod immediately "deregisters" the pod from the API before the kubelet confirms that the underlying containers have actually been terminated, which could leave orphaned containers running if the kubelet fails to delete them. However, I don't know how to tell whether a pod is "dangerous" to force-delete before I do so, or if there is even a way to predict this.
Are there any guidelines on safely force-deleting a pod? Or is this just an inherently unsafe operation?
It really depends on the point of view.
From the point of view of the K8s control plane and etcd, which keeps the cluster state, it is safe: the entry is deleted from etcd.
However, the kubelet on the node still has to actually kill the pod's containers, and sometimes it might not be able to (most of the time it can).
A reason it might not be able to kill the pod is that something like Docker or your container runtime isn't responding, or a Linux system resource is not being released, which could be anything from a deadlock to a hardware failure.
So most of the time it's safe, but there may be a few specific cases where it's not, due to the nature of your application and the state of your system.
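For reference, a force deletion is typically issued like the sketch below (the pod name is a placeholder). Checking the node afterwards for leftover containers is a reasonable precaution, since the API object disappears regardless of what the kubelet manages to clean up.

    # Normal deletion: the containers get their termination grace period.
    kubectl delete pod my-pod

    # Force deletion: removes the API object immediately, without waiting for
    # the kubelet to confirm that the containers have actually stopped.
    kubectl delete pod my-pod --grace-period=0 --force

    # Afterwards, check the node for leftover containers with the runtime's
    # own tooling, e.g. `crictl ps` (or `docker ps` on Docker-based nodes).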

Openshift: trigger pods restart sequentially

My application loads data during startup, so I need to restart the application to change the data.
The data is loaded from an Oracle schema and can be changed by another application.
If the data changes, the application becomes only partially functional and needs to be restarted.
Requirement: the restart should be done automatically and without downtime (an old pod should be killed only when a new one passes the readiness check).
How can this requirement be fulfilled?
Notes:
I would really like to use a liveness probe that checks a health-check URL. Issue: AFAIK the liveness probe kills a pod as soon as the check fails, so all pods would be killed simultaneously, which leads to downtime during startup.
The desired behavior can be reached with a rolling deployment. However, I don't want to perform it manually.
For simplicity, I don't want to implement loading data during pod operation: the app can load data only during startup. If a pod's state is not fully functional, it is killed and recreated.
Two ways I can think of (see the sketch below):
- Use StatefulSets: the pods will be restarted in order and killed in reverse order.
- Use a Deployment with spec.strategy.type = RollingUpdate and tune the .spec.strategy.rollingUpdate.maxUnavailable / maxSurge parameters so that pods are replaced one at a time rather than all at once.
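As a sketch of the second option (names, image and health-check path are placeholders, not taken from the question), a Deployment like the following replaces pods one at a time and only kills an old pod once its replacement passes the readiness check:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: data-loader-app                        # placeholder
    spec:
      replicas: 3
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0   # never remove an old pod before a new one is ready
          maxSurge: 1         # bring up replacements one at a time
      selector:
        matchLabels:
          app: data-loader-app
      template:
        metadata:
          labels:
            app: data-loader-app
        spec:
          containers:
          - name: app
            image: registry.example.com/app:latest # placeholder
            readinessProbe:
              httpGet:
                path: /health                      # assumed endpoint
                port: 8080

Triggering the sequential restart can then be automated with `kubectl rollout restart deployment/data-loader-app` (or the equivalent `oc rollout` command for DeploymentConfigs) whenever a data change is detected.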

What happens when the Kubernetes master fails?

I've been trying to figure out what happens when the Kubernetes master fails in a cluster that only has one master. Do web requests still get routed to pods if this happens, or does the entire system just shut down?
According to the OpenShift 3 documentation, which is built on top of Kubernetes (https://docs.openshift.com/enterprise/3.2/architecture/infrastructure_components/kubernetes_infrastructure.html), if a master fails, nodes continue to function properly, but the system loses its ability to manage pods. Is this the same for vanilla Kubernetes?
In typical setups, the master nodes run both the API and etcd and are either largely or fully responsible for managing the underlying cloud infrastructure. When they are offline or degraded, the API will be offline or degraded.
In the event that they, etcd, or the API are fully offline, the cluster ceases to be a cluster and is instead a bunch of ad-hoc nodes for this period. The cluster will not be able to respond to node failures, create new resources, move pods to new nodes, and so on, until both of the following hold:
Enough etcd instances are back online to form a quorum and make progress (for a visual explanation of how this works and what these terms mean, see this page).
At least one API server can service requests
In a partially degraded state, the API server may be able to respond to requests that only read data.
However, in any case, life for applications will continue as normal unless nodes are rebooted or there is a dramatic failure of some sort during this time, because TCP/UDP services, load balancers, DNS, the dashboard, etc., should all continue to function for at least some time. Eventually, these things will all fail on different timescales. In single-master setups, or during a complete API failure, DNS failure will probably happen first as caches expire (on the order of minutes, though the exact timing is configurable; see the coredns cache plugin documentation). This is a good reason to consider a multi-master setup: DNS and service routing can continue to function indefinitely in a degraded state, even if etcd can no longer make progress.
There are actions that you could take as an operator which would accelerate failures, especially in a fully degraded state. For instance, rebooting a node would cause DNS queries, and in fact probably all pod and service networking on that node, to fail until at least one master comes back online. Restarting DNS pods or kube-proxy would also be bad.
If you'd like to test this out yourself, I recommend kubeadm-dind-cluster, kind, or, for more exotic setups, kubeadm on VMs or bare metal. Note: kubectl proxy will not work during an API failure, as it routes traffic through the master(s).
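For instance, with kind you can spin up a cluster with several control-plane nodes and then stop their containers one by one to watch what keeps working. A small sketch (the file name is arbitrary):

    # kind-config.yaml: three control-plane nodes plus two workers.
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
    - role: control-plane
    - role: control-plane
    - role: worker
    - role: worker

After `kind create cluster --config kind-config.yaml`, stopping the control-plane containers with `docker stop` simulates a partial or complete master outage while the worker containers keep running.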
A Kubernetes cluster without a master is like a company running without a manager.
No one other than the Manager (master node) can instruct the workers (the k8s components); even you, the owner of the cluster, can only instruct the Manager.
Everything keeps working as usual until the work is finished or something stops them (because the master node died after assigning the work).
As there is no Manager to re-assign any work, the workers will wait and wait until the Manager comes back.
The best practice is therefore to assign multiple Managers (masters) to your cluster.
Although your data plane and running applications do not immediately start breaking, there are several scenarios in which cluster admins will wish they had a multi-master setup. The key to understanding the impact is understanding which components talk to the master, for what, and, more importantly, when they will fail if the master fails.
Your application pods running on the data plane will not be immediately impacted, but imagine a very possible scenario: your traffic suddenly surges and your Horizontal Pod Autoscaler kicks in. The autoscaling will not work, because the Metrics Server collects resource metrics from the kubelets and exposes them through the Metrics API in the Kubernetes apiserver for use by the Horizontal and Vertical Pod Autoscalers, and your API server is already dead. If your pods' memory shoots up because of the high load, they will eventually be killed by the OOM killer. And if any of the pods die, then, since the controller manager and scheduler talk to the API server to watch the current state of pods, they too will fail: no new pod will be scheduled and your application may stop responding.
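To make the dependency concrete: a typical HPA like the sketch below (names are placeholders) is reconciled by the controller manager and reads its metrics through the apiserver's Metrics API, so both halves of that loop stall when the control plane is down.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-hpa                    # placeholder
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web                      # placeholder
      minReplicas: 3
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70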
One thing to highlight is that Kubernetes system components communicate only with the API server; they don't talk to each other directly. So if the API server is unavailable, their own functionality is impaired as well. An unavailable control plane can mean several things: failure of any or all of these components (API server, etcd, kube-scheduler, controller manager) or, worse, the entire master node has crashed.
If the API server is unavailable, no one can use kubectl, as essentially all commands talk to the API server. That means you cannot connect to the cluster, cannot exec into any pod to check anything on the container file system, and will not be able to see application logs unless you have an additional centralized log-management system.
If the etcd database fails or gets corrupted, your entire cluster state is gone, and admins will want to restore it from backups as early as possible.
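Which is why taking regular etcd snapshots while the cluster is healthy is worth the effort; a rough sketch with etcdctl, assuming kubeadm-style certificate paths (adjust endpoints and paths for your deployment):

    # Take a snapshot of the current cluster state.
    ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key

    # Restore it into a fresh data directory after a failure.
    ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
      --data-dir=/var/lib/etcd-restored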
In short: a failed single-master control plane may not immediately impact your traffic-serving capability, but it cannot be relied on to keep serving your traffic.