Make Kubernetes wait for Pod termination before removing from Service endpoints - kubernetes

According to Termination of Pods, step 7 occurs simultaneously with 3. Is there any way I can prevent this from happening and have 7 occur only after the Pod's graceful termination (or expiration of the grace period)?
The reason why I need this is that my Pod's termination routine requires my-service-X.my-namespace.svc.cluster.local to resolve to the Pod's IP during the whole process, but the corresponding Endpoint gets removed as soon as I run kubectl delete on the Pod / Deployment.
Note: In case it helps making this clear, I'm running a bunch of clustered VerneMQ (Erlang) nodes which, on termination, dump their contents to other nodes on the cluster — hence the need for the nodenames to resolve correctly during the whole termination process. Only then should the corresponding Endpoints be removed.

Unfortunately kubernetes was designed to remove the Pod from the endpoints at the same time as the prestop hook is started (see link in question to kubernetes docs):
At the same time as the kubelet is starting graceful shutdown, the
control plane removes that shutting-down Pod from Endpoints
This google kubernetes docs says it even more clearly:
Pod is set to the “Terminating” State and removed from the
endpoints list of all Services
There also was also a feature request for that. which was not recognized.
Solution for helm users
But if you are using helm, you can use hooks (e.g. pre-delete,pre-upgrade,pre-rollback). Unfortunately this helm hook is an extra pod which can not access all pod resources.
This is an example for a hook:
apiVersion: batch/v1
kind: Job
metadata:
name: graceful-shutdown-hook
annotations:
"helm.sh/hook": pre-delete,pre-upgrade,pre-rollback
labels:
app.kubernetes.io/name: graceful-shutdown-hook
spec:
template:
spec:
containers:
- name: graceful-shutdown
image: busybox:1.28.2
command: ['sh', '-cx', '/bin/sleep 15']
restartPolicy: Never
backoffLimit: 0

Maybe you should consider using headless service instead of using ClusterIP one. That way your apps will discover using the actual endpoint IPs and the removal from endpoint list will not break the availability during shutdown, but will remove from discovery (or from ie. ingress controller backends in nginx contrib)

Related

upgrading to bigger node-pool in GKE

I have a node-pool (default-pool) in a GKE cluster with 3 nodes, machine type n1-standard-1. They host 6 pods with a redis cluster in it (3 masters and 3 slaves) and 3 pods with an nodejs example app in it.
I want to upgrade to a bigger machine type (n1-standard-2) with also 3 nodes.
In the documentation, google gives an example to upgrade to a different machine type (in a new node pool).
I have tested it while in development, and my node pool was unreachable for a while while executing the following command:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
kubectl cordon "$node";
done
In my terminal, I got a message that my connection with the server was lost (I could not execute kubectl commands). After a few minutes, I could reconnect and I got the desired output as shown in the documentation.
The second time, I tried leaving out the cordon command and I skipped to the following command:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node";
done
This because if I interprete the kubernetes documentation correctly, the nodes are automatically cordonned when using the drain command. But I got the same result as with the cordon command: I lost connection to the cluster for a few minutes, and I could not reach the nodejs example app that was hosted on the same nodes. After a few minutes, it restored itself.
I found a workaround to upgrade to a new node pool with bigger machine types: I edited the deployment/statefulset yaml files and changed the nodeSelector. Node pools in GKE are tagged with:
cloud.google.com/gke-nodepool=NODE_POOL_NAME
so I added the correct nodeSelector to the deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
name: example-deployment
labels:
app: example
spec:
replicas: 3
selector:
matchLabels:
app: example
template:
metadata:
labels:
app: example
spec:
nodeSelector:
cloud.google.com/gke-nodepool: new-default-pool
containers:
- name: example
image: IMAGE
ports:
- containerPort: 3000
This works without downtime, but I'm not sure this is the right way to do in a production environment.
What is wrong with the cordon/drain command, or am I not using them correctly?
Cordoning a node will cause it to be removed from the load balancers backend list, so will a drain. The correct way to do it is to set up anti-affinity rules on the deployment so the pods are not deployed on the same node, or the same region for that matter. That will cause an even distribution of pods throught your node pool.
Then you have to disable autoscaling on the old node pool if you have it enabled, slowly drain 1-2 nodes a time and wait for them to appear on the new node pool, making sure at all times to keep one pod of the deployment alive so it can handle traffic.

Kubectl get deployments, no resources

I've just started learning kubernetes, in every tutorial the writer generally uses "kubectl .... deploymenst" to control the newly created deploys. Now, with those commands (ex kubectl get deploymets) i always get the response No resources found in default namespace., and i have to use "pods" instead of "deployments" to make things work (which works fine).
Now my question is, what is causing this to happen, and what is the difference between using a deployment or a pod? ? i've set the docker driver in the first minikube, it has something to do with this?
First let's brush up some terminologies.
Pod - It's the basic building block for Kubernetes. It groups one or more containers (such as Docker containers), with shared storage/network, and a specification for how to run the containers.
Deployment - It is a controller which wraps Pod/s and manages its life cycle, which is to say actual state to desired state. There is one more layer in between Deployment and Pod which is ReplicaSet : A ReplicaSet’s purpose is to maintain a stable set of replica Pods running at any given time. As such, it is often used to guarantee the availability of a specified number of identical Pods.
Below is the visualization:
Source: I drew it!
In you case what might have happened :
Either you have created a Pod not a Deployment. Therefore, when you do kubectl get deployment you don't see any resources. Note when you create Deployments it in turn creates a ReplicaSet for you and also creates the defined pods.
Or may be you created your deployment in a different namespace, if that's the case, then type this command to find your deployments in that namespace kubectl get deploy NAME_OF_DEPLOYMENT -n NAME_OF_NAMESPACE
More information to clarify your concepts:
Source
Below the section inside spec.template is the section which is supposedly your POD manifest if you were to create it manually and not take the deployment route. Now like I said earlier in simple terms Deployments are a wrapper to your PODs, therefore anything which you see outside the path spec.template is the configuration which you will need to defined on how you want to manage (scaling,affinity, e.t.c) your POD
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
app: nginx
spec:
replicas: 3
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Deployment is a controller providing higher level abstraction on top of pods and ReplicaSets. A Deployment provides declarative updates for Pods and ReplicaSets. Deployments internally creates ReplicaSets within which pods are created.
Use cases of deployment is documented here
One reason for No resources found in default namespace could be that you created the deployment in a specific namespace and not in default namespace.
You can see deployments in a specific namespace or in all namespaces via
kubectl get deploy -n namespacename
kubectl get deploy -A

Have kube jobs start on waiting pods

I am working on a scenario where I want to be able to maintain some X number of pods in waiting (and managed by kube) and then upon user request (via some external system) have a kube job start on one of those waiting pods. So now the waiting pods count is X-1 and kube starts another pod to bring this number back to X.
This way I'll be able to cut down on the time taken to create a pod, start a container and getting is ready to start actual processing. The processing data can be sent to those pods via some sort of messaging (akka or rabbitmq).
I think the ReplicationControllers are best place to keep idle pods, but when I create a job how can I specify that I want to be able to use one of the pods that are in waiting and are managed by ReplicationController.
I think I got this to work upto a state on top of which I can build this solution.
So what I am doing is starting a RC with replicas: X (X is the number of idle pods I wish to maintain, usually not a very large number). The pods that it starts have custom label status: idle or something like that. The RC spec.selector has the same custom label value to match with the pods that it manages, so spec.selector.status: idle. When creating this RC, kube ensures that it creates X pods with their status=idle. Somewhat like below:
apiVersion: v1
kind: ReplicationController
metadata:
name: testrc
spec:
replicas: 3
selector:
status: idle
template:
metadata:
name: idlepod
labels:
status: idle
spec:
containers:
...
On the other hand I have a job yaml that has spec.manualSelector: true (and yes I have taken into account that the label set has to be unique). With manualSelector enabled, I can now define selectors on the job like below.
apiVersion: batch/v1
kind: Job
metadata:
generateName: testjob-
spec:
manualSelector: true
selector:
matchLabels:
status: active
...
So clearly, RC creates pods with status=idle and job expects to use pods with status=active because of the selector.
So now whenever I have a request to start a new job, I'll update label on one of the pods managed by RC so that its status=active. The selector on RC will effect the release of this pod from its control and start another one because of replicas: X set on it. And the released pod is no longer controller by RC and is now orphan. Finally, when I create a job, the selector on this job template will match the label of the orphaned pod and this pod will then be controlled by the new job. I'll send messages to this pod that will start the actual processing and finally bring it to complete.
P.S.: Pardon my formatting. I am new here.

Run a pod without scheduler in Kubernetes

I searched the documentation but I am unable to find out if I can run a pod in Kubernetes without Scheduler. If anyone can help with any pointers it would be helpful
Update:
I can attach a label to node and let pod stick to that label but that would involve going through the scheduler. Is there any method without daemonset and does not use scheduler.
The scheduler just sets the spec.nodeName field on the pod. You can set that to a node name yourself if you know which node you want to run your pod, though you are then responsible for ensuring the node has sufficient resources to run the pod (enough memory, free host ports, etc… all things the scheduler is normally responsible for checking before it assigns a pod to a node)
You want static pods
Static pods are managed directly by kubelet daemon on a specific node, without API server observing it. It does not have associated any replication controller, kubelet daemon itself watches it and restarts it when it crashes.
You can simply add a nodeName attribute to the pod definition which by they normally field out by the scheduler therefor it's not a mandatory field.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- image: nginx
name: nginx
nodeName: node01
if the pod has been created and in pending state you have to recreate it with the new field, edit is not permitted with the nodeName att.
All the answers given here would require a scheduler to run.
I think what you want to do is create the manifest file of the pod and put it in the default manifest directory of the node in question.
Default directory is /etc/kubernetes/manifests/
The pod will automatically be created and if you wish to delete it, just delete the manifest file.
You can simply add a nodeName attribute to the pod definition
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
nodeName: controlplane
containers:
- image: nginx
name: nginx
Now important point - check the node listed by using below command, and then assign to one of them:
kubectl get nodes

How are minions chosen for given pod creation request

How does kubernetes choose the minion among many available for a given pod creation command? Is it something that can be controlled/tweaked ?
If replicated pods are submitted for deployment, is kubernetes intelligent enough to place them in different minions if they expose the same container/host port pair? Or does it always place different replicas in different minions ?
What about corner cases like what if two different pods (not necessarily replicas) that expose same host/container port pair are submitted? Will they carefully be placed on different minions ?
If a pod requires specific compute/memory requirements, can it be placed in a minion/host that has sufficient resources left to meet those requirement?
In summary, is there detailed documentation on kubernetes pod placement strategy?
Pods are scheduled to ports using the algorithm in generic_scheduler.go
There are rules that prevent host-port conflicts, and also to make sure that there are sufficient memory and cpu requirements. predicates.go
One way to choose minion for pod creation is using nodeSelector. Inside the yaml file of pod specify the label of minion for which you want to choose the minion.
apiVersion: v1
kind: Pod
metadata:
name: nginx1
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
nodeSelector:
key: value