changing Pod priority without restarting the Pod? - kubernetes

I am trying to change the priority of an existing Kubernetes Pod using the 'patch' command, but it returns an error saying that this is not one of the fields that can be modified. I can patch the priority in the Deployment spec, but that would cause the Pod to be recreated (following the defined update strategy).
The basic idea is to implement a mechanism conceptually similar to nice levels (for my application), so that certain Pods can be de-prioritized based on certain conditions (by my controller), and preempted by the default scheduler in case of resource congestion. But I don't want them to be restarted if there is no congestion.
Is there a way around this, or is there something inherent in the way the scheduler works that would prevent something like this from working properly?

Priority values are applied to a pod based on the priority value of the PriorityClass assigned to its deployment at the time the pod is scheduled. Any changes made to the PriorityClass will not be applied to pods which have already been scheduled, so you would have to redeploy the pod for the new priority to take effect anyway.

As far as I know,
Pod priority only takes effect when the pod is being scheduled.
First you need to create the PriorityClasses.
Then create the Pod with a priorityClassName, referencing that PriorityClass in the pod definition (see the sketch below the reference link).
If you are trying to add a priority to an already scheduled pod, it will not work.
For reference: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#pod-priority
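A minimal sketch of the two objects described above (the class name, value, and image are placeholders, not anything from the original question):

# A PriorityClass; the scheduler resolves the pod's priority from it when the pod is admitted
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority            # placeholder name
value: 1000                     # higher value = higher priority
globalDefault: false
description: "Example class for de-prioritized workloads"
---
# A Pod referencing the class; the resulting priority is fixed once the pod exists
apiVersion: v1
kind: Pod
metadata:
  name: example-pod             # placeholder name
spec:
  priorityClassName: low-priority
  containers:
    - name: app
      image: nginx:1.25         # placeholder image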

Related

The extension of updateable fields in Pod

I wonder why only some fields of a pod can be modified.
The updateable fields are:
spec.containers[*].image
spec.initContainers[*].image
spec.activeDeadlineSeconds
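For example, the container image of a running pod can be patched in place (the pod name, container name, and image below are placeholders):

kubectl patch pod <pod_name> -p '{"spec":{"containers":[{"name":"<container_name>","image":"nginx:1.25"}]}}'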
In my actual use case, I need to change the schedulerName of a pod.
So I changed the validation code in validation.go, and I created a second scheduler named kube-scheduler-test.
When I create a new pod whose schedulerName is kube-scheduler-test, kube-scheduler-test updates the pod's schedulerName to default-scheduler.
Then the default-scheduler schedules this pod to the specified node.
Could you explain why only some fields of a pod can be modified, and whether my method of changing the schedulerName is acceptable or not?
Pods are designed to be disposable; think of them as immutable. When you want to change anything about a Pod, you create a new Pod and throw away the old one.
There are many articles explaining immutable infrastructure.

How to delete a pod from Kubernetes master node?

Does anyone know how to delete a pod from the Kubernetes master node? I have a single master node on a bare-metal Ubuntu server. When I try to delete it with "kubectl delete pod .." or force-delete it as described here: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/ it doesn't work. The pod is created again and again...
The pods in a StatefulSet are managed by the StatefulSet controller and will be recreated whenever the current and the desired replicas defined in the spec do not match.
The document you linked provides instructions on how to kill the pods forcefully, skipping the graceful shutdown behaviour, which can have unexpected consequences depending on the application.
The link clearly states the pods will be recreated in the section:
Force deletions do not wait for confirmation from the kubelet that the Pod has been terminated. Irrespective of whether a force deletion is successful in killing a Pod, it will immediately free up the name from the apiserver. This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee.
If you want the pods to be stopped and no new pods for the StatefulSet to be created, you need to scale down the StatefulSet by changing the replicas to 0.
You can read the official docs on how to scale StatefulSet replicas.
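For example, using a placeholder StatefulSet name:

kubectl scale statefulset <statefulset_name> --replicas=0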
The key to figuring out how to kill the pod is to understand how it was created. For example, if the pod is part of a Deployment with a declared replica count of 1, then once you kill or force-kill it, Kubernetes detects a mismatch between the desired state (the number of replicas defined in the deployment configuration) and the current state, and will create a new pod to replace the one that was deleted. Therefore, in this example you will need to either scale the deployment to 0 or delete the deployment.
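One way to check what, if anything, manages a pod is to look at its ownerReferences (the pod name is a placeholder):

kubectl get pod <pod_name> -o jsonpath='{.metadata.ownerReferences[*].kind}'

Empty output means the pod stands alone and kubectl delete pod will remove it for good; output such as ReplicaSet or StatefulSet means the owning controller will recreate it unless you scale it down or delete it.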
If you need to kill pods managed by a Deployment, you can simply scale it down:
kubectl scale deploy <deployment_name> --replicas=<expected_no_of_replicas>
How you delete a pod depends on how it was created. If you created it individually (not as part of a ReplicaSet/ReplicationController/Deployment), then you can delete the pod directly; otherwise the only option is to scale down. In a production setup, I believe most people use the Deployment option out of ReplicaSet/ReplicationController/Deployment (please refer to the documentation and understand the difference between those three options).

kubernetes: specifying maxUnavailable in both Deployment and PDB

Assuming I have a Deployment with a specific value set to the .spec.strategy.rollingUpdate.maxUnavailable field.
Then I deploy a PodDisruptionBudget attached to the deployment above, setting its spec.maxUnavailable field to a value different to the above.
Which one will prevail?
By interpreting the documentation, it seems that it depends on the event.
For a rolling update, the Deployment's maxUnavailable will be in effect, even if the PodDisruptionBudget specifies a smaller value.
But for an eviction, the PodDisruptionBudget's maxUnavailable will prevail, even if the Deployment specifies a smaller value.
The documentation does not explicitly compare these two settings, but from the way the documentation is written, it can be deduced that these are separate settings for different events that don't interact with each other.
For example:
Updating a Deployment
Output of kubectl explain deploy.spec.strategy.rollingUpdate.maxUnavailable
Specifying a PodDisruptionBudget
Output of kubectl explain pdb.spec.maxUnavailable
Also, this is more in the spirit of how Kubernetes works. The Deployment Controller is not going to read a field of a PodDisruptionBudget, and vice versa.
But to be really sure, you would just need to try it out.
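To make the distinction concrete, here is a minimal sketch of the two fields side by side (names, labels, and numbers are purely illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                     # placeholder
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1         # applies only to the rollout itself
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: nginx:1.25     # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # placeholder
spec:
  maxUnavailable: 1             # applies only to voluntary evictions (node drains, the eviction API)
  selector:
    matchLabels:
      app: web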
I believe they have since updated the docs to clarify your doubt:
Involuntary disruptions cannot be prevented by PDBs; however they do count against the budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but workload resources (such as Deployment and StatefulSet) are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during application updates is configured in the spec for the specific workload resource
Caution: Not all voluntary disruptions are constrained by Pod Disruption Budgets. For example, deleting deployments or pods bypasses Pod Disruption Budgets.

Kubernetes Queue with Pod Per Work Item autoscaling

I want an application to pull an item off a queue, process the item on the queue and then destroy itself. Pull -> Process -> Destroy.
I've looked at using the job pattern Queue with Pod Per Work Item, as that fits the use case, but it isn't appropriate when I need the job to autoscale, i.e. 0 pods when the queue is empty, scaling up as items are added. The only way I can see of doing this is via a Deployment, but that removes the pattern of Queue with Pod per Work Item. There must be a fresh container per item.
Is there a way to have the job pattern Queue with Pod Per Work Item but with auto-scaling?
I am a bit confused, so I'll just say this: if you don't mind a failed pod and you want a failed pod not to be recreated by Kubernetes, you can do that in your code by catching all errors and exiting gracefully (not advised).
Please also note that for Deployments the only accepted restartPolicy is Always. So pods of a Deployment that crash will always be restarted by Kubernetes, and will probably fail for the same reason, leading to a CrashLoopBackOff.
If you want to scale a deployment depending on the length of a RabbitMQ queue, check KEDA. It is an event-driven autoscaling platform.
Make sure to also check their example with RabbitMQ.
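For illustration, a KEDA ScaledObject for a RabbitMQ-backed worker could look roughly like the sketch below; the deployment name, queue name, and connection details are placeholders, and the trigger metadata fields should be verified against the KEDA version you install:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler       # placeholder
spec:
  scaleTargetRef:
    name: queue-worker            # placeholder Deployment to scale
  minReplicaCount: 0              # scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        queueName: work-items     # placeholder queue
        mode: QueueLength         # scale on the number of messages in the queue
        value: "5"                # target messages per replica
        hostFromEnv: RABBITMQ_URL # connection string taken from the worker's environment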
Another possibility is a Job/Deployment that routinely checks the length of the queue in question and executes kubectl commands to scale your deployment.
Here is the cleanest one I could find, at least for my taste

Graceful shutdown of explicitly selected pod

I would like to remove a given, exact, selected pod from a set of pods controlled by the same Replication Controller, or the same Replica Set.
The use case is the following: each pod in the set runs a stateful (but in-memory) application. I would like to remove a pod from the set in a graceful way, i.e. before the removal I would like to be sure that there are no ongoing application sessions handled by the pod. Let's say I can solve the task of emptying the pod at the application level, i.e. no new sessions are directed to the selected pod, and I can measure the number of ongoing sessions in the pod, so I can decide when to remove it. The hard part is to remove this pod in such a way that the RC or RS does not replace it with a new one based on the value of "replicas".
I could not find a solution for this. The nearest one would be to isolate the pod from the RC or RS as suggested by http://kubernetes.io/docs/user-guide/replication-controller/#isolating-pods-from-a-replication-controller
Though, according to the same document, the RC or RS replaces the isolated pod with a new one. And as far as I can understand, there is no way to isolate the pod and decrease the value of "replicas" in an atomic way.
I have checked the coming PetSet support, but my application does not require e.g. persistent storage, or persistent pod ID. Such features are not necessary in my case, so my application is not really a pet from this perspective.
Maybe a new pod state (e.g. "target for removal" - the state name is not important to me) would do it, which could be patched via the API and which would be considered by the RC or RS when the value of "replicas" is decreased?
You can achieve this in three steps:
Add a label to all pods except the one you want to delete. Because the labels of those pods still satisfy the selector of the Replica Set, no new pods will be created.
Update the Replica Set: add the new label to the selector and decrease the replicas of the Replica Set atomically. The pod you want to delete won't be selected by the Replica Set any more because it doesn't have the new label.
Delete the selected pod.
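A rough sketch of the three steps with kubectl, using made-up ReplicaSet and pod names; note that in the current apps/v1 API the ReplicaSet selector is immutable, so step 2 as described relies on the older API versions this answer was written against:

# 1. Mark every pod you want to keep (all pods in the set except web-xyz12, the one to remove)
kubectl label pod web-abc34 web-def56 keep=true

# 2. In a single update, add keep=true to the ReplicaSet's selector and pod template labels,
#    and decrease spec.replicas by one
kubectl edit replicaset web

# 3. The de-selected pod is no longer owned by the ReplicaSet and can be deleted without being replaced
kubectl delete pod web-xyz12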