Configure Kubernetes scheduler interval?

I use k8s to run large compute Jobs (to completion). There are not many jobs or nodes in my setup, so the scale is small.
I noticed that the scheduler does not start a pod immediately when a node is available. It takes between 5 and 40 seconds for a pod to be scheduled once a node is ready.
Is there a way to make the scheduler more "aggressive"? I can't find a setting for the interval in the default scheduler's custom policy. Is implementing my own scheduler the only way forward? Any tips appreciated!

There is a difference between pod scheduling and pod creation. The scheduler only finds a suitable node and assigns the pod to that node; the pod itself is created by the kubelet.
The kubelet polls the API server for the desired state, picks up the spec of the newly scheduled pod, and then creates the pod.
This process can take the amount of time you describe in the question, so I don't think writing a custom scheduler would help here.
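One way to confirm where the time goes is to compare the timestamps in the pod's status conditions: the gap between PodScheduled and ContainersReady is kubelet-side work (image pulls, container creation), not the scheduler. Below is a trimmed, hypothetical example of what `kubectl get pod <name> -o yaml` might show; the timestamps are made up, but the condition types are standard.

```yaml
# Trimmed pod status; timestamps are illustrative only.
status:
  conditions:
  - type: PodScheduled          # set by kube-scheduler once a node has been picked
    status: "True"
    lastTransitionTime: "2021-01-01T10:00:02Z"
  - type: ContainersReady       # set once the kubelet has pulled images and started containers
    status: "True"
    lastTransitionTime: "2021-01-01T10:00:30Z"
  - type: Ready
    status: "True"
    lastTransitionTime: "2021-01-01T10:00:30Z"
  startTime: "2021-01-01T10:00:02Z"
```

If PodScheduled itself comes late, the delay really is on the scheduler's side; otherwise a custom scheduler won't change anything.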

Related

Kubernetes scheduler

Does the Kubernetes scheduler assign the pods to the nodes one by one in a queue (not in parallel)?
Based on this, I guess that might be the case, since it is mentioned that the nodes are iterated over in a round-robin fashion.
I want to make sure that the pod scheduling is not being done in parallel.
Short answer
Taking into consideration all the processes kube-scheduler performs when it's scheduling the pod, the answer is yes.
Scheduler and pods
For every newly created pod or other unscheduled pods, kube-scheduler
selects an optimal node for them to run on. However, every container
in pods has different requirements for resources and every pod also
has different requirements. Therefore, existing nodes need to be
filtered according to the specific scheduling requirements.
In a cluster, Nodes that meet the scheduling requirements for a Pod
are called feasible nodes. If none of the nodes are suitable, the pod
remains unscheduled until the scheduler is able to place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of
functions to score the feasible Nodes and picks a Node with the
highest score among the feasible ones to run the Pod. The scheduler
then notifies the API server about this decision in a process called
binding.
Reference - kube-scheduler.
The scheduler determines which Nodes are valid placements for each Pod
in the scheduling queue according to constraints and available
resources.
Reference - kube-scheduler - synopsis.
In short, kube-scheduler picks up pods one by one, assesses each pod and its requests, and then finds appropriate feasible nodes to schedule it on.
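To illustrate: the scheduler does parallelize work, but only inside a single pod's scheduling cycle (filtering and scoring candidate nodes), not across pods. A sketch of a KubeSchedulerConfiguration showing the relevant knobs, assuming a reasonably recent cluster (older versions use the v1beta2/v1beta3 apiVersion):

```yaml
# Hypothetical scheduler config file passed to kube-scheduler via --config.
# `parallelism` only parallelizes node filtering/scoring within one pod's
# scheduling cycle; pods are still taken from the scheduling queue one at a time.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
parallelism: 16
percentageOfNodesToScore: 50
```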
Scheduler and nodes
The link you mention is about nodes: it describes how all feasible nodes get a fair chance to run pods.
Nodes in a cluster that meet the scheduling requirements of a Pod are
called feasible Nodes for the Pod
The information above applies to the default kube-scheduler. There are alternative schedulers that can be used, and it's even possible to implement your own; you can also run multiple schedulers in a cluster.
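If you do run an additional scheduler, a pod opts into it via spec.schedulerName; the name my-custom-scheduler below is just a placeholder:

```yaml
# Sketch of a pod that asks to be handled by a second scheduler registered
# in the cluster under the (hypothetical) name `my-custom-scheduler`.
apiVersion: v1
kind: Pod
metadata:
  name: pod-using-custom-scheduler
spec:
  schedulerName: my-custom-scheduler   # defaults to `default-scheduler` when omitted
  containers:
  - name: app
    image: nginx
```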
Useful links:
Node selection in kube-scheduler
Kubernetes scheduler

Kubernetes scheduling ignores pod count per worker node

We have a kubernetes cluster with three worker nodes, which was built manually, borrowing from the 'Kubernetes the Hard Way' tutorial.
Everything on this cluster works as expected, with one exception:
The scheduler does not - or seems not to - honor the 110 pod per worker node limit.
Example:
Worker Node 1: 60 pods
Worker Node 2: 100 pods
Worker Node 3: 110 pods
When I want to deploy a new pod, it often happens that the scheduler decides it would be best to schedule the new pod onto 'Worker Node 3'. The kubelet refuses to do so, since it does honor its 110-pod limit. The scheduler tries again and again and never succeeds.
I do not understand why this is happening. I think I might be missing some detail about this problem.
From my understanding and what I have read about the scheduler itself, there is no resource or metric for 'amount of pods per node' which is considered while scheduling - or at least I haven't found anything that would suggest otherwise in the Kubernetes Scheduler documentation. Of course the scheduler considers CPU requests/limits, memory requests/limits, disk requests/limits - that's all fine and working. So I don't even know how the scheduler could ever consider the amount of pods used on a worker, but there has to be some kind of functionality doing that, right? Or am I mistaken?
Is my cluster broken? Is there some misconception I have about how scheduling should/does work?
Kubernetes binary versions: v1.17.2
Edit: added the Kubernetes version
Usually this means the other nodes are unsuitable, either explicitly via taints etc., or, more often, because of things like insufficient room for resource requests.
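For what it's worth, the per-node pod count is surfaced to the scheduler through the node's reported capacity: the kubelet publishes a `pods` value under `status.capacity`/`status.allocatable`, and the scheduler's fit check treats it like any other resource. It may be worth verifying that each node actually reports the limit you expect, e.g. via `kubectl get node <name> -o yaml`. A trimmed, hypothetical example of that output:

```yaml
# Trimmed node status; the numbers are illustrative. If `pods` here does not
# match the limit configured on the kubelet (`maxPods` in the KubeletConfiguration,
# or the --max-pods flag), the scheduler is working from the wrong numbers.
status:
  capacity:
    cpu: "4"
    memory: 8052556Ki
    pods: "110"
  allocatable:
    cpu: 3800m
    memory: 7438156Ki
    pods: "110"
```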

K8s - Schedule new pod before the old one is terminated

I have read up on the Kubernetes docs but I'm unable to get a clear answer to my question. I'm using the official cluster-autoscaler.
I have a deployment that specifies one replica should be running. When a pod is terminated (for example, because it was running on a node that is being scaled down), is the new pod scheduled before the termination begins or after the termination is done? The docs say that scheduling happens while terminating, but don't mention at which phase.
To achieve seamless node scale-down without disruption to any services, I would expect k8s to scale up pods to N+1 replicas (at this point pods are scheduled only to nodes that are not scaling down) and then drain the node. Based on my testing, it first drains and then schedules any missing pods based on the configuration. Is it possible to configure this behaviour, or is this currently not possible?
From what I understand, seamless updates are easy with the RollingUpdate strategy. I have not found the same "rolling" strategy to be possible for scale-down.
EDIT
TL;DR I'm looking for HA for a) deployments with two or more replicas and b) single-replica deployments
a) Can be achieved by using PDBs. Check out Fritz's answer. If you need pods scheduled on different nodes, leverage anti-affinity (Marc's answer).
b) If you're okay with a short disruption, a PDB is the official way to go. If you need a workaround, my answer can be of inspiration.
The scale-down behaviour can be configured with what is called a Disruption Budget.
Alongside your Deployment manifest you can define a PodDisruptionBudget that sets maxUnavailable or minAvailable for your Pods during voluntary disruptions like draining nodes.
For how to do it, check out the K8s documentation.
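A minimal sketch of such a budget, assuming your Deployment's pods carry the label app: my-app (on older clusters the apiVersion is policy/v1beta1):

```yaml
# Hypothetical PodDisruptionBudget; name and labels are placeholders. During a
# voluntary disruption (e.g. `kubectl drain`) the eviction API will refuse to
# take the number of matching pods below `minAvailable`.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1              # or use maxUnavailable instead
  selector:
    matchLabels:
      app: my-app
```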
Below are some insights, hope this will help:
If you use a deployment, then the Deployment controller makes sure you always have the desired number of replicas running. No less, no more. So when you kill a node (which hosts one of your replicas), the new pod will be scheduled after the termination of one of your original replicas. It's up to you to anticipate whether it's planned maintenance.
If you have lots of nodes (meaning more than one) and want to achieve HA (high availability) for your deployments, then you should have a look at pod affinity/anti-affinity. You can find out more in the official docs.
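For example, a sketch of anti-affinity that spreads two replicas across nodes; the names and the app: my-app label are placeholders, and the required... rule can be replaced with the preferred... variant for a soft constraint:

```yaml
# Hypothetical Deployment whose replicas must land on different nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname   # one replica per node
      containers:
      - name: app
        image: nginx
```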
Hate to answer my own question, but an easy solution for a highly available service with only one pod (without wasting resources by running an idle replica) is to use a preStop hook (to make termination blocking if proper SIGTERM handling is not implemented) together with terminationGracePeriodSeconds set high enough for the replacement pod to start.
Contrary to what has been said here, the scheduling happens while the pod is terminating. I did a quick test (should have done that together with reading the docs) where I created a busybox (sh sleep 3600) deployment with one replica and terminationGracePeriodSeconds set to 240 seconds.
When the pod is deleted, it enters the Terminating state and stays in that state for up to 240 seconds. Immediately after the pod was marked as Terminating, a new pod was scheduled to replace it.
So the previous pod has time to finish whatever it is doing and the other one can seamlessly take its place.
I haven't tested how the networking will behave, since the LB will stop sending new requests, but I assume the downtime will be much lower than with the default terminationGracePeriodSeconds.
Beware that this is not official by any means, but it serves as a workaround for my use case.
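A minimal sketch of the workaround described above; the image, names and the 60-second preStop sleep are placeholders rather than tested values:

```yaml
# Hypothetical single-replica Deployment using a blocking preStop hook.
# The hook delays termination long enough for the replacement pod (scheduled
# as soon as this one enters Terminating) to come up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-replica-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: single-replica-app
  template:
    metadata:
      labels:
        app: single-replica-app
    spec:
      terminationGracePeriodSeconds: 240   # must exceed the preStop sleep
      containers:
      - name: app
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 60"]
```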

K8s scheduling timeout

I have the cluster-autoscaler integrated with K8s, and worker nodes are scaled from 0. I don't want to schedule more than 2 pods per node, and I've set the pod limit at the kubelet level. When I run 5 jobs in parallel with a 4-pod limit, it scales up 3 nodes, but it tries to schedule on only 2 nodes, and one pod goes down due to the limit. Is there any scheduling limit parameter in K8s, e.g. to schedule pods only after a specific time (some sleep parameter)? We need to wait until all workers become ready.
It's not possible with the default scheduler that Kubernetes comes with. You need to implement a custom scheduler and write the logic there to cater to this use case.
We need to wait until all workers become ready
Seems to me you need to use readiness probes on your pods
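A minimal readiness-probe sketch; the image, path and port are placeholders:

```yaml
# Hypothetical pod with a readiness probe. The pod only counts as Ready (and
# only receives Service traffic) once the probe succeeds, so consumers wait
# for the worker to actually be up instead of relying on a sleep.
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  containers:
  - name: worker
    image: my-worker:latest
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
```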

Can a pod be forcefully marked as "terminating" using liveness probe or any other way

We have a use case in our application where we need the pod to terminate after it processes a request. The corresponding deployment will take care of spinning up a new pod to maintain the replica count.
I was exploring the use of liveness probes, but they only restart the containers, not the pods.
Is there any other way to terminate the pod, from service level or deployment level?
You need to get familiar with Pod lifetime
In general, Pods do not disappear until someone destroys them. This
might be a human or a controller. The only exception to this rule is
that Pods with a phase of Succeeded or Failed for more than some
duration (determined by terminated-pod-gc-threshold in the master)
will expire and be automatically destroyed.
In your case consider using Jobs.
Use a Job for Pods that are expected to terminate, for example, batch
computations. Jobs are appropriate only for Pods with restartPolicy
equal to OnFailure or Never.
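For illustration, a minimal Job sketch; the image and command are placeholders:

```yaml
# Hypothetical Job. The pod runs to completion and then stays in the Succeeded
# phase (eventually garbage-collected) instead of being restarted the way a
# Deployment replica would be.
apiVersion: batch/v1
kind: Job
metadata:
  name: process-request
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never      # Jobs require Never or OnFailure
      containers:
      - name: worker
        image: my-processor:latest
        command: ["sh", "-c", "process-one-request"]
```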
Please let me know if that helped.