Stateful jobs in Kubernetes

I have a requirement to run an ad-hoc job, once in a while. The job needs some state to work. Building the state takes a lot of time. So, it is desired to keep the state persistent and reusable in subsequent runs, for a fast turnaround time. I want this job to be managed as K8s pods.
This is a complete set of requirements:
Pods will go down after the work finishes. The K8s controller should not try to bring the pods back up.
Each pod should have a persistent volume attached to it. There should be 1 volume per pod. I am planning to use EBS.
We should be able to manually bring the pods back up in future.
Future runs may have more or fewer replicas than past runs.
I know K8s supports both Jobs and StatefulSets. Is there any controller that supports both at the same time?

Pods will go down after work finish. The K8s controller should not try to bring up the pods.
This is what Jobs do: run to completion. You only control whether to retry on a non-zero exit code.
Pods should have a persistent volume attached to them.
Same volume for all of them? Will they write or only read? What volume backend do you have, AWS EBS or similar? Depending on the answers, you might want to split the input data across a few volumes, or have each pod write to a separate volume and then run a finalization job that assembles the results into one volume (a kind of map-reduce). Alternatively, use a volume backend that supports multi-mount read-write access: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes (see the table for ReadWriteMany).
We should be able to manually bring the pods back up in future.
Jobs fit here: You launch it when you need it, and it runs till completion.
Future runs may have more or less replicas than the past runs.
Jobs fit here. Specify different completions or parallelism when you launch a job: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs
StatefulSets are a different concept: they are mostly used for clustered software that runs continuously and needs to persist a role per pod (e.g. a shard).
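A minimal sketch of the "one volume per pod" requirement using Jobs, in Python that only builds manifests as plain dicts. Because all pods of a single Job share one pod template, this sketch swaps in one Job per shard rather than one Job with parallelism; that is a design choice of the example, not something the answer prescribes. The names stateful-work-..., state-<shard>, /state and the image are hypothetical, and the PersistentVolumeClaims (for example EBS-backed, ReadWriteOnce) are assumed to exist already and to outlive the Jobs.

```python
import json


def build_shard_job(run_id: str, shard: int, image: str) -> dict:
    """Build a batch/v1 Job manifest for one shard that mounts that shard's own PVC."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"stateful-work-{run_id}-{shard}"},  # hypothetical naming scheme
        "spec": {
            "backoffLimit": 0,  # do not retry when the container exits non-zero
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "volumeMounts": [{"name": "state", "mountPath": "/state"}],
                    }],
                    "volumes": [{
                        "name": "state",
                        # Assumption: a PVC named state-<shard> was created beforehand.
                        "persistentVolumeClaim": {"claimName": f"state-{shard}"},
                    }],
                }
            },
        },
    }


def build_run(run_id: str, replicas: int, image: str) -> list:
    """Build one Job per shard; a later run may pass a different replica count."""
    return [build_shard_job(run_id, shard, image) for shard in range(replicas)]


if __name__ == "__main__":
    items = build_run("run1", 3, "example.com/worker:latest")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f. A run with more shards than before needs the extra PVCs created first; a run with fewer shards simply leaves the unused PVCs (and their state) in place.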

Related

Why would the Kubernetes scheduler always place my Pod replicas on the same node in AKS?

We have an AKS test cluster with four Windows worker nodes and a Deployment with a replica count of two. The corresponding Pod spec does not specify any resource requests and limits (thus, the resulting Pods are in the BestEffort QoS class).
In order to conduct a performance test, we scaled all other Deployments on those worker nodes to 0 replicas and deleted all remaining Pods on the nodes. Only the system Pods created by AKS's own DaemonSets (in the kube-system namespace) remained. We then created the Deployment mentioned above.
We had assumed that the default Kubernetes scheduler would place the two replicas on different nodes by default, or at least choose nodes randomly. However, the scheduler always chose the same node to place both replicas on, no matter how often we deleted the Pods or scaled the Deployment to 0 and back again to 2. Only after we tainted that node as NoSchedule, did the scheduler choose another node.
I know I could configure anti-affinities or topology spread constraints to get a better spreading of my Pods. But in the Cloud Native DevOps with Kubernetes book, I read that the scheduler actually does a very good job by default and one should only use those features if absolutely necessary. (Instead maybe using the descheduler if the scheduler is forced to make bad decisions.)
So, I would like to understand why the behavior we observed would happen. From the docs, I've learned that the scheduler first filters the nodes for fitting ones. In this case, all of them should fit, as all are configured identically. It then scores the nodes, choosing randomly if all have the same score. Why would one node always win that scoring?
Follow-up question: Is there some way I could reconstruct the scheduler's decision logic in AKS? I can see kube-scheduler logs in Container Insights, but they don't contain any information about scheduling decisions, just some operational messages.
I believe the scheduler is aware of which Nodes already have the container images pulled, and gives them preference to avoid the image pull (and thus get a faster start time).
Short of digging up the source code as proof, I would guess one could create a separate Pod (for this purpose, I literally mean kind: Pod), force it onto one of the other Nodes via nodeName:, then, after the Pod has been scheduled and has attempted to start, delete the Pod and scale up your Deployment.
I would then expect the new Deployment-managed Pod to land on that other Node, because it by definition has fewer resources in use and now also has the required container image.
Following mdaniel's reply, which I've marked as the accepted answer, we did some more analysis and found the list of scheduling plugins and the scheduling framework docs. Reading the code, we can see that the ImageLocality plugin assigns a very high score because the Windows container images are really large. As we don't have resource requests, the NodeResourcesFit plugin does not compensate for this.
We did not find a plugin that would try to avoid putting Pod replicas on the same node (unless configured via anti-affinities or a PodTopologySpreadConstraint). That surprised me, as it would seem like a good default.
Some experimentation shows that the situation does change once we start adding (even minimal) resource requests, for example.
In the future, we'll therefore assign resource requests (which is good practice anyway) and, if this isn't enough, follow up with PodTopologySpreadConstraints.
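As a rough illustration of that plan, here is a sketch in Python that builds a Deployment manifest carrying both small resource requests and a topology spread constraint over nodes. The request values, the name and the image are placeholders chosen for the example, not values from our cluster, and ScheduleAnyway is just one possible setting (it makes the constraint a preference rather than a hard rule).

```python
import json


def build_spread_deployment(name: str, image: str, replicas: int = 2) -> dict:
    """Build an apps/v1 Deployment with resource requests and a per-node spread constraint."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,  # replica counts per node may differ by at most 1
                        "topologyKey": "kubernetes.io/hostname",  # spread across nodes
                        "whenUnsatisfiable": "ScheduleAnyway",  # soft preference, not a hard rule
                        "labelSelector": {"matchLabels": labels},
                    }],
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Placeholder requests; real values should come from measurements.
                        "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_spread_deployment("perf-test", "example.com/app:latest"), indent=2))
```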

Schedule as many pods as will fit in the cluster?

I've got a batch job to run: process a large number of media files. I have a Kubernetes cluster to run it on, but I don't want to change the size of the cluster. I want to run the processing as a low-priority job. Any time there are spare compute resources, they should work on media-processing. Any time there are other jobs that need resources, the media process should be suspended.
Currently, I'm running a Deployment with one replica for each node in my cluster. I defined a PriorityClass for the batch-job and a different PriorityClass (with higher priority) for everything else. That seems to be working to evict running batch-jobs when something else needs the resources.
I define an affinity, specifically a weighted pod anti-affinity, to discourage the batch-job pods from being scheduled on the same machine.
The code itself is a queue-worker: it pulls one work-item off a shared queue and processes it and then goes back for the next. If it gets interrupted (because it's being evicted) the partial work is lost (which is fine).
This is working OK, but I'm leaving a lot of resources on the table, still. Is there some way to define my replica-count as "as many as you can schedule"? I could ask for far more replicas than the cluster can handle; would that be a good solution? Or are there problems with Kubernetes having 10 pods stuck "pending" for months at a time?
I think there's no harm in asking for more pods than the cluster can handle and keeping them pending forever. My only concern is whether the scheduler will be able to discern normal priority pending pods over low priority pending pods, and be able to give precedence to the more urgent ones.
The pro way to go about this issue, IMHO, is to leverage the Prometheus adapter and use an HPA that targets the current capacity of your cluster via a Prometheus query. This can give you a continuous view of the cluster capacity and the ability to autoscale accordingly. This Medium article has a pretty good introduction to the concept.
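For the simpler "ask for more replicas than fit" approach, a sketch of the two objects involved, built as plain dicts in Python. The class name, the priority value 100, the replica count and the request sizes are all hypothetical; the only assumption that matters is that the batch class has a lower value than the class used for everything else, so surplus batch pods stay Pending and running ones can be preempted.

```python
import json


def build_priority_class(name: str, value: int, description: str) -> dict:
    """Build a scheduling.k8s.io/v1 PriorityClass."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


def build_batch_deployment(name: str, image: str, replicas: int, priority_class_name: str) -> dict:
    """Build a Deployment that deliberately requests more replicas than the cluster may fit."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": priority_class_name,
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        # Requests are what make surplus replicas unschedulable (Pending)
                        # instead of piling onto already busy nodes. Values are placeholders.
                        "resources": {"requests": {"cpu": "1", "memory": "1Gi"}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    low = build_priority_class("batch-low", 100, "Preemptible media-processing workers")
    workers = build_batch_deployment("media-worker", "example.com/media:latest", 50, "batch-low")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [low, workers]}, indent=2))
```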

K8s - Schedule new pod before the old one is terminated

I have read up on the Kubernetes docs but I'm unable to get a clear answer on my question. I'm using the official cluster-autoscaler.
I have a deployment that specifies one replica should be running. When a pod is terminated (for example, because it was running on a node that is being scaled down), is the new pod scheduled before the termination begins or after the termination is done? The docs say that scheduling happens when terminating, but don't mention in which phase.
To achieve seamless node scale-down without disruption to any services, I would expect K8s to scale up pods to N+1 replicas (at this point pods are scheduled only to nodes that are not scaling down) and then drain the node. Based on my testing, it first drains, and then schedules any missing pods based on configurations. Is it possible to configure this behaviour, or is this currently not possible?
From what I understand, seamless updates are easy with the RollingUpdate strategy. I have not found the same "Rolling" strategy to be possible for scale-down.
EDIT
TL;DR I'm looking for HA on a) two+ replica deployment and b) one replica deployment
a) Can be achieved by using PDBs. Check out Fritz's answer. If you need pods scheduled on different nodes, leverage anti-affinity (Marc's answer).
b) If you're okay with short disruption, PDB is the official way to go. If you need a workaround, my answer can be of inspiration.
The scale-down behavior can be configured with what is called a Pod Disruption Budget.
In a PodDisruptionBudget manifest (a separate object whose selector matches the Pods of your Deployment), you can define either maxUnavailable or minAvailable to limit how many Pods may be down during voluntary disruptions like draining nodes.
For how to do it, check out the K8s Documentation.
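A minimal sketch of such a budget, built as a dict in Python. A PodDisruptionBudget is its own policy/v1 object that selects Pods by label, and its spec takes exactly one of minAvailable or maxUnavailable; the helper name, the budget name and the label app: web are hypothetical.

```python
import json


def build_pdb(name: str, match_labels: dict, min_available=None, max_unavailable=None) -> dict:
    """Build a policy/v1 PodDisruptionBudget; exactly one of the two limits must be given."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    spec = {"selector": {"matchLabels": match_labels}}
    if min_available is not None:
        spec["minAvailable"] = min_available
    else:
        spec["maxUnavailable"] = max_unavailable
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # Keep at least one Pod labelled app=web running while a node is drained.
    print(json.dumps(build_pdb("web-pdb", {"app": "web"}, min_available=1), indent=2))
```

Note that with a single replica and minAvailable set to 1, a drain cannot evict the Pod at all, which is why this approach fits the two-or-more replica case better.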
Below are some insights; I hope they help:
If you use a Deployment, its controller makes sure that you always have the desired number of replicas running. No fewer, no more. So when you kill a node (which has one of your replicas), the new pod will be scheduled after the termination of one of your original replicas. It's up to you to plan ahead if it's planned maintenance.
If you have lots of nodes (meaning more than one) and want to achieve HA (high availability) for your deployments, then you should have a look at pod affinity/anti-affinity. You can find out more in the official docs.
Hate to answer my own question, but an easy solution for a high-availability service with only one pod (without wasting resources on an idle second replica) is to use a preStop hook (to make the shutdown blocking if proper SIGTERM handling is not implemented) together with a terminationGracePeriodSeconds long enough for the replacement pod to start.
Contrary to what has been said here, the scheduling happens while the pod is terminating. I verified this with a quick test (which I should have done while reading the docs): I created a busybox (sh sleep 3600) deployment with one replica and terminationGracePeriodSeconds set to 240 seconds.
After being deleted, the pod enters the Terminating state and stays in that state for 240 seconds. Immediately after the pod was marked as Terminating, a new pod was scheduled to replace it.
So the previous pod has time to finish whatever it is doing and the other one can seamlessly take its place.
I haven't tested how the networking will behave, since the LB will stop sending new requests, but I assume the downtime will be much lower than without terminationGracePeriodSeconds set higher than the default.
Beware that this is not official by any means, but it serves as a workaround for my use case.
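A sketch of the pod spec fragment this workaround relies on, built as a dict in Python. The 240-second grace period matches the test above; the 30-second sleep in the preStop hook and the container name are placeholders. The hook's run time counts against the grace period, so the helper rejects a sleep that is not shorter than it.

```python
import json


def build_graceful_pod_spec(image: str, grace_seconds: int = 240, pre_stop_sleep: int = 30) -> dict:
    """Build a pod spec whose container blocks in a preStop hook before it receives SIGTERM."""
    if pre_stop_sleep >= grace_seconds:
        # The preStop hook runs inside the grace period; leave time for the actual shutdown.
        raise ValueError("pre_stop_sleep must be shorter than grace_seconds")
    return {
        "terminationGracePeriodSeconds": grace_seconds,
        "containers": [{
            "name": "app",
            "image": image,
            "lifecycle": {
                "preStop": {
                    # Assumes the image ships a shell; busybox does.
                    "exec": {"command": ["sh", "-c", f"sleep {pre_stop_sleep}"]}
                }
            },
        }],
    }


if __name__ == "__main__":
    # This dict belongs under spec.template.spec of the Deployment.
    print(json.dumps(build_graceful_pod_spec("busybox"), indent=2))
```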

Pod is still running after I delete the parent job

I created a job in my Kubernetes cluster. The job takes a long time to finish, so I decided to cancel it and deleted the job, but I noticed that the associated pod is NOT automatically deleted. Is this the expected behavior? Why is it not consistent with deployment deletion? Is there a way to have the pod deleted automatically?
If you're deleting a deployment, chances are you don't want any of the underlying pods, so it most likely forcefully deletes the pods by default. Also, the desired state of pods would be unknown.
On the other hand, if you're deleting a pod, it doesn't know what kind of replication controller may be attached to it and what it is doing next. So it signals a shutdown to the container so that it can perhaps clean up gracefully. There may be processes that are still using the pod, like a web request, and it would not be good to kill such a request if it only needs another second to complete. This is what happens when you are scaling your pods or rolling out a new deployment and you don't want any of the users to experience downtime. This is in fact one of the benefits of Kubernetes, as opposed to a traditional application server, which requires you to shut down the system to upgrade (or to play with load balancers to redirect traffic), which may negatively affect users.
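On the "is there a way" part of the question: the Kubernetes delete API accepts a DeleteOptions body whose propagationPolicy field (Orphan, Background or Foreground) controls whether dependents such as a Job's pods are left behind or garbage-collected. The Python sketch below only builds the request path and body and does not talk to a cluster; the namespace and job name are hypothetical. Whether an orphaning default explains what was observed depends on the client and cluster version, which the question does not state.

```python
import json

ALLOWED_POLICIES = {"Orphan", "Background", "Foreground"}


def job_delete_request(namespace: str, job_name: str, propagation: str = "Foreground"):
    """Return (path, body) for a DELETE request that removes a Job and, unless
    propagation is "Orphan", lets the garbage collector remove its pods too."""
    if propagation not in ALLOWED_POLICIES:
        raise ValueError(f"propagation must be one of {sorted(ALLOWED_POLICIES)}")
    path = f"/apis/batch/v1/namespaces/{namespace}/jobs/{job_name}"
    body = {
        "apiVersion": "v1",
        "kind": "DeleteOptions",
        "propagationPolicy": propagation,
    }
    return path, body


if __name__ == "__main__":
    path, body = job_delete_request("default", "long-running-job")
    print("DELETE", path)
    print(json.dumps(body, indent=2))
```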

Kubernetes batch performance with activation of thousands of pods using jobs

I am writing a pipeline with kubernetes in google cloud.
I sometimes need to start a few pods within a second, where each pod runs a single task.
I plan to call kubectl run with a Kubernetes job, wait for it to complete (polling all the running pods every second), and then activate the next step in the pipeline.
I will also monitor the cluster size to make sure I am not exceeding the max CPU/RAM usage.
I can run tens of thousands of jobs at the same time.
I am not using standard pipelines because I need to create a dynamic number of tasks in the pipeline.
This is a batch operation, so I can tolerate some delay.
Is this the best approach? How long does it take to create a pod in Kubernetes?
If you want to run tens of thousands of jobs at the same time, you will definitely need to plan resource allocation. Estimate the number of nodes you need. After that, you can either create all nodes at once or use the GKE cluster autoscaler to add new nodes automatically in response to resource demand. If you preallocate all nodes at once, you will probably have a high bill at the end of the month, but pods can be created very quickly. If you create only a small number of nodes initially and rely on the cluster autoscaler, you will face large delays, because nodes take several minutes to start. You must decide which approach to take.
If you use the cluster autoscaler, do not forget to specify the maximum number of nodes in the cluster.
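The node estimate can be sketched as simple arithmetic in Python: work out how many pods fit on one node by CPU, by memory and by the per-node pod limit, take the smallest of the three, and divide. All numbers in the usage example are made up for illustration; real per-pod requests and node allocatable values have to come from your own workload and node type, and the limit of 110 pods per node is an assumption you should check for your cluster.

```python
import math


def estimate_nodes(concurrent_pods: int,
                   pod_cpu_millis: int, pod_mem_mib: int,
                   node_cpu_millis: int, node_mem_mib: int,
                   max_pods_per_node: int = 110) -> int:
    """Estimate how many nodes are needed to run concurrent_pods at once."""
    pods_per_node = min(
        node_cpu_millis // pod_cpu_millis,  # limit imposed by CPU
        node_mem_mib // pod_mem_mib,        # limit imposed by memory
        max_pods_per_node,                  # limit imposed by the node's pod cap
    )
    if pods_per_node < 1:
        raise ValueError("a single pod does not fit on one node")
    return math.ceil(concurrent_pods / pods_per_node)


if __name__ == "__main__":
    # Hypothetical: 10,000 pods of 250m CPU / 512 MiB on nodes with
    # 3920m CPU / 12288 MiB allocatable -> 15 pods per node (CPU-bound).
    print(estimate_nodes(10_000, 250, 512, 3920, 12288))
```

The result is also a reasonable starting point for the autoscaler's maximum node count.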
Another important thing: you should put your jobs into the Guaranteed quality-of-service class in Kubernetes. Otherwise, if you use BestEffort or Burstable pods, you will end up in an eviction nightmare, which is really terrible and uncontrolled.
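To make the QoS point concrete, here is a simplified Python sketch of how the class follows from the containers' resources: Guaranteed needs CPU and memory limits on every container with requests equal to the limits (an omitted request defaults to the limit), BestEffort means no CPU or memory requests or limits at all, and everything else is Burstable. The sketch compares quantities as plain strings, so "500m" and "0.5" would count as different, and it ignores init containers; both are simplifications of this example.

```python
def qos_class(containers: list) -> str:
    """Classify a pod's QoS from its containers' cpu/memory requests and limits (simplified)."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            request = requests.get(name, limit)  # an omitted request defaults to the limit
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    job_container = {
        "name": "task",
        "resources": {
            "requests": {"cpu": "500m", "memory": "256Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
    }
    print(qos_class([job_container]))
```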