reversing the order of pod recreation in Kubernetes Vertical Pod Autoscaler - kubernetes

I am trying to use VPA for autoscaling my deployed services. Due to limitation in resources in my cluster I set the min_replica option to 1. The workflow of VPA that have seen so far is that it first deletes the existing pod and then re-create the pod. This approach will cause a downtime to my services. What I want is that the VPA first create the new pod and then deletes the old pod, completely similar to the rolling updates for deployments. Is there an option or hack to reverse the flow to the desired order in my case?

This can be achieved by using python script or by using an IAC pipeline, you can get the metrics of the kubernetes cluster and whenever these metrics exceed a certain threshold, trigger this python code for creating new pod with the required resources and shutdown the old pod. Follow this github link for more info on python plugin for kubernetes.
Ansible can also be used for performing this operation. This can be achieved by triggering your ansible playbook whenever the threshold breaches a certain limit and you can specify the new sizes of the pods that need to be created. Follow this official ansible document for more information. However both these procedures involve manual analysis for selecting the desired pod size for scaling. So if you don’t want to use vertical scaling you can go for horizontal scaling.
Note: The information is gathered from official Ansible and github pages and the urls are referred to in the post.

Related

Why would the Kubernetes scheduler always place my Pod replicas on the same node in AKS?

We have an AKS test cluster with four Windows worker nodes and a Deployment with a replica count of two. The corresponding Pod spec does not specify any resource requests and limits (thus, the resulting Pods are in the BestEffort QoS class).
In order to conduct a performance test, we scaled all other Deployments on those worker nodes to 0 replicas and deleted all remaining Pods on the nodes. Only the system Pods created by AKS DaemonSets itself (in the kube-system namespace) remained. We then created the Deployment mentioned above.
We had assumed that the default Kubernetes scheduler would place the two replicas on different nodes by default, or at least choose nodes randomly. However, the scheduler always chose the same node to place both replicas on, no matter how often we deleted the Pods or scaled the Deployment to 0 and back again to 2. Only after we tainted that node as NoSchedule, did the scheduler choose another node.
I know I could configure anti-affinities or topology spread constraints to get a better spreading of my Pods. But in the Cloud Native DevOps with Kubernetes book, I read that the scheduler actually does a very good job by default and one should only use those features if absolutely necessary. (Instead maybe using the descheduler if the scheduler is forced to make bad decisions.)
So, I would like to understand why the behavior we observed would happen. From the docs, I've learned that the scheduler first filters the nodes for fitting ones. In this case, all of them should fit, as all are configured identically. It then scores the nodes, choosing randomly if all have the same score. Why would one node always win that scoring?
Follow-up question: Is there some way how I could reconstruct the scheduler's decision logic in AKS? I can see kube-scheduler logs in Container Insights, but they don't contain any information regarding scheduling, just some operative stuff.
I believe that the scheduler is aware of which Nodes already have the container images pulled down, and will give them preference to avoid the image pull (and thus faster start time)
Short of digging up the source code as proof, I would guess one could create a separate Pod (for this purpose, I literally mean kind: Pod), force it onto one of the other Nodes via nodeName:, then after the Pod has been scheduled and attempted to start, delete the Pod and scale up your Deployment
I would then expect the new Deployment managed Pod to arrive on that other Node because it by definition has less resources in use but also has the container image required
Following mdaniel's reply, which I've marked as the accepted answer, we've done some more analysis and have found the list of scheduling plugins and the scheduling framework docs. Reading the code, we can see the ImageLocality plugin assigns a very high score due to the Windows container images being really large. As we don't have resource requests, the NodeResourcesFit plugin will not compensate this.
We did not find a plugin that would strive to not put Pod replicas onto the same node (unless configured via anti-affinities or a PodTopologySpreadConstraint). Which surprised me, as that would seem to be a good default to me?
Some experimentation shows that the situation indeed changes, once we, for example, start adding (even minimal) resource requests.
In the future, we'll therefore assign resource requests (which is good practice anyway) and, if this isn't enough, follow up with PodTopologySpreadConstraints.

Kubernetes Job to create a volume snapshot

I have a job, which I want to run regularly in Kubernetes 1.19.3 (DigitalOcean).
For this job, I need to take a snapshot of a PVC and do stuff to it. I know how can I run a job and mount a volume to the pod it runs, but I have a hard time finding out how to take that snapshot at the beginning of this job.
Is there any way to do it?
The tool of choice to take PV snapshots in K8s is VolumeSnapshots.
The trouble with them is that they don't come yet) with functionality for periodic triggering. So, you would have to create them from a K8s CronJob. However, doing so is not terribly straight forward, since your CronJob Pod would need to have a K8s client installed and require access to the K8s API Server with RBAC.
There are a couple of options to get there, reaching from writing your own image from scratch to using open-source solutions based on the clients from this project k8s client libraries.
Seeing that dynamic K8s manifest applying is somewhat badly supported by K8s, I actually started an open source project myself, that you could use for this purpose: K8sCrud.

Is it possible to schedule a pod to run for say 24 hours and then remove deployment/statefulset? or need to use jobs?

We have a bunch of pods running in dev environment. The pods are auto-provisioned by an application on every business action. The problem is that across various namespaces they are accumulating and eating available resources in EKS.
Is there a way without jenkins/k8s jobs to simply put some parameter on the pod manifest to tell it to self destruct say in 24 hours?
Add to your pod.spec:
activeDeadlineSeconds: 86400
After deadline your Pod will be stopped for good with the status DeadlineExceeded
If I understood your situation properly, you would like to scale your cluster down in order to save resources.
Kubernetes is featured with the ability to autoscale your application in a cluster. Literally, it means that Kubernetes can start additional pods when the load is increasing and terminate excessive pods when the load is decreasing.
It is possible to downscale the application to zero pods, but, in this case, you will have a delay serving the first request while the pod is starting.
This functionality relies on performance metrics. From the practical side, it means that autoscaling doesn't happen instantly, because it takes some time to performance metrics reach the configured threshold.
The mentioned Kubernetes feature called HPA(horizontal pod autoscale) is described in this document.
In case you are running your cluster on GCP or GKE, you are able to go further and automatically start additional nodes for your cluster when you need more computing capacity and shut down nodes when they are not running application pods anymore.
More information about this functionality can be found following the link.
Last, but not least, you can use tool like Ansible to manage all your kubernetes assets (it can create/manage deployments via playbooks).
If you decide to give it a try, you might find this information useful:
Creating a Container cluster in GKE
70% cheaper Kubernetes cluster on AWS
How to build a Kubernetes Horizontal Pod Autoscaler using custom metrics

Is it possible to dynamically add a new pod to an existing deployment?

Just to be clear: I'm not asking about scaling up the number of replicas of a pod - I'm asking about adding a new pod which provides completely new functionality.
So I'm wondering: can I call the Kubernetes API to dynamically add a new pod to an existing deployment?
Deployments are meant to be a homogeneous set of replicas of the same pod template, each presumably providing the same functionality. Deployments keep the desired number of replicas running in the event of crashes and other failures, and facilitate rolling updates of the pods when you need to change configuration or the version of the container image, for example. If you want to run a pod that provides different functionality, do so via a different deployment.
Adding a different pod to an existing deployment is not a viable option. If you want to spin up pods in response to API requests to do some work, there are a handful of officially support client libraries you can use in your API business logic: https://kubernetes.io/docs/reference/using-api/client-libraries/#officially-supported-kubernetes-client-libraries.
You can inject a container to an existing Pod. Not sure whether it would meet your requirement.
You can refer to how istio inject a sidecar proxy to an existing Pod manually. Manual injection

How to speedup rolling-updates on GKE

I need to deploy a web application in gke. The application consists of two pods and needs to scale to ~30 replicas.
Rolling updates take ~30s/pod in our setup.
Old title: How do I enable the deployments API on GKE cluster?
I tried to use deployments as they allow to update multiple pods in parallel.
But, as nshttpd pointed out in #google-containers on the kubernetes slack:
I may be wrong, but GKE clusters don’t have beta features I thought. so if you want Deployments you’ll have to spin up your own cluster.
GKE clusters actually do have beta features. But Deployments were an alpha feature in the 1.1 release (which is the current supported release) and are graduating to beta for the upcoming 1.2 release. Once they are a beta feature, you will be able to use them in GKE.
The rolling update command is really just syntactic sugar around first creating a new replication controller, scaling it up by one, scaling the existing replication controller down by one, and repeating until the old replication controller has size zero. You can do this yourself at a much faster rate if going one pod at a time is too slow. You may also want to file a feature request on github to add a flag to the rolling update command to update multiple pods in parallel.