Dynamic scaling best practices for StatefulSets - MongoDB

Background
I have an app running in a Kubernetes cluster that uses sharded MongoDB and Elasticsearch StatefulSets. I set up Horizontal Pod Autoscalers for the Deployment components of my app and everything works well.
Problems
Problems arise when traffic goes up. My server Deployment scales out just fine, but the MongoDB shards and Elasticsearch nodes cannot handle that much traffic and throttle the overall response time.
The simple solution is to configure those StatefulSets with more shards and more replicas. What bugs me is that the traffic spike only happens 3-4 hours a day, so it feels wasteful to leave all those pods sitting idle for the rest of the day.
I did some research, and it looks like databases in general are not supposed to scale out/in dynamically, since doing so consumes a lot of network and disk I/O just for replication between nodes. There is also a potential for data loss and inconsistency while scaling up or down.
Questions
If possible, what is the proper way to handle dynamic scaling in MongoDB, Elasticsearch... and databases in general?
If it is not possible, what can I do to save some cents off my cloud bill, given that we only need maximum power from the database pods for a short period each day?

You should read about Kubernetes autoscaling - HPA.
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can't be scaled, for example, DaemonSets.
The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match the observed average CPU utilization to the target specified by user.
With HPA you also have to take care of volume mounting and data latency.
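As a rough sketch (the StatefulSet name and the thresholds below are placeholders, not taken from the question), an HPA that targets a StatefulSet looks like this:

```yaml
# Sketch: an autoscaling/v2 HPA pointed at a StatefulSet.
# "mongodb-shard" and the numbers are illustrative only.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mongodb-shard
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: mongodb-shard
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Keep in mind that the HPA only changes spec.replicas; it does not rebalance MongoDB shards or move data around, which is one reason the database-native scaling options mentioned below are usually preferable for stateful workloads.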
As @Serge mentioned in the comments, I would suggest checking the native cluster scaling options provided by MongoDB and Elasticsearch themselves.
Take a look at
MongoDB operator documentation
Elasticsearch operator documentation
Elasticsearch future release autoscaling
I am not very familiar with MongoDB and Elasticsearch on Kubernetes, but maybe these tutorials will help you:
https://medium.com/faun/scaling-mongodb-on-kubernetes-32e446c16b82
https://www.youtube.com/watch?v=J7h0F34iBx0
https://kubernetes.io/blog/2017/01/running-mongodb-on-kubernetes-with-statefulsets/
https://sematext.com/blog/elasticsearch-operator-on-kubernetes/#toc-what-is-the-elasticsearch-operator-1
If you use Helm, take a look at the Banzai Cloud Horizontal Pod Autoscaler operator:
You may not want, nor be able, to edit a Helm chart just to add an autoscaling feature. Nearly all charts support custom annotations, so we believe it would be a good idea to be able to set up autoscaling just by adding some simple annotations to your deployment.
We have open sourced a Horizontal Pod Autoscaler operator. This operator watches your Deployment or StatefulSet and automatically creates a HorizontalPodAutoscaler resource, should you provide the correct autoscale annotations.
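Purely for illustration, the annotation-driven setup looks roughly like the snippet below; the annotation keys are recalled from the operator's README and should be treated as an assumption to verify against the current project documentation:

```yaml
# Illustrative only -- verify the exact annotation keys in the
# Banzai Cloud hpa-operator README before using them.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    hpa.autoscaling.banzaicloud.io/minReplicas: "1"
    hpa.autoscaling.banzaicloud.io/maxReplicas: "3"
    cpu.hpa.autoscaling.banzaicloud.io/targetAverageUtilization: "70"
spec:
  # ... the usual Deployment spec (selector, template, containers) stays unchanged
```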
Hope you find this useful.

Related

Scaling a Kubernetes cluster to process jobs in a queue?

(I'm new to Kubernetes and not sure this is best practice)
I have a pipeline of jobs in a Firestore database that need to be completed as quickly as possible.
I want to set up a Kubernetes cluster (on GKE) that will scale up when there is a large backlog of tasks to complete. Each pod/node needs a single GPU to complete the task.
Is it possible to use a cloud function to manually scale the number of nodes depending on the number of jobs in the pipeline?
I was planning on using the clusters.nodePools.setSize() function from the GKE client library but I'm not sure if this is just intended for initial cluster setup rather than manual scaling.
Thanks
https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters.nodePools/setSize
You can configure and use Horizontal Pod Autoscaling on your cluster to scale the number of Pods.
As suggested by @somethingsomething, refer to these links on the Horizontal Pod Autoscaler and Cluster Autoscaler:
The Horizontal Pod Autoscaler changes the shape of your Kubernetes workload by automatically increasing or decreasing the number of Pods in response to the workload's CPU or memory consumption, or in response to custom metrics reported from within Kubernetes or external metrics from sources outside of your cluster.
Horizontal Pod autoscaling helps to ensure that your workload functions consistently in different situations, and allows you to control costs by only paying for extra capacity when you need it.
It's not always easy to predict the indicators that show whether your workload is under-resourced or under-utilized. The Horizontal Pod Autoscaler can automatically scale the number of Pods in your workload based on one or more metrics.
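To make this concrete for a GPU-based job queue: as long as the worker pods request the GPU explicitly and the GPU node pool has autoscaling enabled, pods that cannot be scheduled stay Pending and the cluster autoscaler adds nodes for them. A minimal sketch of such a worker Job, with placeholder names and image:

```yaml
# Sketch: a GPU worker Job. When several of these are created and no node has
# a free GPU, the pods stay Pending and the cluster autoscaler scales up the
# autoscaling-enabled GPU node pool. Names and image are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: pipeline-worker
spec:
  template:
    spec:
      restartPolicy: Never
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: nvidia-tesla-t4   # depends on your node pool
      containers:
        - name: worker
          image: gcr.io/my-project/pipeline-worker:latest     # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # one GPU per pod, as described in the question
```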

Is it possible to use Kubernetes autoscale on cron job pods

Some context: I have multiple cron jobs running daily, weekly, and hourly, some of which require significant processing power.
I would like to add requests and limits to these cron job pods to enable vertical scaling and to ensure that the assigned node has enough capacity when the pod is initialized. This would keep me from having to keep multiple large nodes available at all times, and would also let me easily change how many crons I run in parallel.
I would like to try and avoid timed scaling since the cron jobs processing time can increase as the application grows.
Edit - Additional Information :
Currently I am using DigitalOcean and utilizing its UI for cluster autoscaling. I have it working with HPAs on Deployments, but not on crons. To my knowledge, adding limits to crons does not trigger cluster autoscaling.
I have tried to enable HPA scaling with the cron, but with no success. Basically, the pod just sits in a Pending status, signalling that there is insufficient CPU available, and no new node is created.
Does HPA scaling work with cron job pods and is there a way to achieve the same type of scaling?
HPA is used to scale more pods when pod loads are high, but this won't increase the resources on your cluster.
I think you're looking for the cluster autoscaler (available on AWS, GKE and Azure), which will increase cluster capacity when pods can't be scheduled.
This is a Community Wiki answer so feel free to edit it and add any additional details you consider important.
As Dom already mentioned, "this won't increase the resources on your cluster." More specifically, it won't create an additional node, as the Horizontal Pod Autoscaler doesn't have such a capability and in fact has nothing to do with cluster scaling. Its name is pretty self-explanatory. HPA is only able to scale Pods, and it scales them horizontally; in other words, it can automatically increase or decrease the number of replicas of your "replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics)", as per the docs.
As to cluster autoscaling, as already said by Dom, such solutions are implemented in so-called managed Kubernetes offerings such as GKE on GCP, EKS on AWS or AKS on Azure, and many more. You typically don't need to install anything extra to use them, as they are available out of the box.
You may wonder how HPA and CA fit together. It's really well explained in the FAQ section of the Cluster Autoscaler project:
How does Horizontal Pod Autoscaler work with Cluster Autoscaler?
Horizontal Pod Autoscaler changes the deployment's or replicaset's number of replicas based on the current CPU load. If the load increases, HPA will create new replicas, for which there may or may not be enough space in the cluster. If there are not enough resources, CA will try to bring up some nodes, so that the HPA-created pods have a place to run. If the load decreases, HPA will stop some of the replicas. As a result, some nodes may become underutilized or completely empty, and then CA will terminate such unneeded nodes.
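To tie this back to cron jobs: the piece that makes the cluster autoscaler react is simply the resource request on the job's pod template; if the request cannot be satisfied on any existing node, the pod stays Pending and the autoscaler (where enabled) adds a node. A rough sketch with placeholder names, schedule and sizes:

```yaml
# Sketch: a CronJob whose pods carry explicit requests/limits so that an
# unsatisfiable request leaves the pod Pending and triggers the cluster
# autoscaler. On older clusters the API group is batch/v1beta1.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: heavy-report
spec:
  schedule: "0 2 * * *"        # placeholder schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: registry.example.com/report:latest   # placeholder image
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
                limits:
                  cpu: "2"
                  memory: 4Gi
```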

Container CPU request value for Kubernetes

I have several microservices and I am deploying them to GCP Kubernetes. I am using the free credits and trying out my deployments. My question is: when we define CPU requests, what do we base them on? I have set it to 250 mCPU, but that fills up my cluster nodes, which are small in terms of CPU.
Currently I have 3 nodes with 940 mCPU of allocatable CPU, and 3 more nodes of the same kind. Now I deployed one API with 3 replicas and assigned 250 mCPU to each. With all the Kubernetes internal items, all nodes are almost full.
So my question is: what should we base the CPU value for a service on? 250 mCPU was a random value. What do others do to find a minimum CPU value for Kubernetes? I have one ASP.NET Core API and 8 Node.js APIs. If it's based on usage, what are the best values to start with for a new product?
Mostly by doing stress testing on your application. Also, you could use the Vertical Pod Autoscaler in "recommendation" mode, which will monitor your application for some time and then make a recommendation for you to set the limits.
Documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler
Remember that you cannot use a Vertical Pod Autoscaler and a Horizontal Pod Autoscaler at the same time, unless the Vertical Pod Autoscaler is in recommendation mode.
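As a sketch, recommendation-only mode is just an updateMode: "Off" policy on the VPA object (the target Deployment name is a placeholder):

```yaml
# Sketch: Vertical Pod Autoscaler in recommendation-only mode.
# It writes suggested requests into its status instead of evicting pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api          # placeholder Deployment name
  updatePolicy:
    updateMode: "Off"     # only recommend, never apply
```

After it has watched the workload for a while, kubectl describe vpa my-api-vpa shows the recommended requests.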

Duplicate metrics with multiple instances of kube-state-metrics

Problem:
Duplicate data when querying Prometheus for metrics from kube-state-metrics.
Sample query and result with 3 instances of kube-state-metrics running:
Query:
kube_pod_container_resource_requests_cpu_cores{namespace="ns-dummy"}
Metrics
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.142:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-34-25.ec2.internal",pod="app1-appname-6bd9d8d978-gfk7f",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.142:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-35-22.ec2.internal",pod="app2-appname-ccbdfc7c8-g9x6s",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.17:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-34-25.ec2.internal",pod="app1-appname-6bd9d8d978-gfk7f",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.17:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-35-22.ec2.internal",pod="app2-appname-ccbdfc7c8-g9x6s",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.37.171:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-34-25.ec2.internal",pod="app1-appname-6bd9d8d978-gfk7f",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.37.171:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-35-22.ec2.internal",pod="app2-appname-ccbdfc7c8-g9x6s",service="prom-kube-state-metrics"}
Observation:
Every metric comes up N times when N pods of kube-state-metrics are running. With only a single pod running, we get the correct info.
Possible solutions:
Scale down to single instance of kube-state-metrics. (Reduced availability is a concern)
Enable sharding. (Solves duplication problem, still less available)
According to the docs, for horizontal scaling we have to pass sharding arguments to the pods.
Shards are zero indexed. So we have to pass the index and total number of shards for each pod.
We are using a Helm chart, and it is deployed as a Deployment.
Questions:
How can we pass different arguments to different pods in this scenario, if it's possible?
Should we be worried about availability of the kube-state-metrics considering the self-healing nature of k8s workloads?
When should we really scale it to multiple instances and how?
You could rely on a 'self-healing' Deployment with only a single replica of kube-state-metrics: if the container goes down, the Deployment will start a new one. kube-state-metrics is not focused on the health of the individual Kubernetes components, but rather on the health of the various objects inside the cluster, such as deployments, nodes and pods, so a brief gap will only affect you if your cluster is very big and produces many object changes per second.
For a small cluster there is no problem using it this way, but if you really need a highly available monitoring platform, I recommend you take a look at these two articles:
creating a well designed and highly available monitoring stack for kubernetes and
kubernetes monitoring
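Regarding the first question above (how to pass different arguments to different pods): the usual approach is to run kube-state-metrics as a StatefulSet and let each pod derive its shard from its own ordinal via the automated-sharding flags, so you never have to template per-pod arguments yourself. A sketch is below; the --pod and --pod-namespace flags are taken from the kube-state-metrics sharding documentation as I recall it, so verify them against the version you run (RBAC objects omitted):

```yaml
# Sketch: kube-state-metrics automated sharding via a StatefulSet.
# Each pod derives its shard index from its ordinal; with sharding enabled,
# each pod exposes only its own slice, which also removes the duplicate
# series seen in the question. Image tag is an example.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kube-state-metrics
spec:
  serviceName: kube-state-metrics
  replicas: 2
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          args:
            - --pod=$(POD_NAME)
            - --pod-namespace=$(POD_NAMESPACE)
          ports:
            - name: http
              containerPort: 8080
```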

What is needed to enable a pod to control another deployment in kubernetes?

I'm trying to figure out the pieces, and how to fit them together, for having a pod be able to control aspects of a deployment, like scaling. I'm thinking I need to set up a service account for it, but I'm not finding information on how to link it all together, and then how to get the pod to use the service account. I'll be writing this in Python, which might add to the complexity of how to use the service account.
Try to set up the Horizontal Pod Autoscaler.
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Horizontal Pod Autoscaling does not apply to objects that can’t be scaled, for example, DaemonSets.
The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match the observed average CPU utilization to the target specified by user.
Documentation: hpa-setup, autoscaling.
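For the service-account part of the question (a pod controlling another Deployment), the usual pieces are a ServiceAccount, a Role allowing access to the Deployment and its scale subresource, a RoleBinding joining the two, and serviceAccountName in the controlling pod's spec; the official Python client then picks up the mounted token via config.load_incluster_config(). A sketch with placeholder names:

```yaml
# Sketch: RBAC wiring so a pod can scale Deployments in its own namespace.
# All names and the namespace are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: scaler
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
  namespace: default
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "list", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-scaler
  namespace: default
subjects:
  - kind: ServiceAccount
    name: scaler
    namespace: default
roleRef:
  kind: Role
  name: deployment-scaler
  apiGroup: rbac.authorization.k8s.io
# The controlling pod then references the account in its spec:
#   spec:
#     serviceAccountName: scaler
```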