Kubernetes StatefulSet Downscaling

Currently I am running a Solr cluster on Kubernetes as a StatefulSet. The cluster has 39 pods, one pod per physical node. It has just 1 collection divided into 3 shards; each shard has 13 nodes (or pods), of which 3 are TLOG replicas and 10 are PULL replicas.
The problem I want to discuss is this: I want to autoscale my Solr cluster. Based on some condition, I want to downscale my PULL replica nodes (or pods) to a minimum, so that unnecessary resource consumption is reduced. I know I can use an HPA in Kubernetes to autoscale, but while downscaling I don't want to stop my TLOG nodes (or pods). Similarly, while scaling up I want to add only PULL replicas to my cluster.
Can anyone please help me with this problem?

You can have a different workload for each pod type, e.g. one Deployment (or StatefulSet) for the TLOG pods and another one for the PULL pods. Then you define a fixed number of replicas for the TLOG workload and an HPA for the PULL workload. This allows adding/removing PULL pods only, without any impact on the TLOG pods.
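As a minimal sketch (names, label values, and sizing are hypothetical, and the Solr container configuration is elided), the split could look like this: a fixed-size StatefulSet for the TLOG pods, and an HPA that targets only a second StatefulSet holding the PULL pods:

```yaml
# Fixed-size workload for the TLOG replicas; no HPA is attached,
# so autoscaling never touches these pods.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: solr-tlog               # hypothetical name
spec:
  serviceName: solr-headless    # hypothetical headless Service
  replicas: 9                   # 3 TLOG replicas x 3 shards, kept constant
  selector:
    matchLabels:
      app: solr
      replicaType: tlog
  template:
    metadata:
      labels:
        app: solr
        replicaType: tlog
    spec:
      containers:
      - name: solr
        image: solr:8           # container spec elided for brevity
---
# HPA that scales only the PULL workload.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: solr-pull-hpa           # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: solr-pull             # hypothetical second StatefulSet for PULL pods
  minReplicas: 3                # the minimum PULL footprint to downscale to
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Keep in mind that the HPA only changes the pod count; registering new pods as PULL replicas of the collection (and removing them cleanly on scale-down) still has to happen on the Solr side.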

Related

GKE node pool with Autoscaling does not scale down

I have a GKE cluster with two nodepools. I turned on autoscaling on one of my nodepools but it does not seem to automatically scale down.
I have enabled HPA and that works fine. It scales the pods down to 1 when I don't see traffic.
The API is currently not getting any traffic so I would expect the nodes to scale down as well.
But it still runs the maximum of 5 nodes, despite some nodes using less than 50% of allocatable memory/CPU.
What did I miss here? I am planning to move these pods to bigger machines but to do that I need the node autoscaling to work to control the monthly cost.
There are many reasons that can cause CA not to downscale successfully. To summarize how this should normally work:
- Cluster autoscaler periodically checks (every 10 seconds) the utilization of the nodes.
- If the utilization factor is less than 0.5, the node is considered underutilized.
- The node is then marked for removal and monitored for the next 10 minutes to make sure the utilization factor stays below 0.5.
- If it is still underutilized after those 10 minutes, the node is removed by cluster autoscaler.
If this is not happening, then something else is preventing your nodes from being downscaled. In my experience, PDBs need to be applied to kube-system pods, and I would say that could be the reason; however, there are many possible causes. Here are the ones that most often block downscaling:
1. A PDB is not applied to your kube-system pods. kube-system pods prevent cluster autoscaler from removing the nodes they run on. You can manually add a Pod Disruption Budget (PDB) for kube-system pods that can safely be rescheduled elsewhere, with the following command (a manifest equivalent is sketched after this list):
`kubectl create poddisruptionbudget PDB-NAME --namespace=kube-system --selector app=APP-NAME --max-unavailable 1`
2. Containers using local storage (volumes), even empty volumes. Kubernetes prevents scale-down events on nodes with pods using local storage. Look for this kind of configuration, since it prevents cluster autoscaler from scaling down nodes.
3. Pods annotated with cluster-autoscaler.kubernetes.io/safe-to-evict: false. Look for pods with this annotation, since it prevents node scale-down. (The value true explicitly marks a pod as safe to evict.)
4. Nodes annotated with cluster-autoscaler.kubernetes.io/scale-down-disabled: true. Look for nodes with this annotation, since it prevents cluster autoscaler from considering them. These are the configurations I suggest you check in order to get your cluster to scale down underutilized nodes.
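For reference, a minimal manifest equivalent of the command in item 1 (the name and app label are placeholders; match them to the kube-system pods you want to cover):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-name                # hypothetical name
  namespace: kube-system
spec:
  maxUnavailable: 1             # allow the autoscaler to evict one pod at a time
  selector:
    matchLabels:
      app: app-name             # hypothetical: label of the kube-system pods
```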
Also see the cluster autoscaler FAQ, which explains the configurations that prevent scale-down; one of these may be what is happening to you.

Affinity - Only run x number of pods per node in Kubernetes?

I can only find documentation online for attaching pods to nodes based on labels.
Is there a way to attach pods to nodes based on labels and count - So only x pods with label y?
Our scenario is that we only want to run 3 of our API pods per node.
If a 4th API pod is created, it should be scheduled onto a different node with less than 3 API pods running currently.
Thanks
No, you cannot schedule by the count of pods with a specific label. But you can avoid co-locating your pods on the same node.
Avoid co-locating your pods on the same node
You can use podAntiAffinity with a topologyKey (and taints, if needed) to avoid scheduling pods on the same node. See "Never co-located in the same node" in the Kubernetes affinity documentation.
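A minimal sketch of the hard variant (name, labels, and image are placeholders): each pod refuses to schedule onto a node that already runs a pod with the same app label:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                     # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: api
            topologyKey: kubernetes.io/hostname   # "same node" granularity
      containers:
      - name: api
        image: nginx            # placeholder image
```

Note this enforces at most one such pod per node; Kubernetes has no native "at most x pods with label y per node" constraint, which is why the answer above starts with "no".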

How many pods can be configured per deployment in kubernetes?

As per the Kubernetes documentation there is a 1:1 correspondence between a Deployment and its ReplicaSet. Similarly, depending on the replicas attribute, a ReplicaSet can manage n pods of the same kind. Is this a correct understanding?
Logically (assuming a Deployment is a wrapper/controller) I feel a Deployment can have multiple ReplicaSets, and each ReplicaSet can have multiple Pods (of the same or different kinds). If this statement is correct, can someone share an example K8s template?
1.) Yes, a Deployment is essentially a ReplicaSet, managed at a higher level.
2.) Not in the sense you mean: a Deployment does not run multiple ReplicaSets with different Pod templates side by side. During a rollout it does create a new ReplicaSet and keeps old ones (scaled to zero) for rollback, but only one is active. Typically you never use a ReplicaSet directly; a Deployment is all you need. And no, you can't have different Pod templates in one Deployment or ReplicaSet: the point of replication is to create copies of the same thing.
As to how many pods can run per Deployment, the limits aren't really per Deployment, unless specified. Typically you either set the desired number of replicas in the Deployment or you use the Horizontal Pod Autoscaler with a minimum and a maximum number of Pods. And unless node limits are smaller, the following cluster-wide limits apply:
No more than 110 pods per node
No more than 150,000 total pods
https://kubernetes.io/docs/setup/best-practices/cluster-large/
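Since the question asks for an example template, here is a minimal sketch of a Deployment (name, labels, and image are placeholders). It has exactly one Pod template, and replicas is the only knob controlling how many identical Pods its single ReplicaSet maintains:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment      # hypothetical name
spec:
  replicas: 3                   # the ReplicaSet will maintain 3 identical Pods
  selector:
    matchLabels:
      app: example
  template:                     # exactly one Pod template per Deployment
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: nginx:1.25       # placeholder image
```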
As per the Kubernetes documentation there is a 1:1 correspondence between a Deployment and its ReplicaSet. Similarly, depending on the replicas attribute, a ReplicaSet can manage n pods of the same kind. Is this a correct understanding?
Yes. It will create a number of Pods equal to the value of the replicas field.
A Deployment manages a ReplicaSet; you don't/shouldn't interact with the ReplicaSet directly.
Logically (assuming a Deployment is a wrapper/controller) I feel a Deployment can have multiple ReplicaSets, and each ReplicaSet can have multiple Pods (of the same or different kinds). If this statement is correct, can someone share an example K8s template?
When you do a rolling update, the Deployment creates a new ReplicaSet with the new Pods (updated containers) and scales down the Pods running in the old ReplicaSet.
It does not support running two different ReplicaSets (outside of deployment updates) with different pods/containers.
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#updating-a-deployment
After the deployment has been updated:
Run:
kubectl describe deployments
Output:
.
.
.
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-1564180365 (3/3 replicas created)

Kubernetes: Evenly distribute the replicas across the cluster

We can use a DaemonSet object to deploy one replica on each node. How can we deploy, say, 2 or 3 replicas per node? How can we achieve that? Please let us know.
There is no way to force x pods per node the way a DaemonSet does. However, with some planning, you can force a fairly even pod distribution across your nodes using pod anti-affinity.
Let's say we have 10 nodes. First, we need a Deployment (ReplicaSet) with 30 pods (3 per node). Next, we set the pod anti-affinity to use preferredDuringSchedulingIgnoredDuringExecution with a relatively high weight, matching the Deployment's own labels. This causes the scheduler to prefer not scheduling a pod where the same pod already exists. Once there is 1 pod per node, the cycle starts over: a node with 2 pods is weighted lower than one with 1 pod, so the next pod should go to the less loaded node.
Note this is not as precise as a DaemonSet and may run into some limitations when it comes time to scale the cluster up or down.
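A sketch of that setup, assuming a 10-node cluster (name, labels, image, and the weight are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-app              # hypothetical name
spec:
  replicas: 30                  # ~3 pods per node on a 10-node cluster
  selector:
    matchLabels:
      app: spread-app
  template:
    metadata:
      labels:
        app: spread-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100         # high weight: strongly prefer emptier nodes
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: spread-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: nginx            # placeholder image
```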
A more reliable way, if you will be scaling the cluster, is to simply create multiple DaemonSets with different names but identical in every other way. Since the DaemonSets will have the same labels, they can all be exposed through the same Service.
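A sketch of the multiple-DaemonSet variant (names, labels, and image are hypothetical): two DaemonSets differing only in metadata.name give exactly two pods per node, and because the pods carry the same labels a single Service selects all of them:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: app-ds-1                # hypothetical; app-ds-2 is identical except for the name
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp              # same labels on every DaemonSet's pods
    spec:
      containers:
      - name: app
        image: nginx            # placeholder image
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp                  # matches the pods from all the DaemonSets
  ports:
  - port: 80
```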
By default, the Kubernetes scheduler prefers to schedule pods on different nodes.
The scheduler first determines all possible nodes where a pod can be deployed, based on your affinity/anti-affinity rules, resource limits, etc.
Afterward, the scheduler finds the best node to deploy the pod on. It automatically spreads the pods across separate availability zones and separate nodes, where possible.
You can try this on your own. For example, if you have 3 nodes, try deploying 9 replicas of a pod. You will see that each node ends up with 3 pods running.

Resizing a google cloud Kubernetes cluster to zero not working

I am trying to resize a Kubernetes cluster to zero nodes using
gcloud container clusters resize $CLUSTER_NAME --size=0 --zone $ZONE
I get a success message but the size of the node-pool remains the same (I use only one node pool)
Is it possible to resize the cluster to zero?
Sometimes you just need to wait 10-20 minutes before the autoscaling operation takes effect.
In other cases, you may need to check whether the conditions for downscaling a node are met.
According to the autoscaler documentation:
Cluster autoscaler also measures the usage of each node against the node pool's total demand for capacity. If a node has had no new Pods scheduled on it for a set period of time, and all Pods running on that node can be scheduled onto other nodes in the pool, the autoscaler moves the Pods and deletes the node.
Note that cluster autoscaler works based on Pod resource requests, that is, how many resources your Pods have requested. Cluster autoscaler does not take into account the resources your Pods are actively using. Essentially, cluster autoscaler trusts that the Pod resource requests you've provided are accurate and schedules Pods on nodes based on that assumption.
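Because scale-down decisions are driven by requests rather than live usage, it is worth checking what your Pods actually request. A minimal sketch (name and values are hypothetical): if the sum of these requests keeps every node above the utilization threshold, no node will be removed even when actual usage is near zero:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod             # hypothetical name
spec:
  containers:
  - name: app
    image: nginx                # placeholder image
    resources:
      requests:
        cpu: 100m               # the autoscaler sums these per node,
        memory: 128Mi           # regardless of actual consumption
```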
Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods).
Cluster autoscaler has the following limitations:
- When scaling down, cluster autoscaler supports a graceful termination period for a Pod of up to 10 minutes. A Pod is always killed after a maximum of 10 minutes, even if the Pod is configured with a higher grace period.
Note: Every change you make to the cluster autoscaler causes the Kubernetes master to restart, which takes several minutes to complete.
However, there are cases mentioned in the FAQ that can prevent CA from removing a node:
What types of pods can prevent CA from removing a node?
Pods with restrictive PodDisruptionBudget.
Kube-system pods that:
are not run on the node by default, *
don't have PDB or their PDB is too restrictive (since CA 0.6).
Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
Pods with local storage. *
Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
*Unless the pod has the following annotation (supported in CA 1.0.3 or later):
"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
How can I scale my cluster to just 1 node?
Prior to version 0.6, Cluster Autoscaler did not touch nodes that were running important kube-system pods like DNS, Heapster, Dashboard, etc. If these pods landed on different nodes, CA could not scale the cluster down and the user could end up with a completely empty 3-node cluster. In 0.6, we added an option to tell CA that some system pods can be moved around. If the user configures a PodDisruptionBudget for the kube-system pod, then the default strategy of not touching the node running this pod is overridden with the PDB settings. So, to enable kube-system pod migration, one should set minAvailable to 0 (or <= N if there are N+1 pod replicas). See also: I have a couple of nodes with low utilization, but they are not scaled down. Why?
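For example, a PDB like the following sketch (name and label are hypothetical) tells CA that the matching kube-system pods may all be evicted when draining a node:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-system-pdb         # hypothetical name
  namespace: kube-system
spec:
  minAvailable: 0               # every matching pod may be evicted
  selector:
    matchLabels:
      k8s-app: some-system-app  # hypothetical label of the kube-system pods
```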
How can I scale a node group to 0?
From CA 0.6 for GCE/GKE and CA 0.6.1 for AWS, it is possible to scale a node group to 0 (and obviously from 0), assuming that all scale-down conditions are met.
For AWS, if you are using nodeSelector, you need to tag the ASG with a node-template key "k8s.io/cluster-autoscaler/node-template/label/<label-name>".
For example, for a node label of foo=bar, you would tag the ASG with:
{
  "ResourceType": "auto-scaling-group",
  "ResourceId": "foo.example.com",
  "PropagateAtLaunch": true,
  "Value": "bar",
  "Key": "k8s.io/cluster-autoscaler/node-template/label/foo"
}