Kubernetes: Evenly distribute replicas across the cluster

We can use a DaemonSet object to deploy one replica on each node. How can we deploy, say, 2 or 3 replicas per node? How can we achieve that?

There is no way to force x pods per node the way a DaemonSet does. However, with some planning, you can force a fairly even pod distribution across your nodes using pod anti-affinity.
Let's say we have 10 nodes. First, we need a Deployment (ReplicaSet) with 30 pods (3 per node). Next, we set pod anti-affinity to preferredDuringSchedulingIgnoredDuringExecution with a relatively high weight, matching on the Deployment's own labels. This makes the scheduler prefer nodes that are not already running one of these pods. Once every node has 1 pod, the cycle starts over: a node with 2 pods scores lower than a node with 1 pod, so the next pod should land on the less-loaded node.
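A minimal sketch of such a Deployment, assuming the pods carry a label app: my-api (the name, label, and image are illustrative, not from the question):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 30              # 3 pods per node across 10 nodes
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: prefer nodes that are not already running a pod with this label.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: my-api
              topologyKey: kubernetes.io/hostname
      containers:
      - name: my-api
        image: my-api:latest    # illustrative image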
Note this is not as precise as a DaemonSet and may run into some limitations when it comes time to scale the cluster up or down.
A more reliable approach, if the cluster will be scaled, is simply to create multiple DaemonSets with different names but otherwise identical. Since the DaemonSets carry the same pod labels, they can all be exposed through the same Service.
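For illustration, a sketch of that pattern, assuming two DaemonSets named my-api-a and my-api-b that share the pod label app: my-api behind one Service (all names and ports are made up):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-api-a            # the second DaemonSet is identical except for its name, e.g. my-api-b
spec:
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api         # the same pod label on every DaemonSet
    spec:
      containers:
      - name: my-api
        image: my-api:latest
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api             # selects pods from all of the DaemonSets
  ports:
  - port: 80
    targetPort: 8080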

By default, the Kubernetes scheduler will prefer to schedule pods on different nodes.
The scheduler first determines all feasible nodes for a pod based on your affinity/anti-affinity rules, resource requests, and so on.
It then scores those nodes and picks the best one. Where possible, it will automatically spread the pods across separate availability zones and separate nodes.
You can try this on your own. For example, if you have 3 nodes, try deploying 9 replicas of a pod. You will see that each node ends up running 3 pods.
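A trivial Deployment for that experiment, with no affinity rules at all (the name is illustrative; any small image works):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-test
spec:
  replicas: 9               # on a 3-node cluster the default scheduler should spread these roughly 3 per node
  selector:
    matchLabels:
      app: spread-test
  template:
    metadata:
      labels:
        app: spread-test
    spec:
      containers:
      - name: spread-test
        image: nginx:1.25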

Related

Kubernetes StatefulSets - run pod on every worker node

What is the easiest way to run a single Pod on every available worker node as part of a StatefulSet, i.e. a one-to-one mapping?
Am I right to say that every Pod will run on a different Node by default with a StatefulSet? If so, is it sufficient to add x Pods to the StatefulSet when x worker nodes exist in the cluster?
Thanks.
Use DaemonSet instead.
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
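For reference, a minimal DaemonSet sketch that runs one pod on every schedulable worker node (the name and image are illustrative):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest    # illustrative image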
If you really want to use a StatefulSet, you can take a look at features like nodeSelector or affinity and anti-affinity.

Kubernetes Statefulset Downscaling

Currently I am running a Solr cluster on Kubernetes as a StatefulSet. The cluster has 39 pods, with a single pod per physical node. It has just 1 collection divided into 3 shards; each shard has 13 nodes (or pods), and of those 13, 3 are TLOG replicas and 10 are PULL replicas.
The problem I want to discuss is this: I want to autoscale my Solr cluster. Based on some condition, I want to downscale my PULL replica nodes (or pods) to a minimum so that unnecessary resource consumption is reduced. I know I can use an HPA in Kubernetes to autoscale, but while downscaling I don't want to stop my TLOG nodes (or pods), and while scaling up I want to add only PULL replicas to my cluster.
Can anyone please help me with this problem?
You can have a different Deployment for each pod type, e.g. one Deployment for TLOG pods and another one for PULL pods. Then you can define a fixed number of replicas for the TLOG pods and an HPA for the PULL pods. This allows adding/removing PULL pods only, without any impact on the TLOG pods.
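As an illustration, a minimal sketch of an HPA targeting a hypothetical solr-pull Deployment (the names, metric, and thresholds are assumptions, not taken from the question):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: solr-pull
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: solr-pull          # the PULL-replica Deployment
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

The TLOG Deployment simply keeps a fixed spec.replicas and is never referenced by an HPA.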

Affinity - Only run x number of pods per node in Kubernetes?

I can only find documentation online for attaching pods to nodes based on labels.
Is there a way to attach pods to nodes based on labels and a count, i.e. only x pods with label y?
Our scenario is that we only want to run 3 of our API pods per node.
If a 4th API pod is created, it should be scheduled onto a different node that currently has fewer than 3 API pods running.
Thanks
No, you cannot schedule by a count of a specific label. But you can avoid co-locating your pods on the same node.
Avoid co-locating your pods on the same node
You can use podAntiAffinity with a topologyKey (and taints, if needed) to avoid scheduling pods on the same node. See "Never co-located in the same node" in the documentation.
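A minimal sketch of the relevant anti-affinity stanza inside the pod template, assuming the API pods carry a label app: api (the label is an assumption):

affinity:
  podAntiAffinity:
    # Hard rule: never place two pods labelled app=api on the same node.
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: api
      topologyKey: kubernetes.io/hostname

Note that this hard rule caps you at one such pod per node rather than three; there is no built-in per-node pod count.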

How to convert DaemonSets to kind Deployment

I have already deployed pods using a DaemonSet with a nodeSelector. My requirement is to use kind Deployment, but at the same time I want to retain the DaemonSet's behaviour: I have a nodeSelector defined so that the pod is placed on the labelled nodes.
How can I achieve this? Your help is appreciated.
My requirement is that pods should be placed automatically based on the nodeSelector, but with kind Deployment.
In other words:
Using a ReplicationController, when I schedule 2 (two) replicas of a pod I expect 1 (one) replica on each Node (VM). Instead, I find that both replicas are created on the same node. This makes that node a single point of failure, which I need to avoid.
I have labelled both nodes properly, and I can still see both pods spawned on a single node. How can I ensure the two pods are always scheduled on different nodes?
Look into affinity and anti-affinity, specifically, inter-pod affinity and anti-affinity.
From official documentation:
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes. The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y”.
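For concreteness, a minimal sketch of a two-replica Deployment that uses required pod anti-affinity so the replicas land on different labelled nodes (the name, labels, and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        mylabel: myvalue        # hypothetical node label, matching the labelled nodes
      affinity:
        podAntiAffinity:
          # Hard rule: do not place two pods with app=my-app on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname
      containers:
      - name: my-app
        image: my-app:latest    # illustrative image

With the required (hard) rule, the second pod stays Pending if no second matching node is available; switching to preferredDuringSchedulingIgnoredDuringExecution relaxes this.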

Does HorizontalPodAutoscaler make sense when there is only one Deployment on GKE (Google Container Engine) Kubernetes cluster?

I have a "homogeneous" Kubernetes setup. By this I mean that I am only running instances of a single type of pod (an http server) with a load balancer service distributing traffic to them.
By my reasoning, to get the most out of my cluster (edit: to be concrete -- getting the best average response times to http requests) I should have:
At least one pod running on every node: Not having a pod running on a node, means that I am paying for the node and not having it ready to serve a request.
At most one pod running on every node: The pods are threaded http servers, so a single pod can already maximize utilization of a node; running multiple pods on a node does not gain me anything.
This means that I should have exactly one pod per node. I achieve this using a DaemonSet.
The alternative way is to configure a Deployment and apply a HorizontalPodAutoscaler to it and have Kubernetes handle the number of pods and pod to node mapping. Is there any disadvantage of my approach in comparison to this?
My evaluation is that the HorizontalPodAutoscaler is relevant mainly in heterogeneous situations, where one HorizontalPodAutoscaler can scale up a Deployment at the expense of another Deployment. But since I have only one type of pod, I would have only one Deployment and I would be scaling up that deployment at the expense of itself, which does not make sense.
HorizontalPodAutoscaler is actually a valid solution for your needs. To address your two concerns:
1. At least one pod running on every node
This isn't your real concern. The concern is underutilizing your cluster. However, you can be underutilizing your cluster even if you have a pod running on every node. Consider a three-node cluster:
Scenario A: pod running on each node, 10% CPU usage per node
Scenario B: pod running on only one node, 70% CPU usage
Even though Scenario A has a pod on each node, the cluster is actually less utilized than in Scenario B, where only one node has a pod.
2. At most one pod running on every node
The Kubernetes scheduler tries to spread pods around so that you don't end up with multiple pods of the same type on a single node. Since in your case the other nodes should be empty, the scheduler should have no problems starting the pods on the other nodes. Additionally, if the pod requests resources roughly equivalent to a node's allocatable resources, that will prevent the scheduler from scheduling a second pod on a node that already has one.
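A minimal sketch of such a request in the pod spec, assuming nodes with roughly 4 allocatable CPUs and 8 GiB of allocatable memory (the numbers and names are illustrative):

containers:
- name: http-server            # hypothetical container name
  image: http-server:latest    # illustrative image
  resources:
    requests:
      cpu: "3500m"    # just under the node's ~4 allocatable CPUs
      memory: 7Gi     # just under the node's ~8 GiB allocatable memory

Because scheduling decisions are based on requests, a second such pod cannot fit on a node that already runs one.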
Now, you can achieve the same effect whether you go with a DaemonSet or an HPA, but I personally would go with the HPA since I think it fits your semantics better, and it would also work much better if you eventually decide to add other types of pods to your cluster.
Using a DaemonSet means that the pod has to run on every node (or some subset of nodes). That is a great fit for something per-node like a logger or a metrics collector. But you really just want to use available cluster resources to power your pods as needed, which matches the intent of an HPA better.
As an aside, I believe GKE supports cluster autoscaling, so you should never be paying for nodes that aren't needed.