I have deployed HashiCorp's Vault in my Kubernetes clusters on AWS using its Helm chart.
The number of replicas in the deployment is specified as 3.
Out of these 3 pods, 1 was ready (1/1) while the other two replicas were not ready (0/1).
I killed the ready pod, and while I expected Kubernetes to deploy a single new pod to replace it, it deployed two new pods.
Now I have two ready pods and two not-ready pods. When I delete one of these pods, Kubernetes recreates only one, so I am left with 4 pods instead of 3 for my Vault deployment.
What could be the reason behind this and how can we prevent this?
Your deployment is not working because HA (high availability) is not available when using the s3 storage backend. You'll need HashiCorp's Consul, AWS's DynamoDB, or another backend provider that supports HA. Change the number of replicas to 1 if you're sticking with the s3 backend provider (a rough sketch is below).
As for why you're seeing 4 pods instead of 3, you need to provide more details. Paste the output of kubectl get pods -l app=vault as well as kubectl describe deploy -l app=vault and I will update this answer.
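If you stay on s3, the change could look roughly like the values override below. This is only a sketch; the key names are an assumption based on HashiCorp's official vault-helm chart and may differ in your chart version, so verify them against your chart's values.yaml.

# values.yaml sketch (keys assumed from HashiCorp's vault-helm chart; verify for your chart version)
server:
  standalone:
    enabled: true   # run a single Vault server, since s3 does not support HA
  ha:
    enabled: false  # only enable HA with a backend that supports it (Consul, DynamoDB, ...)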
I can only offer speculation, for what it's worth. With Deployment objects there's a maxSurge property that allows rolling updates to scale up beyond the desired number of replicas. It defaults to 25%, rounded up, which in your case works out to one additional pod.
Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that
specifies the maximum number of Pods that can be created over the
desired number of Pods. The value can be an absolute number (for
example, 5) or a percentage of desired Pods (for example, 10%). The
value cannot be 0 if MaxUnavailable is 0. The absolute number is
calculated from the percentage by rounding up. The default value is
25%.
For example, when this value is set to 30%, the new ReplicaSet can be
scaled up immediately when the rolling update starts, such that the
total number of old and new Pods does not exceed 130% of desired Pods.
Once old Pods have been killed, the new ReplicaSet can be scaled up
further, ensuring that the total number of Pods running at any time
during the update is at most 130% of desired Pods.
It's possible that deleting the one Running (1/1) pod, combined with the NotReady state of the other pods, put your Deployment into a rolling-update-like state that allowed it to scale up to its maxSurge limit.
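If you want to rule that behaviour out, you can pin the rollout strategy so the Deployment never creates pods above the desired count. A minimal sketch of the relevant part of the Deployment spec (the rest of the spec is omitted):

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # never run more than .spec.replicas pods during a rollout
      maxUnavailable: 1  # must be non-zero when maxSurge is 0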
When you have such problems, you should
kubectl describe pod <PROBLEMATIC_POD>
and check the Events section in the lower part of the output.
Some reasons for your pods not starting may be:
no available nodes with enough resources for your requests
no available volumes
some anti-affinity rules and not enough nodes, so the scheduler cannot assign nodes to your pods.
Related
I have a terraform-managed EKS cluster. It used to have 2 nodes on it. I doubled the number of nodes (4).
I have a kubernetes_deployment resource that automatically deploys a fixed number of pods to the cluster. It was set to 20 when I had 2 nodes, and seemed evenly distributed with 10 each. I doubled that number to 40.
All of the new pods for the kubernetes deployment are being scheduled on the first 2 (original) nodes. Now the two original nodes have 20 pods each, while the 2 new nodes have 0 pods. The new nodes are up and ready to go, but I cannot get kubernetes to schedule the new pods on those new nodes.
I am unsure where to even begin searching, as I am fairly new to k8s and ops in general.
A few beginner questions that may be related:
I'm reading about pod affinity, and it seems like I could tell k8s to have a pod ANTI affinity with itself within a deployment. However, I am having trouble setting up the anti-affinity rules. I see that the kubernetes_deployment resource has a scheduling argument, but I can't seem to get the syntax right.
Naively it seems that the issue may be that the deployment somehow isn't aware of the new nodes. If that is the case, how could I reboot the entire deployment (without taking down the already-running pods)?
Is there a cluster level scheduler that I need to set? I was under the impression that the default does round robin, which doesn't seem to be happening at the node level.
EDIT:
The EKS terraform module node_groups submodule has fields for desired/min/max_capacity. To increase my worker nodes, I just increased those numbers. The change is reflected in the aws eks console.
Check a couple of things:
Do your nodes show up correctly in the output of kubectl get nodes -o wide, and are they in the Ready state?
Instead of pod anti-affinity, look into pod topology spread constraints; a hard anti-affinity rule stops scheduling pods once you have more pods than nodes. A minimal sketch is shown below.
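For illustration, a minimal topology spread constraint sketch; the app: my-app label is a placeholder for your deployment's pod labels, and with the Terraform kubernetes_deployment resource you would express the same fields in HCL.

# goes under .spec.template.spec of the Deployment
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname   # spread across nodes
  whenUnsatisfiable: ScheduleAnyway     # prefer even spreading without blocking scheduling
  labelSelector:
    matchLabels:
      app: my-app                       # placeholder; use your deployment's pod labels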
I was wondering what added value the PodDisruptionBudget gives.
As far as I understand, a PodDisruptionBudget promises that a certain number of pods will always remain available in the cluster, and there are 2 options for expressing this: minAvailable / maxUnavailable.
Now, when I define a ReplicaSet, I define how many replicas I want. So if, for example, I define 2, there won't be fewer than 2 replicas. Then what does the PodDisruptionBudget add?
A PodDisruptionBudget helps ensure zero downtime for an application, which a ReplicaSet on its own can't guarantee.
The following post explains with an example how PodDisruptionBudget can be useful in achieving zero downtime for an application:
Quoting the post, a node upgrade is a normal scenario, as described there:
Let's consider a scenario: we need to upgrade the node version or update the spec, which happens often. Cluster downscaling is also a normal condition. In these cases, the pods running on the to-be-deleted nodes need to be drained.
kubectl drain is performed on one of the nodes for the upgrade:
We need to remove node1 from the pool, which we cannot do by detaching it instantly, as that will lead to the termination of all the pods running there, which can bring services down. The first step before detaching a node is to make it unschedulable.
Running kubectl get pods -w will show the pods running on the node going into a Terminating state, which leads to downtime:
If you quickly check the pods with kubectl get pods, you will see that all the running pods which were scheduled on node1 are terminated instantly. This could lead to downtime! If you are running a small number of pods and all of them are scheduled on the same node, it will take some time for the pods to be scheduled on another node.
A PodDisruptionBudget with minAvailable is useful in such scenarios to achieve zero downtime. A ReplicaSet will only ensure that the desired number of replicas is eventually created on other nodes during the process.
If you just have a ReplicaSet with one replica and no PodDisruptionBudget specified, the pod will be terminated and then a new pod will be created on another node, leaving a window with no pod running at all. This is where a PDB provides an added advantage over the ReplicaSet; a minimal example is sketched after the quote below.
For the PodDisruptionBudget to work, there must be at least 2 pods running for a label selector; otherwise, the node cannot be drained gracefully and it will be evicted forcefully when the grace time ends.
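For illustration, a minimal PodDisruptionBudget sketch; the name and the app: my-app label are placeholders for your own selector.

apiVersion: policy/v1        # use policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2            # evictions are refused if they would leave fewer than 2 pods
  selector:
    matchLabels:
      app: my-app            # placeholder; must match the pods' labels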
Then what does the PodDisruptionBudget add?
It helps if you have an application where you want high availability, e.g. one where it takes time to rebuild a cache after each crash.
There are both voluntary and involuntary disruptions. A PodDisruptionBudget can only limit the former, but both count against the budget.
An example of a voluntary disruption is when an employee of your platform team decides to upgrade the kernel on all your nodes; sometimes you want to do this slowly, since all Pods on a node will be terminated and rescheduled onto a different node.
There are also involuntary disruptions, e.g. a disk crash on one of your nodes.
So if, for example, I define 2, there won't be fewer than 2 replicas. Then what does the PodDisruptionBudget add?
That "never fewer than 2" guarantee during evictions is exactly what minAvailable: 2 expresses. And note that maxAvailable is the wrong name; the field is called maxUnavailable.
I want to access the number of replicas and also the current replica id for a given pod, from inside the pod itself.
For example, if there are 3 replicas of a given pod, say foo_A, foo_B and foo_C, created in that specific order, is it possible to make the total number of replicas and the index of a pod within the replica set available inside the pod itself?
Also, I understand that with old pods getting killed and new ones coming up, the index of a pod within the replica set can change dynamically.
I know this can be achieved using the Downward API, but which fields should I access?
Could anyone please help?
As mentioned in comments, you can use StatefulSets:
StatefulSet Pods have a unique identity that is comprised of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled on.
As you can see here, your pods will be created in an ordinal sequence:
For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
Before a Pod is terminated, all of its successors must be completely shutdown.
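On the Downward API part of the question: there is no field that exposes the replica count or the index directly, but in a StatefulSet the ordinal is the suffix of the pod name, so you can expose metadata.name and parse it. A sketch of the relevant container env (the variable names and the replica count are made up):

env:
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name   # e.g. "foo-0", "foo-1"; the trailing number is the ordinal
- name: REPLICA_COUNT
  value: "3"                     # no Downward API field exists for this; pass it in yourself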
In a project, I'm enabling the cluster autoscaler functionality from Kubernetes.
According to the documentation: How does scale down work, I understand that when a node has been used at less than 50% of its capacity for a given time, it is removed, together with all of its pods, which will be recreated on a different node if needed.
But the following problem can happen: what if all the pods belonging to a specific deployment are running on a node that is being removed? That would mean users might experience downtime for that deployment's application.
Is there a way to avoid that the scale down deletes a node whenever there is a deployment which only contains pods running on that node?
I have checked the documentation, and one possible (but not good) solution is to add an annotation to all of the pods running these applications, but this clearly would not scale the cluster down in an optimal way.
In the same documentation:
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
What is eviction?
The eviction subresource of a pod can be thought of as a kind of policy-controlled DELETE operation on the pod itself.
Ok, but what if all pods get evicted at the same time on the node?
You can use a PodDisruptionBudget to make sure a minimum number of replicas is always available:
What is PDB?:
A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
In k8s docs you can also read:
A PodDisruptionBudget has three fields:
A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
So if you use a PDB for your deployment, its pods should not all get deleted at once.
But please note that if the node fails for some other reason (e.g. hardware failure), you will still experience downtime. If you really care about high availability, consider using pod anti-affinity to make sure the pods are not all scheduled on one node; a minimal sketch follows below.
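A minimal anti-affinity sketch for the pod template, assuming the pods carry a placeholder label app: my-app (this goes under .spec.template.spec.affinity):

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      topologyKey: kubernetes.io/hostname   # try not to co-locate replicas on the same node
      labelSelector:
        matchLabels:
          app: my-app                       # placeholder label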
The same document you referred to also has this:
How is Cluster Autoscaler different from CPU-usage-based node autoscalers? Cluster Autoscaler makes sure that all pods in the
cluster have a place to run, no matter if there is any CPU load or
not. Moreover, it tries to ensure that there are no unneeded nodes in
the cluster.
CPU-usage-based (or any metric-based) cluster/node group autoscalers
don't care about pods when scaling up and down. As a result, they may
add a node that will not have any pods, or remove a node that has some
system-critical pods on it, like kube-dns. Usage of these autoscalers
with Kubernetes is discouraged.
If I give some specific label to pods and define a ReplicaSet saying to include pods with the same labels, it includes those pods in it. That is all fine and good.
(I know pods are not meant to be created separately, but are supposed to be created by Deployments or ReplicaSets; but still, how do Deployments/ReplicaSets include pods whose labels match the definition, if those pods are already there for some reason?)
BUT, how does this work behind the scenes? How does the ReplicaSet know that a pod is to be included because it has the same label? Let's say I already have a pod with those labels; how does a newly created ReplicaSet know that this pod is to be included if it has fewer pods than the desired number?
Does it get that information from etcd? Or do pods expose their labels somehow? How does this really work behind the scenes?
As stated in the Kubernetes documentation regarding ReplicaSets:
A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire, a number of replicas indicating how many Pods it should be maintaining, and a pod template specifying the data of new Pods it should create to meet the number of replicas criteria. A ReplicaSet then fulfills its purpose by creating and deleting Pods as needed to reach the desired number. When a ReplicaSet needs to create new Pods, it uses its Pod template.
It's recommended to use Deployments instead of ReplicaSets.
Deployment is an object which can own ReplicaSets and update them and their Pods via declarative, server-side rolling updates. While ReplicaSets can be used independently, today they’re mainly used by Deployments as a mechanism to orchestrate Pod creation, deletion and updates. When you use Deployments you don’t have to worry about managing the ReplicaSets that they create. Deployments own and manage their ReplicaSets. As such, it is recommended to use Deployments when you want ReplicaSets.
As you mentioned, if you have a Pod with a label matching the ReplicaSet's selector, the ReplicaSet will take control of that Pod. If you deploy a ReplicaSet with 3 replicas and a Pod with the matching label was deployed before that, the RS will spawn only 2 new Pods. This is explained in detail, with examples, under Non-Template Pod acquisitions. A minimal sketch of that behaviour follows below.
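In this sketch (the names, labels, and image are made up), a bare Pod with the label already exists, so a ReplicaSet with replicas: 3 and a matching selector creates only 2 more Pods:

apiVersion: v1
kind: Pod
metadata:
  name: existing-pod
  labels:
    tier: frontend            # same label the ReplicaSet selects on
spec:
  containers:
  - name: app
    image: nginx
---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
spec:
  replicas: 3                 # the pre-existing pod counts, so only 2 new pods are created
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: app
        image: nginx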
As to how it works behind the scenes, you can have a look at slides #47-56 of Kubernetes Architecture - beyond a black box - Part 1