I want to access the total number of replicas and the current replica's index for a given pod, from inside the pod itself.
For example, if there are 3 replicas of a given pod, say foo_A, foo_B and foo_C, created in that specific order, is it possible to make the total number of replicas and the pod's index within the replica set available inside the pod?
I also understand that, with old pods being killed and new ones coming up, a pod's index within the replica set can change dynamically.
I know this can be achieved using the Downward API, but which fields should I access?
Could anyone please help?
As mentioned in the comments, you can use StatefulSets:
StatefulSet Pods have a unique identity that is comprised of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled on.
As you can see here, your pods will be created in an ordinal sequence:
For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
Before a Pod is terminated, all of its successors must be completely shutdown.
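The ordinal index is baked into the pod name (foo-0, foo-1, foo-2, ...), and the pod name can be exposed to the container through the Downward API's metadata.name field. The total replica count is not exposed by the Downward API, so you have to pass it in yourself (or query the API server). A minimal sketch, with made-up names and image:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: foo
spec:
  serviceName: foo                # a StatefulSet needs a (usually headless) Service
  replicas: 3
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
    spec:
      containers:
        - name: app
          image: myorg/foo:1.0    # hypothetical image
          env:
            - name: POD_NAME      # resolves to foo-0, foo-1, foo-2
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: REPLICA_COUNT # not available via the Downward API; pass it explicitly
              value: "3"

The application can then read POD_NAME and parse the trailing -<index> to get its ordinal.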
I was wondering what added value the PodDisruptionBudget gives.
As far as I understand, a PodDisruptionBudget promises that a certain number of pods will always remain available in the cluster, with 2 options to decide how: minAvailable / maxUnavailable.
Now, when I define a ReplicaSet, I define how many replicas I want. So if, for example, I define 2, there won't be fewer than 2 replicas. Then what does the PodDisruptionBudget give me?
A PodDisruptionBudget helps ensure zero downtime for an application, which a ReplicaSet alone can't guarantee.
The following post explains with an example how a PodDisruptionBudget can be useful in achieving zero downtime for an application:
Quoting the post, a node upgrade is a normal scenario, as described in:
Let’s consider a scenario: we need to upgrade the node version or update the spec often. Cluster downscaling is also a normal condition. In these cases, the pods running on the to-be-deleted nodes need to be drained.
kubectl drain is performed on one of the nodes for the upgrade:
We need to remove node1 from the pool, which we cannot do by detaching it instantly, as that will lead to termination of all the pods running there, which can bring services down. The first step before detaching a node is to make it unschedulable.
Running kubectl get pods -w will show the pods running on the node entering the Terminating state, which leads to downtime:
If you quickly check the pods with kubectl get pods, you will see that all the running pods which were scheduled on node1 are terminated instantly. This could lead to downtime! If you are running a small number of pods and all of them are scheduled on the same node, it will take some time for the pods to be scheduled on another node.
A PodDisruptionBudget with minAvailable is useful in such scenarios to achieve zero downtime. The ReplicaSet will only ensure that the desired number of replicas is created on other nodes during the process.
If you just have a ReplicaSet with one replica and no PodDisruptionBudget specified, the pod will be terminated and a new pod will be created on another node. This is where PDBs provide the added advantage over the ReplicaSet.
For the PodDisruptionBudget to work, there must be at least 2 pods running for a label selector; otherwise, the node cannot be drained gracefully and the pods will be evicted forcefully when the grace period ends.
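As a rough sketch of such a budget (the name and labels are placeholders; use policy/v1 on Kubernetes 1.21+, policy/v1beta1 on older clusters):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1             # keep at least 1 pod up during voluntary disruptions such as a drain
  selector:
    matchLabels:
      app: my-app             # must match the pod labels of your Deployment / ReplicaSet

kubectl drain evicts pods through the eviction API, so an eviction is refused for as long as it would violate this budget.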
Then what does the PodDisruptionBudget give me?
It is useful if you have an application where you want high availability, e.g. one where it may take time to rebuild a cache after each crash.
There are both voluntary and involuntary disruptions. A PodDisruptionBudget can limit the former, but both count against the budget.
An example of a voluntary disruption is when a member of your platform team decides to upgrade the kernel on all your nodes; sometimes you want to do this slowly, since all Pods on a node will be terminated and rescheduled to a different node.
There are also involuntary disruptions, e.g. a disk crash on one of your nodes.
So if, for example, I define 2, there won't be fewer than 2 replicas. Then what does the PodDisruptionBudget give me?
That guarantee is what minAvailable: 2 gives you. Also, maxAvailable is the wrong name; the field is maxUnavailable.
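A sketch of that for the 2-replica example (name and label are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: two-replicas-pdb
spec:
  minAvailable: 2             # with 2 replicas, equivalent to maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app             # must match the pods of the ReplicaSet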
In a project, I'm enabling the Cluster Autoscaler functionality in Kubernetes.
According to the documentation (How does scale-down work?), I understand that when a node's utilization has been below 50% of its capacity for a given time, it is removed, together with all of its pods, which will be recreated on a different node if needed.
But the following problem can happen: what if all the pods belonging to a specific deployment are running on a node that is being removed? That would mean users might experience downtime for that deployment's application.
Is there a way to prevent scale-down from deleting a node whenever there is a deployment whose pods all run only on that node?
I have checked the documentation, and one possible (but not good) solution is to add an annotation to all of the pods running these applications; however, this clearly would not downscale the cluster in an optimal way.
In the same documentation:
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
What is Eviction?:
The eviction subresource of a pod can be thought of as a kind of policy-controlled DELETE operation on the pod itself.
Ok, but what if all pods get evicted at the same time on the node?
You can use a Pod Disruption Budget to make sure a minimum number of replicas is always working:
What is a PDB?:
A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
In k8s docs you can also read:
A PodDisruptionBudget has three fields:
A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
So if you use a PDB for your deployment, its pods should not all get deleted at once.
But please note that if the node fails for some other reason (e.g. hardware failure), you will still experience downtime. If you really care about high availability, consider using pod anti-affinity to make sure the pods are not all scheduled on one node.
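A hedged sketch of such an anti-affinity rule, as a fragment of a Deployment's pod template (the label is a placeholder; kubernetes.io/hostname is the usual topology key for "one pod per node"):

spec:
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app                      # do not co-locate pods carrying this label
              topologyKey: kubernetes.io/hostname  # i.e. at most one such pod per node

If a hard rule is too strict (e.g. fewer nodes than replicas), preferredDuringSchedulingIgnoredDuringExecution expresses the same intent as a soft preference.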
The same document you referred to has this:
How is Cluster Autoscaler different from CPU-usage-based node autoscalers? Cluster Autoscaler makes sure that all pods in the
cluster have a place to run, no matter if there is any CPU load or
not. Moreover, it tries to ensure that there are no unneeded nodes in
the cluster.
CPU-usage-based (or any metric-based) cluster/node group autoscalers
don't care about pods when scaling up and down. As a result, they may
add a node that will not have any pods, or remove a node that has some
system-critical pods on it, like kube-dns. Usage of these autoscalers
with Kubernetes is discouraged.
I am trying to achieve the exactly-once delivery on Kafka using Spring-Kafka on Kubernetes.
As far as I understood, the transactional-ID must be set on the producer and it should be the same across restarts, as stated here https://stackoverflow.com/a/52304789/3467733.
The problem arises using this semantic on Kubernetes. How can you get a consistent ID?
To solve this problem I implemented a Spring Boot application, let's call it "Replicas counter", that checks, through the Kubernetes API, how many pods there are with the same name as the caller, so that I have a counter for every pod replica.
For example, suppose I want to deploy a Pod, let's call it APP-1.
This app does the following:
It performs a GET to the Replicas-Counter, passing the pod name as a parameter.
The Replicas-Counter calls the Kubernetes API to check how many pods there are with that pod name, adds 1, and returns the result to the caller. I also need to count not-ready pods (think about a first deploy: they couldn't get an ID if I weren't counting not-ready pods).
APP-1 gets the ID and will use it as the transactional-id.
But, as you can see, a problem can arise when performing rolling updates, for example:
Suppose we have 3 pods:
At the beginning we have:
app-1: transactional-id-1
app-2: transactional-id-2
app-3: transactional-id-3
So, during a rolling update we would have:
old-app-1: transactional-id-1
old-app-2: transactional-id-2
old-app-3: transactional-id-3
new-app-3: transactional-id-4 (Not ready, waiting to be ready)
new-app-3 becomes ready, so Kubernetes brings down old-app-3. Time to continue the rolling update:
old-app-1: transactional-id-1
old-app-2: transactional-id-2
new-app-3: transactional-id-4
new-app-2: transactional-id-4 (Not ready, waiting to be ready)
As you can see, I now have 2 pods with the same transactional-id.
As far as I understand, these IDs have to be unique and remain the same across restarts.
How can I implement something that gives me consistent IDs? Has anyone dealt with this problem?
The problem with these IDs arises only for Kubernetes Deployments, not for StatefulSets, as those have a stable identifier as the name. I don't want to convert all Deployments to StatefulSets to solve this problem, as I don't think that is the correct way to handle this scenario.
The only way to guarantee the uniqueness of Pods is to use a StatefulSet.
A StatefulSet will keep the desired number of replicas alive, but every time a pod dies it will be replaced with the same hostname and configuration. That gives you the stable identity that is required here.
The Service for a StatefulSet must be headless, because each pod is unique and you may need certain traffic to reach certain pods.
Every pod requires a PVC (in order to store data and be recreated from that data whenever the pod is deleted).
Here is a great article describing why a StatefulSet should be used in a similar case.
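If you do go the StatefulSet route, a sketch of how the stable pod name could be fed into the producer's transactional-id (the image and names are placeholders, and the assumption is that the Spring Boot app picks up spring.kafka.producer.transaction-id-prefix from the corresponding environment variable):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app
spec:
  serviceName: app                   # headless Service, as mentioned above
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: myorg/app:1.0       # hypothetical image
          env:
            - name: POD_NAME         # stable across restarts: app-0, app-1, app-2
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            # assumption: relaxed binding maps this to spring.kafka.producer.transaction-id-prefix
            - name: SPRING_KAFKA_PRODUCER_TRANSACTION_ID_PREFIX
              value: "$(POD_NAME)-"  # e.g. app-0-, app-1-, app-2-

Because the ordinal survives pod restarts, each replica keeps the same transactional-id prefix across rolling updates.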
I have deployed Hashicorp's Vault in my Kubernetes clusters on AWS using the Helm chart for it.
The number of replicas in the deployment is specified as 3.
Out of these 3 pods, 1 was ready (1/1) while the other two replica pods were not ready (0/1).
I killed the ready pod, and while I expected Kubernetes to deploy one new pod to replace it, it deployed two new pods.
Now I have two ready pods and two not-ready pods. On deleting one of these pods, Kubernetes recreates only one pod, so I have 4 instead of 3 pods for my Vault deployment.
What could be the reason behind this and how can we prevent this?
Your deployment is not working because HA (high availability) is not available when using the s3 storage backend. You’ll need Hashicorp’s Consul or AWS’s DynamoDB, or a different backend provider for that. Change the number of replicas to 1 if you're sticking with the s3 backend provider.
As for why you're seeing 4 pods instead of 3, you need to provide more details. Paste the output of kubectl get pods -l app=vault as well as kubectl describe deploy -l app=vault and I will update this answer.
I can only offer speculation, for what it's worth. With Deployment objects there's a maxSurge property that allows rolling updates to scale up beyond the desired number of replicas. It defaults to 25%, rounded up, which in your case would be 1 additional pod.
Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that
specifies the maximum number of Pods that can be created over the
desired number of Pods. The value can be an absolute number (for
example, 5) or a percentage of desired Pods (for example, 10%). The
value cannot be 0 if MaxUnavailable is 0. The absolute number is
calculated from the percentage by rounding up. The default value is
25%.
For example, when this value is set to 30%, the new ReplicaSet can be
scaled up immediately when the rolling update starts, such that the
total number of old and new Pods does not exceed 130% of desired Pods.
Once old Pods have been killed, the new ReplicaSet can be scaled up
further, ensuring that the total number of Pods running at any time
during the update is at most 130% of desired Pods.
It's possible that deleting the one Running (1/1) pod, along with the NotReady state of the other pods, put your Deployment into a state of "rolling update" or something along those lines, which allowed your deployment to scale up to its maxSurge setting.
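For reference, a sketch of how the surge behaviour can be pinned down in the Deployment spec (values are illustrative):

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0          # never create more pods than the desired count
      maxUnavailable: 1    # must be non-zero when maxSurge is 0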
When you have such problems, you should run
kubectl describe pod <PROBLEMATIC_POD>
and look at the Events in the lower part of the output.
Some reasons for your pods not starting may be:
no available nodes with enough resources for your requests
no available volumes
some anti-affinity rules and not enough nodes, so the scheduler cannot assign a node to your pods.
New to Kubernetes. Can I use kubectl scale --replicas=N and start pods on different nodes?
By default the scheduler attempts to spread pods across nodes, so that you don't have multiple pods of the same type on the same node. So there's nothing special required if you're just aiming for best-effort pod spreading.
If you want to express the requirement that the pod must not run on a node that already has a pod of that type on it you can use pod anti-affinity, which is currently an Alpha feature.
If you want to ensure that all nodes (or all nodes matching a certain selector) have that pod on them you can use a DaemonSet.
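A minimal DaemonSet sketch for that last case (name and image are placeholders); it runs exactly one copy of the pod on every matching node:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-agent
spec:
  selector:
    matchLabels:
      app: my-agent
  template:
    metadata:
      labels:
        app: my-agent
    spec:
      # nodeSelector:          # optionally restrict to a subset of nodes
      #   role: worker
      containers:
        - name: agent
          image: myorg/agent:1.0   # hypothetical image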
Scaling a Deployment (or RC) tells the controller-manager to create more pods; new pods are then subject to scheduling. The Kubernetes scheduler will attempt to find the most reasonable placement for your pods. This does not guarantee that pods will launch on different nodes, but it makes that a rather likely scenario if you have the required resources. Unfortunately, it also means that if all pods can fit on one node, there are situations where the scheduler might actually do just that (i.e. if all other nodes are in an unschedulable state for some reason). If that happens, the pods will not be rescheduled when conditions change.
To have a solid guarantee that pods will not get colocated on the same node, you have two options:
legacy hack: define a hostPort in your pod template (see the sketch below). As a given host port can be assigned only once per node, your pods will never exist more than once per node.
alpha feature: you can look into Pod AntiAffinity; it is quite early and not really battle-proven yet.
The first one has a disadvantage: you can never have more than one pod of this type per node, so it affects, for example, rolling deployments and limits your capacity for scaling (you can never have more active pods than nodes).
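A sketch of the hostPort hack (names, image and port are placeholders); because a host port can be bound only once per node, two such pods can never be scheduled onto the same node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: myorg/app:1.0   # hypothetical image
          ports:
            - containerPort: 8080
              hostPort: 8080     # only one pod per node can claim this host port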