Readiness probe for statefulset, not individual pod/container - kubernetes

I have been reading about liveness and readiness probes in kubernetes and I would like to use them to check and see if a cluster has come alive.
The question is how to configure a readiness probe for an entire statefulset, and not an individual pod/container.
A simple HTTP check can be used to determine readiness, but the issue I'm running into is that the readiness probe seems to apply to the container/pod and not to the set itself.
For the software I'm using, the HTTP endpoint doesn't come up until the cluster forms, meaning each individual pod would fail the readiness probe until all three are up and find one another.
The behavior I'm seeing in Kubernetes right now is that the first of 3 replicas is created, and Kubernetes does not even attempt to create replicas 2 and 3 until the first passes the readiness probe, which never happens, because all three have to be up for it to have a chance to pass.

You need to change .spec.podManagementPolicy for the StatefulSet from the default OrderedReady policy to Parallel.
This way K8S will start all your pods in parallel and won't wait for the probes.
From the documentation:
podManagementPolicy controls how pods are created during initial scale
up, when replacing pods on nodes, or when scaling down. The default
policy is OrderedReady, where pods are created in increasing order
(pod-0, then pod-1, etc) and the controller will wait until each pod
is ready before continuing. When scaling down, the pods are removed in
the opposite order. The alternative policy is Parallel which will
create pods in parallel to match the desired scale without waiting,
and on scale down will delete all pods at once.
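A minimal sketch of such a StatefulSet; the names, image and health endpoint below are placeholders, not from the question:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-cluster                  # placeholder name
spec:
  serviceName: my-cluster           # governing headless Service
  replicas: 3
  podManagementPolicy: Parallel     # start all replicas without waiting for readiness
  selector:
    matchLabels:
      app: my-cluster
  template:
    metadata:
      labels:
        app: my-cluster
    spec:
      containers:
        - name: node
          image: example/cluster-node:1.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health                 # assumed HTTP health endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5

With Parallel, all three pods are created at once, so they can find one another and the probes eventually pass.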

In addition to setting .spec.podManagementPolicy: Parallel on the StatefulSet, it may be necessary to set .spec.publishNotReadyAddresses: true on the governing headless Service to allow the StatefulSet pods to communicate with each other.
StatefulSet allows you to relax its ordering guarantees while preserving its uniqueness and identity guarantees via its .spec.podManagementPolicy field.
ref: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-management-policies
Also, a Pod needs to become ready in order to get a DNS record, unless publishNotReadyAddresses: true is set on the Service.
ref: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
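A sketch of the governing headless Service with that flag set; the name, label and port are placeholders and must match the StatefulSet above:

apiVersion: v1
kind: Service
metadata:
  name: my-cluster                  # must match the StatefulSet's .spec.serviceName
spec:
  clusterIP: None                   # headless service
  publishNotReadyAddresses: true    # publish DNS records even for not-yet-ready pods
  selector:
    app: my-cluster
  ports:
    - name: http
      port: 8080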

Related

Will k8s scale a pod within HPA range to evict it and meet disruption budget?

Excuse me for asking something that has much overlap with many specific questions about the same knowledge area. I am curious to know whether Kubernetes will scale a pod up in order to evict it.
Given are the following facts at the time of eviction:
The pod is running one instance.
The pod has an HPA controlling it, with the following params:
minCount: 1
maxCount: 2
It has a PDB with params:
minAvailable: 1
I would expect the k8s controller to have enough information to safely scale up to 2 instances to meet the PDB, and until recently I was assuming it would indeed do so.
Why am I asking this? (The question behind the question ;)
Well, we run into auto-upgrade problems on AKS because it won't evict pods as described above, and the Azure team told me to change the params. But if no scaling happens, this means we have to set minAvailable to 2, effectively increasing the pod count only for the sake of future evictions. I want to get to the bottom of this before I file a feature request with k8s or a bug with AKS.
I believe these two parts are independent; the pod disruption budget doesn't look at the autoscaling capability, or otherwise realize that a pod is running as part of a deployment that could be temporarily upscaled.
If you have a deployment with replicas: 1, and a corresponding PDB with minAvailable: 1, this will prevent the node the pod is running on from being taken out of service. (I see this behavior in the system I work on professionally, using a different Kubernetes environment.)
The way this works normally (see also the PodDisruptionBudget example in the Kubernetes documentation):
Some command like kubectl drain or the cluster autoscaler marks a node as going out of service.
The pods on that node are terminated.
The replication controller sees that some replica sets have too few pods, and creates new ones.
The new pods get scheduled on in-service nodes.
The pod disruption budget only affects the first part of this sequence; it would keep kubectl drain from actually draining a node until the disruption budget could be satisfied, or cause the cluster autoscaler to pick a different node. HPA isn't considered at all, nor is it considered that it's "normal" to run extra copies of a deployment-managed pod during upgrades. (That is, this is a very reasonable question, it just doesn't work that way right now.)
My default setup for most deployments tends to be to use 3 replicas and to have a pod disruption budget requiring at least 1 of them to be available. That definitely adds some cost to operating the service, but it makes you tolerant of an involuntary node failure and it does allow you to consciously rotate nodes out. For things that read from message queues (Kafka or RabbitMQ-based workers) it could make sense to run only 1 replica with no PDB since the worker will be able to tolerate an outage.
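As an illustration of that default setup (names and image are placeholders): a Deployment with 3 replicas plus a PDB requiring at least 1 of them to stay available:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:1.0    # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1                   # a drain may evict pods only while this holds
  selector:
    matchLabels:
      app: web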

Kubernetes - ReplicaSet vs PodDisruptionBudget

I was wondering what added value the PodDisruptionBudget gives.
As far as I understand, a PodDisruptionBudget promises that a certain number of pods will always remain available, and there are 2 options to decide how: minAvailable / maxUnavailable.
Now, when I define a ReplicaSet I define how many replicas I want. So if for example I define 2, there won't be less than 2 replicas. Then what does the PodDisruptionBudget give me?
A PodDisruptionBudget helps ensure zero downtime for an application during voluntary disruptions, which a ReplicaSet alone can't guarantee.
The following post explains, with an example, how a PodDisruptionBudget can be useful in achieving zero downtime for an application.
Quoting the post, a node upgrade is a normal scenario:
Let's consider a scenario: we need to upgrade the node version or update the spec often. Cluster downscaling is also a normal condition. In these cases, the pods running on the to-be-deleted nodes need to be drained.
kubectl drain is performed on one of the nodes for the upgrade:
We need to remove node1 from the pool, which we cannot do by detaching it instantly, as that will lead to termination of all the pods running there, which can bring services down. The first step before detaching a node is to make the node unschedulable.
Running kubectl get pods -w will show the pods running on the node entering the Terminating state, which leads to downtime:
If you quickly check the pods with kubectl get pods, you will see that all the running pods scheduled on node1 are terminated instantly. This could lead to downtime! If you are running a small number of pods and all of them are scheduled on the same node, it will take some time for the pods to be scheduled on another node.
A PodDisruptionBudget with minAvailable is useful in such scenarios to achieve zero downtime. The ReplicaSet will only ensure that the desired number of replicas is created on other nodes during the process.
If you just have a ReplicaSet with one replica and no PodDisruptionBudget specified, the pod will be terminated and a new pod will be created on another node. This is where PDBs provide the added advantage over the ReplicaSet.
For the PodDisruptionBudget to work, there must be at least 2 pods running for the label selector; otherwise the node cannot be drained gracefully and it will be evicted forcefully when the grace period ends.
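For illustration, a minimal PDB for the scenario above could look like this (the app label is a placeholder); with at least 2 matching pods running, kubectl drain can evict them one at a time while always keeping one available:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1                   # at least one pod must stay up during a drain
  selector:
    matchLabels:
      app: my-app                   # placeholder label; must match the pods' labels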
Then what does the PodDisruptionBudget give me?
It matters if you have an application where you want high availability, e.g. one where it may take time to rebuild a cache after each crash.
There are both voluntary and involuntary disruptions. A PodDisruptionBudget can only limit the former, but both count against the budget.
An example of a voluntary disruption is when an employee of your platform team decides to upgrade the kernel on all your nodes. Sometimes you want to do this slowly, since all Pods on the node will be terminated and rescheduled to a different node.
There are also involuntary disruptions, e.g. a disk crash on one of your nodes.
So if for example I define 2, there won't be less than 2 replicas. Then what does the PodDisruptionBudget give me?
It's 2 for minAvailable. And maxAvailable is a wrong name; the field is called maxUnavailable.

Can a pod run on multiple nodes?

I have one Kubernetes master and three Kubernetes nodes. I made one pod which is running on a specific node. I want to run that pod on 2 nodes. How can I achieve this? Does the replica concept help me? If yes, how?
Yes, you can assign pods to one or more nodes of your cluster, and here are some options to achieve this:
nodeSelector
nodeSelector is the simplest recommended form of node selection constraint. nodeSelector is a field of PodSpec. It specifies a map of key-value pairs. For the pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (it can have additional labels as well). The most common usage is one key-value pair.
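For example (the label and image below are just illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd                   # only nodes labeled disktype=ssd are eligible
  containers:
    - name: nginx
      image: nginx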
affinity and anti-affinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
nodeSelector provides a very simple way to constrain pods to nodes with particular labels. The affinity/anti-affinity feature greatly expands the types of constraints you can express. The key enhancements are:
the affinity/anti-affinity language is more expressive, offering more matching rules than just exact matches combined with a logical AND;
you can indicate that the rule is "soft"/"preference" rather than a hard requirement, so if the scheduler can't satisfy it, the pod will still be scheduled;
you can constrain against labels on other pods running on the node (or other topological domain), rather than against labels on the node itself, which allows rules about which pods can and cannot be co-located
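A small node-affinity sketch expressing the same kind of constraint as nodeSelector, but with operator-based matching (the label is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
  containers:
    - name: nginx
      image: nginx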
DaemonSet
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
Some typical uses of a DaemonSet are:
running a cluster storage daemon on every node
running a logs collection daemon on every node
running a node monitoring daemon on every node
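For instance, a minimal DaemonSet sketch for a per-node daemon (name and image are placeholders):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
        - name: agent
          image: example/node-agent:1.0   # placeholder image; one copy runs on every node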
Please check this link to read more about how to assign pods to nodes.
It's not a good practice to run pods directly on the nodes, as the nodes/pods can crash at any time. It's better to use the K8S controllers, as mentioned in the K8S documentation here.
K8S supports multiple controllers and, depending on the requirement, the appropriate controller can be used. By looking at the OP it's difficult to say which controller to use.
You can use a DaemonSet if you want to run a pod on each node.
From what I see, you are trying to deploy a pod on each node; it's better if you allow the scheduler to decide where the pods need to be deployed based on the available resources.
That works best in the worst-case scenarios, I mean in case of node failures.

Kubernetes Autoscaler: no downtime for deployments when downscaling is possible?

In a project, I'm enabling the cluster autoscaler functionality from Kubernetes.
According to the documentation: How does scale down work, I understand that when a node has been using less than 50% of its capacity for a given time, it is removed together with all of its pods, which will be recreated on a different node if needed.
But the following problem can happen: what if all the pods related to a specific deployment are contained in a node that is being removed? That would mean users might experience downtime for the application of this deployment.
Is there a way to avoid that the scale down deletes a node whenever there is a deployment which only contains pods running on that node?
I have checked the documentation, and one possible (but not good) solution is to add an annotation to all of the pods containing applications here, but this clearly would not downscale the cluster in an optimal way.
In the same documentation:
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
What is eviction?
The eviction subresource of a pod can be thought of as a kind of policy-controlled DELETE operation on the pod itself.
Ok, but what if all pods get evicted at the same time on the node?
You can use a PodDisruptionBudget to make sure a minimum number of replicas are always working.
What is a PDB?
A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
In k8s docs you can also read:
A PodDisruptionBudget has three fields:
A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
So if you use a PDB for your deployment, its pods should not all get deleted at once.
But please note that if the node fails for some other reason (e.g. hardware failure), you will still experience downtime. If you really care about high availability, consider using pod anti-affinity to make sure the pods are not all scheduled on one node.
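A sketch of such pod anti-affinity, assuming a placeholder app label; the required rule on kubernetes.io/hostname forces the replicas onto different nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname   # at most one of these pods per node
      containers:
        - name: web
          image: nginx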
The same document you referred to has this:
How is Cluster Autoscaler different from CPU-usage-based node autoscalers? Cluster Autoscaler makes sure that all pods in the
cluster have a place to run, no matter if there is any CPU load or
not. Moreover, it tries to ensure that there are no unneeded nodes in
the cluster.
CPU-usage-based (or any metric-based) cluster/node group autoscalers
don't care about pods when scaling up and down. As a result, they may
add a node that will not have any pods, or remove a node that has some
system-critical pods on it, like kube-dns. Usage of these autoscalers
with Kubernetes is discouraged.

How do I schedule the same pod on different nodes using kubectl scale?

New to kubernetes. Can I use kubectl scale --replicas=N and start pods on different nodes?
By default the scheduler attempts to spread pods across nodes, so that you don't have multiple pods of the same type on the same node. So there's nothing special required if you're just aiming for best-effort pod spreading.
If you want to express the requirement that the pod must not run on a node that already has a pod of that type on it you can use pod anti-affinity, which is currently an Alpha feature.
If you want to ensure that all nodes (or all nodes matching a certain selector) have that pod on them you can use a DaemonSet.
Scaling a Deployment (or RC) tells the controller-manager to create more pods; new pods are then subject to scheduling. The K8S scheduler will attempt to find the most reasonable placement for your pods. This does not guarantee that pods will launch on different nodes, but it makes that a rather likely scenario, if you have the required resources. Unfortunately it also means that if all pods can fit on one node, there are situations where the scheduler might actually do just that (e.g. all other nodes being in an unschedulable state for some reason). If that happens, the pods will not be rescheduled when conditions change.
To have a solid guarantee that pods will not get co-located on the same node, you have two options:
legacy hack: define a hostPort in your pod template. As a given host port is a resource that can be assigned only once per node, your pods will never exist more than once per node (see the sketch below)
alpha feature: you can look into pod anti-affinity; it is quite early and not really battle-proven yet
The first one has a disadvantage: you can never have more than one pod of this type per node, so it affects e.g. rolling deployments and limits your capacity for scaling (you can never have more active pods than the number of nodes).
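A rough sketch of the legacy hostPort hack mentioned above (image and port numbers are arbitrary):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx
          ports:
            - containerPort: 80
              hostPort: 8080        # a host port can be bound only once per node,
                                    # so two of these pods can never share a node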