Implementation strategy for Kubernetes probes

I am struggling with defining a sane implementation strategy for Kubernetes probes for my product. Digging into the available guidance, both the official docs and field reports, I am not able to identify a common approach with consensus across the ecosystem; most probably that's expected.
Here's what I plan to implement as probe definition rules:
Readiness probe:
Enable it for services handling incoming network traffic.
Question: Any other use case where I should consider a readiness probe for a service?
Liveness probe:
This is the most difficult one for me. What I have in mind as a rule is: don't define it by default, and only enable it manually for services where scenarios such as deadlocks are detected, and only until those bugs are fixed.
I don't see it as a healthy approach to simply assume that a service will deadlock and let the liveness probe handle it: first, because it is very hard to identify a service deadlock from the probe, and second, because it would leave bugs unaddressed.
Question: Any other use case where I should consider a liveness probe?
Startup probe:
Enable it only when there is a liveness probe enabled on that service.
Question: Without a liveness probe defined, is there any advantage in defining a startup probe?
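For illustration, for a typical traffic-serving service these rules boil down to a readiness probe only, with no liveness or startup probe by default (path and port are placeholders for whatever the service exposes):
readinessProbe:
  httpGet:
    path: /ready          # placeholder endpoint
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
# no livenessProbe and no startupProbe unless a concrete need (e.g. a known deadlock) shows up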

Question: Without a liveness probe defined, is there any advantage in defining a startup probe?
I am not seeing any advantage in defining just a startup probe without a liveness probe. The startup probe is there to protect slow-starting Pods.
Question: Any other use case where I should consider a liveness probe?
If the service depends on a DB or any other service, and the backend service code has a function or endpoint to check that dependency, you can use it as a liveness probe. You are right about the risk of leaving the bug unaddressed.
The selling point of keeping the liveness probe is that it will automatically restart the failing container, while the readiness probe only flips the Pod's ready status (e.g. from 1/1 to 0/1) so it stops receiving traffic.
If you don't specify a liveness probe, the kubelet just decides the container's status based on whether its main process (PID 1) is still running.
If you are using bash/sh or dumb-init as the entrypoint in your Docker image, PID 1 is only that wrapper and the actual application runs as a child process; tracking PID 1 alone is then not enough, and a liveness probe would be required.
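For example, a liveness probe that checks the application itself rather than the wrapper process could look like this (the /healthz endpoint and port are assumptions; use whatever your service actually exposes):
livenessProbe:
  httpGet:
    path: /healthz        # assumed endpoint served by the application, not by the sh/dumb-init wrapper
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3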

Related

Kubernetes Probes - What is the order in which they examine the pod?

I'm looking to understand the order in which Kubernetes examines the pods using the 3 types of probes: startup, readiness and liveness.
How should I understand or design these 3 probes correctly for normal applications? What is the chance of a conflict, or of breaking the application, if the startup probe has wrong entries?
Startup probe
This runs first. When it succeeds, the Readiness Probe and Liveness Probe are run continuously. If this fails, the container is killed.
Use this for "slow starting apps"; you can use the same command as the liveness probe if you want.
The kubelet uses startup probes to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with the application startup. This can be used to adopt liveness checks on slow starting containers, avoiding them getting killed by the kubelet before they are up and running.
From configuring probes
Liveness probe
This is used to kill the container, in case of a deadlock in the application.
Readiness probe
This is used to check that the container can receive traffic.
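Putting the three together: the startup probe runs first and, only once it succeeds, the liveness and readiness probes begin running periodically. A sketch of what this could look like (the /healthz and /ready endpoints, port and timings are placeholders, not recommendations):
startupProbe:             # runs first; liveness and readiness are disabled until it succeeds
  httpGet:
    path: /healthz        # assumed health endpoint
    port: 8080
  failureThreshold: 30    # allow up to 30 x 10s = 300s for a slow start
  periodSeconds: 10
livenessProbe:            # restarts the container if it stops responding
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
readinessProbe:           # removes the pod from Service endpoints while it is failing
  httpGet:
    path: /ready          # assumed readiness endpoint
    port: 8080
  periodSeconds: 5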

Should I do liveness probe and readiness probe every second?

In my K8S workloads, I implement Readiness probe and Liveness probe for pods health check.
I'm wondering whether I should set the interval (periodSeconds) as low as 1 second; it will consume more resources, right?
Is there best practices when doing the pod health check?
Firstly, it is important to understand the difference between Liveness and Readiness. The tl;dr is: Liveness is about whether K8s should kill and restart the container, Readiness is about whether the container is able to accept requests. It is likely that you want different parameters for both.
Whether K8s takes any action based on the outcome of the probe depends on the failureThreshold. This is the number of times in a row the probe has to fail before K8s does something. If you combine this with periodSeconds you can tune the sensitivity of your probes.
In general you want to balance:
the time it takes K8s to take action with how quickly your service can be expected to recover based on the probe
the "cost" of the probes. For example if your Readiness probe connects to a database, then you are adding 1 Query Per Second (QPS) load to your database per replica (With 100 replicas, you would be generating 100QPS just through probes!)
the reliability of your probe, also known as "flakiness". What is the false negative rate - i.e. what proportion of the time the probe reports failure while the service is actually running within expected performance rates
Here is one way of thinking about it:
Work out how long your service can be in the failed state before K8s should take action. This should be based on how long it would take to recover (e.g. restart in the case of Liveness)
If a probe is "expensive", have a longer periodSeconds and smaller failureThreshold
If a probe is "flaky" (i.e. occasionally reports failed and then reports working very quickly afterwards) have a shorter periodSeconds and larger failureThreashold.

some problem about initContainer for kubernetes

I have two containers, say A and B, where A should run before B. But A is a server application, whose final state is Running rather than Completed, so I wonder: will B then never be executed? How can I deal with this?
If A and B are part of the same pod, then initContainer is the legacy way to establish ordering.
From the Kubernetes Pod lifecycle, I suppose you mean "Running", but not "Terminated".
A pod liveness/readiness probe is a better fit in your case, since the server will not accept requests until it is ready.
Read "Straight to the Point: Kubernetes Probes" from Peter Malina
Both readiness and liveness probes run in parallel throughout the life of a container.
Use the liveness probe to detect an internal failure and restart the container (e.g. HTTP server down).
Use the readiness probe to detect if you can serve traffic (e.g. established DB connection) and wait (not restart) for the container.
A dead container is also not a ready container.
To serve traffic, all containers within a pod must be ready.
You can add a pod readiness gate (stable from 1.14) to specify additional conditions to be evaluated for Pod readiness.
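For instance, if A is the server container, a readiness probe such as the following (the port is an assumption) keeps the pod out of the Service endpoints until A actually accepts connections, without having to order the containers:
readinessProbe:
  tcpSocket:
    port: 8080            # assumed port the server container A listens on
  initialDelaySeconds: 5
  periodSeconds: 10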
Read also "Kubernetes Liveness and Readiness Probes: How to Avoid Shooting Yourself in the Foot" from Colin Breck
"Should Health Checks call other App Health Checks" compares that approach with the InitContainer approach

How do I make Kubernetes evict a pod that has not been ready for a period of time?

I have readiness probes configured on several pods (which are members of deployment-managed replica sets). They work as expected -- readiness is required as part of the deployment's rollout strategy, and if a healthy pod becomes NotReady, the associated Service will remove it from the pool of endpoints until it becomes Ready again.
Furthermore, I have external health checking (using Sensu) that alerts me when a pod becomes NotReady.
Sometimes, a pod will report NotReady for an extended period of time, showing no sign of recovery. I would like to configure things such that, if a pod stays in NotReady for an extended period of time, it gets evicted from the node and rescheduled elsewhere. I'll settle for a mechanism that simply kills the container (leading it to be restarted in the same pod), but what I really want is for the pod to be evicted and rescheduled.
I can't seem to find anything that does this. Does it exist? Most of my searching turns up things about evicting pods from NotReady nodes, which is not what I'm looking for at all.
If this doesn't exist, why? Is there some other mechanism I should be using to accomplish something equivalent?
EDIT: I should have specified that I also have liveness probes configured and working the way I want. In the scenario I’m talking about, the pods are “alive.” My liveness probe is configured to detect more severe failures and restart accordingly and is working as desired.
I’m really looking for the ability to evict based on a pod being live but not ready for an extended period of time.
I guess what I really want is the ability to configure an arbitrary number of probes, each with different expectations it checks for, and each with different actions it will take if a certain failure threshold is reached. As it is now, liveness failures have only one method of recourse (restart the container), and readiness failures also have just one (just wait). I want to be able to configure any action.
As of Kubernetes v1.15, you might want to use a combination of readiness probe and liveness probe to achieve the outcome that you want. See configure liveness and readiness probes.
A new feature to start the liveness probe after the pod is ready is likely to be introduced in v1.16. There will be a new probe called startupProbe that can handle this in a more intuitive manner.
You can configure an HTTP liveness probe or a TCP liveness probe with periodSeconds, depending on the type of the container image.
livenessProbe:
  .....                   # probe handler (httpGet or tcpSocket) goes here
  initialDelaySeconds: 3
  periodSeconds: 5        # this field specifies that the kubelet should perform the liveness probe every 5 seconds
For that purpose you may try to use Prometheus metrics and create an alert, as sketched below. Based on that you can configure a webhook in Alertmanager which will react appropriately (action: kill the Pod), and the Pod will then be recreated by the scheduler.
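A sketch of such a rule, assuming kube-state-metrics is installed (alert name, duration and labels are placeholders):
groups:
- name: pod-readiness
  rules:
  - alert: PodNotReadyTooLong
    expr: kube_pod_status_ready{condition="false"} == 1
    for: 15m                    # how long a pod may stay NotReady before the alert fires
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been NotReady for more than 15 minutes"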

How to leverage Kubernetes readinessProbe to self heal the pod without restarting it?

According to this documentation, I see that readinessProbe can be used to temporarily halt requests to a pod without having to restart it in order to recover gracefully.
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes
When I see pod events it looks like the pod is restarted upon Readiness probe failure. Here are the events:
1. Readiness probe failed
2. Created container
3. Started container
4. Killing container with id {}
I tried modifying the container restartPolicy to OnFailure, hoping this configuration determines the pod's action upon readinessProbe failure, but I see the following error:
The Deployment {} is invalid: spec.template.spec.restartPolicy: Unsupported value: "OnFailure": supported values: "Always"
Which is the right way to stop requests to a pod without having to restart it and letting the application gracefully recover?
There are two types of probes.
Restarts happen due to failing liveness probes.
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
liveness probe
The kubelet uses liveness probes to know when to restart a Container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs.
readiness probe
Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don’t want to kill the application, but you don’t want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.
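So, to pause traffic to a pod without restarting it, define only a readiness probe and no liveness probe. A minimal sketch, with the endpoint and port as assumptions:
readinessProbe:
  httpGet:
    path: /ready          # assumed endpoint reporting whether the app can currently serve traffic
    port: 8080
  periodSeconds: 5
  failureThreshold: 3     # after 3 consecutive failures the pod is marked NotReady; it is not restarted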
Today I found a very good essay about probes https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/