how to quickly fail the Kubernetes Readiness probe? - kubernetes

If a pod goes down in my cluster, it takes around 15 seconds or more for the readiness probe logic to detect the failure. This is not acceptable because calls fail in the meantime: Kubernetes has not yet identified the pod failure, so it keeps sending traffic to the failed pod (that is, the failed pod is still in the ClusterIP service endpoints).
Please suggest how to fail the readiness probe immediately, or how to remove the endpoint immediately in case of failure, without reducing periodSeconds much below 5 seconds.
Below is my configuration:
initialDelaySeconds:90s
periodSeconds:5s
timeoutSeconds:2s
successThreshold:<default>
failureThreshold:<default>
Thanking in advance.

What you can do is adjust your probe's configuration in order to meet your requirements:
Probes have a number of fields that you can use to more precisely
control the behavior of liveness and readiness checks:
initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated. Defaults to
0 seconds. Minimum value is 0.
periodSeconds: How often (in seconds) to perform the probe. Default to 10 seconds. Minimum value is 1.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1
for liveness. Minimum value is 1.
failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of liveness
probe means restarting the container. In case of readiness probe the
Pod will be marked Unready. Defaults to 3. Minimum value is 1.
You haven't specified failureThreshold, so it defaults to 3. With the values you are currently using, it takes ~15-20 seconds for the Pod to be considered failed. Note that a failing readiness probe marks the Pod as not ready and removes it from the Service endpoints; it does not restart the container.
If you set the minimal values for periodSeconds, timeoutSeconds, successThreshold and failureThreshold, you can expect more frequent checks and faster removal of the failed Pod from the endpoints.
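As a rough illustration of where those numbers come from, here is a minimal sketch that estimates the detection window from the probe fields. The function name and the timing model are assumptions made for this example (probes fire exactly every periodSeconds, and a hung probe fails after timeoutSeconds); the real kubelet adds jitter, and endpoint propagation takes extra time on top.

```python
def readiness_detection_window(period_seconds, timeout_seconds, failure_threshold):
    """Return (best_case, worst_case) seconds until a Pod is marked not ready.

    Simplified model, not the kubelet's actual scheduling logic.
    """
    # Best case: the app fails just before a probe fires and every
    # failing probe returns immediately.
    best = (failure_threshold - 1) * period_seconds
    # Worst case: the app fails right after a successful probe and every
    # failing probe runs into the timeout.
    worst = failure_threshold * period_seconds + timeout_seconds
    return best, worst


# Values from the question: periodSeconds=5, timeoutSeconds=2, failureThreshold=3 (default).
current = readiness_detection_window(period_seconds=5, timeout_seconds=2, failure_threshold=3)
# Keeping periodSeconds at 5 but failing on the first missed probe.
tuned = readiness_detection_window(period_seconds=5, timeout_seconds=1, failure_threshold=1)

print(current)  # (10, 17)
print(tuned)    # (0, 6)
```

Under this model, lowering failureThreshold shortens the window the most without touching periodSeconds, at the cost of reacting to a single transient failure.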

Related

In Kubernetes, why does readinessProbe have the option periodSeconds?

I understand what a readinessProbe does, but I don't see why it should have a periodSeconds. Once it's determined that the pod is ready, it should stop checking. Wouldn't checking periodically then be up to the livenessProbe? Or am I missing something?
ReadinessProbe and livenessProbe serve different purposes.
ReadinessProbe checks if a service is ready to serve requests. If the readinessProbe fails, the container will be taken out of service for as long as the probe fails. If the readinessProbe reports up again, the container will be taken back into service again and receive requests.
In contrast, if the livenessProbe fails, it will be considered as not recoverable and the container will be terminated and restarted.
For both, periodSeconds makes sense. Even for the livenessProbe, when failure is considered only after X consecutive failed checks.
Readiness probes determine whether or not a container is ready to serve requests. If the readiness probe returns a failed state, then Kubernetes removes the IP address for the container from the endpoints of all Services.
We use readiness probes to instruct Kubernetes that a running container should not receive any traffic. This is useful when waiting for an application to perform time-consuming initial tasks, such as establishing network connections, loading files, and warming caches.
The readiness probe is configured in the spec.containers[].readinessProbe attribute of the pod configuration.
The periodSeconds field specifies how often, in seconds, the kubelet should perform the readiness probe, as set in the YAML. It defaults to 10 seconds, and the minimum value is 1.
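To make it concrete why the readiness probe keeps running after the Pod first becomes ready, the sketch below replays a sequence of periodic probe results through the threshold logic. The function name and the starting state (a Pod that is already ready) are assumptions for illustration; this is a model of the documented thresholds, not kubelet code.

```python
def replay_readiness(results, failure_threshold=3, success_threshold=1, ready=True):
    """Replay probe results (True = success) and return the ready state after each one."""
    states = []
    successes = failures = 0
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            # A Pod that was not ready comes back after enough consecutive successes.
            if not ready and successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0
            # A ready Pod is taken out of service after enough consecutive failures.
            if ready and failures >= failure_threshold:
                ready = False
        states.append(ready)
    return states


# One probe result per period: healthy, three failures in a row, then recovery.
timeline = replay_readiness([True, False, False, False, True])
print(timeline)  # [True, True, True, False, True]
```

The Pod leaves service on the third consecutive failure and returns on the next success, which is only possible because the probe is repeated every periodSeconds.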

Kubernetes Update Policy Wait Until Actually Ready

I have a K8 cluster running with a deployment which has an update policy of RollingUpdate. How do I get Kubernetes to wait an extra amount of seconds or until some condition is met before marking a container as ready after starting?
An example would be having an API server with no downtime when deploying an update. But after the container starts it still needs X amount of seconds before it is ready to start serving HTTP requests. If it marks it as ready immediately once the container starts and the API server isn't actually ready there will be some HTTP requests that will fail for a brief time window.
Posting @David Maze's comment as community wiki for better visibility:
You need a readiness probe; the pod won't show as "ready" (and the deployment won't proceed) until the probe passes.
Example:
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
initialDelaySeconds: Number of seconds after the container has
started before liveness or readiness probes are initiated. Defaults
to 0 seconds. Minimum value is 0.
periodSeconds: How often (in seconds) to perform the probe. Default
to 10 seconds. Minimum value is 1.
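On the application side, one way to cooperate with an exec probe like the one above is to create the marker file only once start-up work has finished. The sketch below is an assumption about how an app might do that; the helper names are hypothetical, and a temporary directory stands in for /tmp so the example can run anywhere.

```python
from pathlib import Path
import tempfile


def mark_ready(marker: Path) -> None:
    """Create the marker file once start-up work (connections, caches) is done."""
    marker.touch()


def mark_not_ready(marker: Path) -> None:
    """Remove the marker file to take the container out of service."""
    marker.unlink(missing_ok=True)


def exec_probe_would_pass(marker: Path) -> bool:
    """Approximates `cat <marker>`, which exits 0 only if the file can be read."""
    return marker.is_file()


with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "healthy"  # stands in for /tmp/healthy
    before = exec_probe_would_pass(marker)
    mark_ready(marker)
    after = exec_probe_would_pass(marker)
    mark_not_ready(marker)
    removed = exec_probe_would_pass(marker)

print(before, after, removed)  # False True False
```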

I am using 8888 for kubernetes health probes and 8887 for normal HTTP requests. So if readiness probe fails, should i still expect traffic on 8887?

I am using 8888 for liveness & readiness probes and 8887 for normal HTTP requests. The readiness probe is failing and the pods are in the 0/1, not ready state, but I still see normal POST requests being served by the pod. Is this expected? Should health probes and normal requests be received on the same port?
Liveness and readiness probes have different purposes. In short, the liveness probe controls whether Kubernetes will restart the container, while the readiness probe controls whether a pod is included in the endpoints of a service. Unless a pod has indicated it's ready through the readiness probe, it should not receive traffic through a service. That doesn't mean it can't be sent requests; it just means it won't be sent traffic through the service. So in your case the question is: where are those POST requests coming from?
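The distinction can be pictured with a small model: readiness decides which Pod IPs a Service lists as endpoints, regardless of which port the probe itself uses. The class, names and IP addresses below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PodInfo:
    name: str
    ip: str
    ready: bool


def service_endpoints(pods):
    """Return the IPs a Service would route to: ready Pods only."""
    return [p.ip for p in pods if p.ready]


pods = [
    PodInfo("api-1", "10.0.0.11", ready=True),
    PodInfo("api-2", "10.0.0.12", ready=False),  # readiness probe on 8888 is failing
]

endpoints = service_endpoints(pods)
print(endpoints)  # ['10.0.0.11']
```

A client that calls 10.0.0.12:8887 directly, without going through the Service, bypasses this filter entirely, which is one possible source of the POST requests.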
@pst and @Harsh are right, but I would like to expand on it a bit.
As the official docs say:
If you'd like to start sending traffic to a Pod only when a probe
succeeds, specify a readiness probe. In this case, the readiness probe
might be the same as the liveness probe, but the existence of the
readiness probe in the spec means that the Pod will start without
receiving any traffic and only start receiving traffic after the
probe starts succeeding.
and:
The kubelet uses readiness probes to know when a container is ready to
start accepting traffic. A Pod is considered ready when all of its
containers are ready. One use of this signal is to control which Pods
are used as backends for Services. When a Pod is not ready, it is
removed from Service load balancers.
Answering your question:
So if readiness probe fails, should i still expect traffic on 8887?
No, the pod should not start receiving traffic if the readiness probe fails.
It can also depend on your app. By using a readiness probe, Kubernetes waits until the app is fully started before it allows the service to send traffic to the new copy.
Also, it is very important to make sure your probes are configured properly:
Probes have a number of fields that you can use to more precisely
control the behavior of liveness and readiness checks:
initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated. Defaults to
0 seconds. Minimum value is 0.
periodSeconds: How often (in seconds) to perform the probe. Default to 10 seconds. Minimum value is 1.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1
for liveness. Minimum value is 1.
failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of liveness
probe means restarting the container. In case of readiness probe the
Pod will be marked Unready. Defaults to 3. Minimum value is 1.
If you wish to expand your knowledge regarding the liveness, readiness and startup probes, please refer to the official docs. You will find some examples there that can be compared with your setup in order to see whether you have understood and configured it right.

k8s - livenessProbe vs readinessProbe

Consider a pod which has a healthcheck setup via a http endpoint /health at port 80 and it takes almost 60 seconds to be actually ready & serve the traffic.
readinessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 60
livenessProbe:
  httpGet:
    path: /health
    port: 80
Questions:
Is my above config correct for the given requirement?
Does liveness probe start working only after the pod becomes ready ? In other words, I assume readiness probe job is complete once the POD is ready. After that livenessProbe takes care of health check. In this case, I can ignore the initialDelaySeconds for livenessProbe. If they are independent, what is the point of doing livenessProbe check when the pod itself is not ready! ?
Check this documentation. What do they mean by
If you want your Container to be able to take itself down for
maintenance, you can specify a readiness probe that checks an endpoint
specific to readiness that is different from the liveness probe.
I was assuming, the running pod will take itself down only if the livenessProbe fails. not the readinessProbe. The doc says other way.
Clarify!
I'm starting from the second problem to answer. The second question is:
Does liveness probe start working only after the pod becomes ready?
In other words, I assume readiness probe job is complete once the POD
is ready. After that livenessProbe takes care of health check.
A common initial understanding is that the liveness probe starts checking only after the readiness probe has succeeded, but it turns out not to work like that. An issue was opened about this, and the problem was then solved by adding startup probes.
To sum up:
livenessProbe
livenessProbe: Indicates whether the Container is running. If the
liveness probe fails, the kubelet kills the Container, and the
Container is subjected to its restart policy. If a Container does not
provide a liveness probe, the default state is Success.
readinessProbe
readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod. The default state of readiness before the initial delay is Failure. If a Container does not provide a readiness probe, the default state is Success.
startupProbe
startupProbe: Indicates whether the application within the Container is started. All other probes are disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the kubelet kills the Container, and the Container is subjected to its restart policy. If a Container does not provide a startup probe, the default state is Success
look up here.
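The gating described above can be summarized in a few lines. This is a sketch of the documented behavior rather than kubelet code; the function names are hypothetical, and the start-up budget formula (periodSeconds times failureThreshold) ignores initialDelaySeconds.

```python
def active_probes(has_startup_probe, startup_succeeded):
    """Return which probes run at a given moment in the container's life."""
    if has_startup_probe and not startup_succeeded:
        # Liveness and readiness are held back until the startup probe passes.
        return ["startup"]
    # Without a startup probe (or after it has passed) both run independently.
    return ["liveness", "readiness"]


def startup_budget_seconds(period_seconds, failure_threshold):
    """Approximate time the app gets to start before the container is killed."""
    return period_seconds * failure_threshold


print(active_probes(has_startup_probe=True, startup_succeeded=False))  # ['startup']
print(active_probes(has_startup_probe=False, startup_succeeded=False)) # ['liveness', 'readiness']
# An app that needs about 60 seconds could use periodSeconds=5, failureThreshold=12.
print(startup_budget_seconds(period_seconds=5, failure_threshold=12))  # 60
```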
The liveness probes are to check if the container is started and alive. If this isn’t the case, kubernetes will eventually restart the container.
The readiness probes in turn also check dependencies like database connections or other services your container depends on to do its work. As a developer you have to invest more time in the implementation here than for the liveness probes. You have to expose an endpoint that also checks the mentioned dependencies when queried.
Your current configuration uses a health endpoint of the kind usually used by liveness probes. It probably doesn’t check whether your service is really ready to take traffic.
Kubernetes relies on the readiness probes. During a rolling update, it will keep the old container up and running until the new service declares that it is ready to take traffic. Therefore the readiness probes have to be implemented correctly.
I will show the difference between them in a couple of simple points:
livenessProbe
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 3
It is used to indicate if the container has started and is alive or not i.e. proof of being available.
In the given example, if the request fails, it will restart the container.
If not provided the default state is Success.
readinessProbe
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 3
It is used to indicate whether the container is ready to serve traffic, i.e. proof of being ready to use.
It checks dependencies like database connections or other services your container is depending on to fulfill its work.
In the given example, until the request returns Success, it won't serve any traffic (by removing the Pod’s IP address from the endpoints of all Services that match the Pod).
Kubernetes relies on the readiness probes during rolling updates, it keeps the old container up and running until the new service declares that it is ready to take traffic.
If not provided the default state is Success.
Summary
Liveness Probes: Used to check if the container is available and alive.
Readiness Probes: Used to check if the application is ready to be used and serve the traffic.
Both readiness probe and liveness probe seem to have same behavior. They do same type of checks. But the action they take in case of failures is different.
The readiness probe cuts off traffic from the Service, so that the Service always sends requests to a healthy pod, whereas the liveness probe restarts the container in case of failure. The liveness probe does not do anything for the Service: the Service continues to send requests to the pod as usual if it is in ‘available’ status.
It is recommended to use both probes!!
Check here for detailed explanation with code samples.
The Kubernetes platform has capabilities for validating container applications, called health checks. Liveness is proof of availability, and readiness is proof that the pod is ready to use.
The features are designed to prevent service downtime and inconsistent images by enabling restarts when needed. Kubernetes uses liveness to know when to restart the container, so it can solve most problems. Kubernetes uses readiness to know when the container is available to accept requests. The pod is considered ready when all containers are ready. Therefore, when the pod takes too long to initialize (by cache mount, DB schema, etc.) it is recommended to increase initialDelaySeconds.
I'd post it as a comment but it's too long, So let's make it a full answer.
Is my above config correct for the given requirement?
IMHO no, you are missing initialDelaySeconds for both probes, and liveness and readiness probably should not call the same endpoint. I'd use the suggestions from @fgul.
Does liveness probe start working only after the pod becomes ready ?
In other words, I assume readiness probe job is complete once the POD
is ready. After that livenessProbe takes care of health check. In this
case, I can ignore the initialDelaySeconds for livenessProbe. If they
are independent, what is the point of doing livenessProbe check when
the pod itself is not ready! ?
I think you were thinking about startupProbe; again, @fgul described what does what, so there is no point in me repeating it.
I was assuming, the running pod will take itself down only if the
livenessProbe fails. not the readinessProbe. The doc says other way.
The pod can be restarted only based on the livenessProbe, not the readinessProbe.
I'd think twice before binding a readiness probe to external services (being alive, as @randy advised), especially in high-load services:
Let's assume you have define a deployment with lots of pods, that are connecting to a database and are processing lots of requests.
Now the database goes down.
The readiness probe also checks the db connection, so it marks all of the pods as "out of service".
Now the db goes up.
The pods' readiness probes will start to pass, but not instantly and not on all pods at once: the pods will be marked as "Ready" one after another.
But it might be too slow: the second the first pod is marked as ready, ALL of the traffic will be sent to this one pod alone. It might end in a situation where the "waking up" pods are killed by the traffic one after another.
For that kind of situation I'd say the readiness probe should check only pod-internal stuff and not care about the external services. The kubernetes endpoint will return an error and either the clients might support a failing service (it's called "designed for failure") or the loadbalancer/ingress can cover it.
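A minimal sketch of that advice, using only the Python standard library: liveness and readiness get separate paths, and the readiness check looks at in-process state only. The paths /healthz and /ready, port 8080, and all class and function names are assumptions chosen for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading


class AppState:
    def __init__(self):
        # Set once local start-up work (config, caches) has finished.
        self.warmed_up = threading.Event()


STATE = AppState()


def probe_status(path, state):
    """Map a probe path to the HTTP status code the app would return."""
    if path == "/healthz":
        return 200  # liveness: the process is able to answer at all
    if path == "/ready":
        # readiness: internal state only, no database or downstream calls
        return 200 if state.warmed_up.is_set() else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep probe traffic out of the logs


def serve(port=8080):
    """Blocks forever; call this from the application's entry point."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

With this split, a database outage surfaces as errors on normal requests instead of pulling every Pod out of the Service at the same moment.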
I think the below image describes the use-cases for each.
Liveness probes are a relatively specialized tool, and you probably don't want one at all. However they run totally independently AFAIK.

How to determine a failed kubernetes deployment?

I create a Pod with Replica count of say 2, which runs an application ( a simple web-server), basically it's always running command - However due to mis-configuration, sometimes the command exits and the pod is then Terminated.
Due to default restartPolicy of Always the pod (and hence the container) is restarted and eventually the Pod status is CrashLoopBackOff.
If I do a kubectl describe deployment, it shows Condition as Progressing=True and Available=False.
This looks fine - the question is - how do I mark my deployment as 'failed' in the above case?
Adding a spec.ProgressDeadlineSeconds doesn't seem to be having an effect.
Will simply saying restartPolicy as Never be enough in the Pod specification?
A related question, is there a way of getting this information as a trigger/webhook, without doing a rollout status watch?
A bit of theory
Regarding your question:
How do I mark my deployment as 'failed' in the above case?
Kubernetes gives you two types of health checks:
1 ) Readiness
Readiness probes are designed to let Kubernetes know when your app is ready to serve traffic.
Kubernetes makes sure the readiness probe passes before allowing a service to send traffic to the pod.
If a readiness probe starts to fail, Kubernetes stops sending traffic to the pod until it passes.
2 ) Liveness
Liveness probes let Kubernetes know if your app is alive or dead.
If your app is alive, then Kubernetes leaves it alone. If your app is dead, Kubernetes kills the container and restarts it according to the Pod's restartPolicy.
At the moment (v1.19.0), Kubernetes supports 3 types of mechanisms for implementing liveness and readiness probes:
A ) ExecAction: Executes a specified command inside the container. The diagnostic is considered successful if the command exits with a status code of 0.
B ) TCPSocketAction: Performs a TCP check against the Pod's IP address on a specified port. The diagnostic is considered successful if the port is open.
C ) HTTPGetAction: Performs an HTTP GET request against the Pod's IP address on a specified port and path. The diagnostic is considered successful if the response has a status code greater than or equal to 200 and less than 400.
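The success criteria for the HTTP and TCP mechanisms listed above can be sketched as follows. These helpers are illustrative stand-ins written with the Python standard library, not the kubelet's implementation, and their names are made up for this example.

```python
import socket


def http_probe_succeeded(status_code):
    """HTTPGetAction rule: success means 200 <= status code < 400."""
    return 200 <= status_code < 400


def tcp_probe_succeeded(host, port, timeout_seconds=1.0):
    """TCPSocketAction rule: success means the port accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


print(http_probe_succeeded(204))  # True
print(http_probe_succeeded(302))  # True
print(http_probe_succeeded(503))  # False
```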
In your case:
If the process in your container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod's restartPolicy.
I think that in your case (the need to treat a deployment as succeeded / failed and take the proper action) you should:
Step 1:
Set up an HTTP/TCP readiness probe, for example:
readinessProbe:
  httpGet:
    path: /health-check
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
Where:
initialDelaySeconds — The number of seconds since the container has started before the readiness probe can be initiated.
periodSeconds — How often to perform the readiness probe.
failureThreshold — The number of consecutive failed readiness probes after which the Pod is marked as not ready.
Step 2:
Choose the relevant rolling update strategy and how you should handle cases of failures of new pods (consider reading this thread for examples).
A few references you can follow:
Container probes
Kubernetes Liveness and Readiness Probes
Kubernetes : Configure Liveness and Readiness Probes
Kubernetes and Containers Best Practices - Health Probes
Creating Liveness Probes for your Node.js application in Kubernetes
A Failed Deployment
A deployment (or its rollout process) is considered Failed if it keeps trying to deploy its newest ReplicaSet without ever completing, until the progressDeadlineSeconds interval has been exceeded.
Kubernetes then updates the status with:
Conditions:
  Type            Status  Reason
  ----            ------  ------
  Available       True    MinimumReplicasAvailable
  Progressing     False   ProgressDeadlineExceeded
  ReplicaFailure  True    FailedCreate
Read more in here.
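A script can look for that condition instead of reading the table by eye. The sketch below assumes the Deployment has already been fetched and parsed into a dict, for example from the output of kubectl get deployment <name> -o json; the function name and the sample data are made up for illustration.

```python
def deployment_failed(deployment):
    """Return True if the Deployment reports that its progress deadline was exceeded."""
    conditions = deployment.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Progressing"
        and c.get("status") == "False"
        and c.get("reason") == "ProgressDeadlineExceeded"
        for c in conditions
    )


# Hand-written sample mirroring the conditions shown above.
sample = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
            {"type": "Progressing", "status": "False", "reason": "ProgressDeadlineExceeded"},
        ]
    }
}

print(deployment_failed(sample))  # True
print(deployment_failed({}))      # False
```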
There is no Kubernetes concept for a "failed" deployment. Editing a deployment registers your intent that the new ReplicaSet is to be created, and k8s will repeatedly try to make that intent happen. Any errors that are hit along the way will cause the rollout to block, but they will not cause k8s to abort the deployment.
AFAIK, the best you can do (as of 1.9) is to apply a deadline to the Deployment, which will add a Condition that you can detect when a deployment gets stuck; see https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment and https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds.
It's possible to overlay your own definitions of failure on top of the statuses that k8s provides, but this is quite difficult to do in a generic way; see this issue for a (long!) discussion on the current status of this: https://github.com/kubernetes/kubernetes/issues/1899
Here's some Python code (using pykube) that I wrote a while ago that implements my own definition of ready; I abort my deploy script if this condition does not obtain after 5 minutes.
import logging

from pykube import Pod

_log = logging.getLogger(__name__)


def _is_deployment_ready(d, deployment):
    # d is the deploy script's own context object; it must expose .api and .namespace.
    if not deployment.ready:
        _log.debug('Deployment not completed.')
        return False
    # More replicas in status than requested means old pods are still around.
    if deployment.obj["status"]["replicas"] > deployment.replicas:
        _log.debug('Old replicas not terminated.')
        return False
    selector = deployment.obj['spec']['selector']['matchLabels']
    pods = Pod.objects(d.api).filter(namespace=d.namespace, selector=selector)
    if not pods:
        _log.info('No pods found.')
        return False
    # Check every pod individually; see the note below.
    for pod in pods:
        _log.info('Is pod %s ready? %s.', pod.name, pod.ready)
        if not pod.ready:
            _log.debug('Pod status: %s', pod.obj['status'])
            return False
    _log.info('All pods ready.')
    return True
Note the individual pod check, which is required because a deployment seems to be deemed 'ready' when the rollout completes (i.e. all pods are created), not when all of the pods are ready.
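The 5-minute abort mentioned above is not shown in the snippet. One way it could be done is a generic polling helper like the sketch below; the helper name, the defaults and the injectable clock and sleep parameters are assumptions for this example, not part of the original script.

```python
import time


def wait_until(condition, timeout_seconds=300, interval_seconds=5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns True or the deadline passes.

    Returns True on success and False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)


# Trivial conditions, just to show the two outcomes without waiting.
print(wait_until(lambda: True, timeout_seconds=0, interval_seconds=0))   # True
print(wait_until(lambda: False, timeout_seconds=0, interval_seconds=0))  # False
```

A deploy script would pass a condition that refreshes the deployment object and then calls the readiness check, and exit with an error when the helper returns False.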