Kubernetes Service unavailable when container crashes - kubernetes

In my Kubernetes cluster, I have a single pod (i.e. one replica) with two containers: server and cache.
I also have a Kubernetes Service that matches my pod.
If cache is crashing, when I try to send an HTTP request to server via my Service, I get a "503 Service Temporarily Unavailable".
The HTTP request is going into the cluster via Nginx Ingress, and I suspect that the problem is that when cache is crashing, Kubernetes removes my one pod from the Service load balancers, as promised in the Kubernetes documentation:
The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
I don't prefer this behavior, since I still want to be able server to respond to requests even if cache has failed. Is there any way to get this desired behavior?

A POD is brought to the "Failed" state if one of the following conditions occur
One of its containers exit with non-zero status
Kubernates terminates a container due to health checker failing
So, if you need one of your containers to still respond when another one fails,
Make sure your liveliness probe is pointed to the container you need to be continuing. The health checker will get success code always and will not mark the POD as "Failed"
Make sure the readiness probe is pointed to the container you neesd to be continuing. This will make sure that the load balancer will still send the traffic to your pod.
Make sure that you handle the container errors gracefully and make them exit with zero status code.
In the following example readiness and liveliness probes, make sure that the port 8080 is handled by the service container and it has the /healthz and /ready routes active.
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
livenessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 1

The behavior I am looking for is configurable on the Service itself via the publishNotReadyAddresses option:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.21/#servicespec-v1-core

Related

how can I schedule Healthcheck in Kubernetes livenessProbe

since we need to stop a service in the kubernetes pot at night because we need to index something in our app(we also can't stop the pod because we need to index somethin in the pod.). While Index HealthEndpoint is unreachable, Prometheus warns that the service is not live. I want to disable liveness at night.
livenessProbe:
httpGet:
path: /healthcheck
port: 8080
initialDelaySeconds: 60
periodSeconds: 3
I've searched a lot on the internet but haven't found a solution.
You have to use readinessProbe. The difference is that a service can recover from un-readiness while it cannot from failed liveness.
So as long as readinessProbe reports a failure, your pod will be taken out of service, i.e. it will not receive any requests, until readinessProbe reports health again.
Note that there is no connection between livenessProbe and readinessProbe. Both are checked independent from each other. So you may and should implement different check endpoints for them. While the readiness endpoint temporarily reports unhealthiness, your liveness endpoint still has to report health. Of course, you can ommit a livenessProbe if your are comfortable with that.

Why do I need 3 different kind of probes in kubernetes: startupProbe, readinessProbe, livenessProbe

Why do I need 3 different kind of probes in kubernetes:
startupProbe
readinessProbe
livenessProbe
There are some questions (k8s - livenessProbe vs readinessProbe, Setting up a readiness, liveness or startup probe) and articles about this topic. But this is not so clear:
Why do I need 3 different kind of probes?
What are the use cases?
What are the best practises?
These 3 kind of probes have 3 different use cases. That's why we need 3 kind of probes.
Liveness Probe
If the Liveness Probe fails, the pod will be restarted (read more about failureThreshold).
Use case: Restart pod, if the pod is dead.
Best practices: Only include basic checks in the liveness probe. Never include checks on connections to other services (e.g. database). The check shouldn't take too long to complete.
Always specify a light Liveness Probe to make sure that the pod will be restarted, if the pod is really dead.
Startup Probe
Startup Probes check, when the pod is available after startup.
Use case: Send traffic to the pod, as soon as the pod is available after startup. Startup probes might take longer to complete, because they are only called on initializing. They might call a warmup task (but also consider init containers for initialization). After the Startup probe succeeds, the liveliness probe is called.
Best practices: Specify a Startup Probe, if the pod takes a long time to start. The Startup and Liveness Probe can use the same endpoint, but the Startup Probe can have a less strict failure threshhold which prevents a failure on startup (s. Kubernetes in Action).
Readiness Probe
In contrast to Startup Probes Readiness Probes check, if the pod is available during the complete lifecycle.
In contrast to Liveness Probes only the traffic to the pod is stopped, if the Readiness probe fails, but there will be no restart.
Use case: Stop sending traffic to the pod, if the pod can not temporarily serve because a connection to another service (e.g. database) fails and the pod will recover later.
Best practices: Include all necessary checks including connections to vital services. Nevertheless the check shouldn't take too long to complete.
Always specify a Readiness Probe to make sure that the pod only gets traffic, if the pod can properly handle incoming requests.
Documentation
This article explains very well the differences between the 3 kind of probes.
The Official kubernetes documentation gives a good overview about all configuration options.
Best practises for probes.
The book Kubernetes in Action gives most detailed insights about the best practises.
The difference between livenessProbe, readinessProbe, and startupProbe
livenessProbe:
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
It is used to indicate if the container has started and is alive or not i.e. proof of being avaliable.
In the given example, if the request fails, it will restart the container.
If not provided the default state is Success.
readinessProbe:
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
It is used to indicate if the container is ready to serve traffic or not i.e.proof of being ready to use.
It checks dependencies like database connections or other services your container is depending on to fulfill its work.
In the given example, until the request returns Success, it won't serve any traffic(by removing the Pod’s IP address from the endpoints of all Services that match the Pod).
Kubernetes relies on the readiness probes during rolling updates, it keeps the old container up and running until the new service declares that it is ready to take traffic.
If not provided the default state is Success.
startupProbe:
startupProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
It is used to indicate if the application inside the Container has started.
If a startup probe is provided, all other probes are disabled.
In the given example, if the request fails, it will restart the container.
Once the startup probe has succeeded once, the liveness probe takes over to provide a fast response to container deadlocks.
If not provided the default state is Success.
Check K8S documenation for more.
I think the below table describes the use-cases for each.
Feature
Readiness Probe
Liveness Probe
Startup Probe
Exmine
Indicates whether the container is ready to service requests.
Indicates whether the container is running.
Indicates whether the application within the container is started.
On Failure
If the readiness probe fails, the endpoints controller removes the pod's IP address from the endpoints of all services that match the pod.
If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy.
If the startup probe fails, the kubelet kills the container, and the container is subjected to its restart policy.
Default Case
The default state of readiness before the initial delay is Failure. If a container does not provide a readiness probe, the default state is Success.
If a container does not provide a liveness probe, the default state is Success.
If a container does not provide a startup probe, the default state is Success.
Sources:
Kubernetes in Action
Here's a concrete example of one we're using in our app. It has a single crude HTTP healthcheck, accessible on http://hostname:8080/management/health.
ports:
- containerPort: 8080
name: http-traffic
App Initialization (startup)
Spring app that is slow to start - anywhere between 30-120 seconds.
Don't want other probes to run until app is started.
Check it every 10 seconds for up to 180s before k8s gets into a crash loop.
startupProbe:
successThreshold: 1
failureThreshold: 18
periodSeconds: 10
timeoutSeconds: 5
httpGet:
path: /management/health
port: web-traffic
Healthcheck (readiness)
Ping the app every 10 seconds to make sure it's healthy (ie. accepting HTTP requests).
If fail two subsequent pings, cordone it off (prevents cascades).
Must pass two subsequent health checks before can accept traffic again.
readinessProbe:
successThreshold: 2
failureThreshold: 2
periodSeconds: 10
timeoutSeconds: 5
httpGet:
path: /management/health
port: web-traffic
App has Died (liveliness)
If app fails 3 consecutive health checks, 30 seconds apart, reboot the container. Maybe app got into an unrecoverable state like Java ran out of heap memory.
livenessProbe:
successThreshold: 1
failureThreshold: 3
periodSeconds: 30
timeoutSeconds: 5
httpGet:
path: /management/health
port: web-traffic

Liveness-Probe of one pod via another

On my Kubernetes Setup, I have 2 pods - A (via deployment) and B(via DS).
Pod B is somehow dependent on Pod A being fully started through. I would now like to set an HTTP Liveness-Probe in Pods B, to restart POD B if health check via POD A fails. Restarting works fine if I put the External IP of my POD A's service in the host. The issue is in resolving DNS name in the host.
It works if I set it like this:
livenessProbe:
httpGet:
host: <POD_A_SERVICE_EXTERNAL_IP_HERE>
path: /health
port: 8000
Fails if I set it like this:
livenessProbe:
httpGet:
host: auth
path: /health
port: 8000
Failed with following error message:
Liveness probe failed: Get http://auth:8000/health: dial tcp: lookup auth on 8.8.8.8:53: no such host
ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Is the following line on the above page true for HTTP Probes as well?
"you can not use a service name in the host parameter since the kubelet is unable to resolve it."
Correct 👍, DNS doesn't work for liveness probes, the kubelet network space cannot basically resolve any in-cluster DNS.
You can consider putting both of your services in a single pod as sidecars. This way they would share the same address space if one container fails then the whole pod is restarted.
Another option is to create an operator 🔧 for your pods/application and basically have it check the liveness through the in-cluster DNS for both pods separately and restart the pods through the Kubernetes API.
You can also just create your own script in a pod that just calls curl to check for a 200 OK and kubectl to restart your pod if you get something else.
Note that for the 2 options above you need to make sure that Coredns is stable and solid otherwise your health checks might fail to make your services have potential downtime.
✌️☮️

k8s - livenessProbe vs readinessProbe

Consider a pod which has a healthcheck setup via a http endpoint /health at port 80 and it takes almost 60 seconds to be actually ready & serve the traffic.
readinessProbe:
httpGet:
path: /health
port: 80
initialDelaySeconds: 60
livenessProbe:
httpGet:
path: /health
port: 80
Questions:
Is my above config correct for the given requirement?
Does liveness probe start working only after the pod becomes ready ? In other words, I assume readiness probe job is complete once the POD is ready. After that livenessProbe takes care of health check. In this case, I can ignore the initialDelaySeconds for livenessProbe. If they are independent, what is the point of doing livenessProbe check when the pod itself is not ready! ?
Check this documentation. What do they mean by
If you want your Container to be able to take itself down for
maintenance, you can specify a readiness probe that checks an endpoint
specific to readiness that is different from the liveness probe.
I was assuming, the running pod will take itself down only if the livenessProbe fails. not the readinessProbe. The doc says other way.
Clarify!
I'm starting from the second problem to answer. The second question is:
Does liveness probe start working only after the pod becomes ready?
In other words, I assume readiness probe job is complete once the POD
is ready. After that livenessProbe takes care of health check.
Our initial understanding is that liveness probe will start to check after readiness probe was succeeded but it turn out not to be like that. It has opened an issue for this challenge.Yon can look up to here. Then It was solved this problem by adding startup probes.
To sum up:
livenessProbe
livenessProbe: Indicates whether the Container is running. If the
liveness probe fails, the kubelet kills the Container, and the
Container is subjected to its restart policy. If a Container does not
provide a liveness probe, the default state is Success.
readinessProbe
readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod. The default state of readiness before the initial delay is Failure. If a Container does not provide a readiness probe, the default state is Success.
startupProbe
startupProbe: Indicates whether the application within the Container is started. All other probes are disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the kubelet kills the Container, and the Container is subjected to its restart policy. If a Container does not provide a startup probe, the default state is Success
look up here.
The liveness probes are to check if the container is started and alive. If this isn’t the case, kubernetes will eventually restart the container.
The readiness probes in turn also check dependencies like database connections or other services your container is depending on to fulfill it’s work. As a developer you have to invest here more time into the implementation than just for the liveness probes. You have to expose an endpoint which is also checking the mentioned dependencies when queried.
Your current configuration uses a health endpoint which are usually used by liveness probes. It probably doesn’t check if your services is really ready to take traffic.
Kubernetes relies on the readiness probes. During a rolling update, it will keep the old container up and running until the new service declares that it is ready to take traffic. Therefore the readiness probes have to be implemented correctly.
I will show the difference between them in a couple of simple points:
livenessProbe
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
It is used to indicate if the container has started and is alive or not i.e. proof of being available.
In the given example, if the request fails, it will restart the container.
If not provided the default state is Success.
readinessProbe
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
It is used to indicate if the container is ready to serve traffic or not i.e.proof of being ready to use.
It checks dependencies like database connections or other services your container is depending on to fulfill its work.
In the given example, until the request returns Success, it won't serve any traffic(by removing the Pod’s IP address from the endpoints of all Services that match the Pod).
Kubernetes relies on the readiness probes during rolling updates, it keeps the old container up and running until the new service declares that it is ready to take traffic.
If not provided the default state is Success.
Summary
Liveness Probes: Used to check if the container is available and alive.
Readiness Probes: Used to check if the application is ready to be used and serve the traffic.
Both readiness probe and liveness probe seem to have same behavior. They do same type of checks. But the action they take in case of failures is different.
Readiness Probe shuts the traffic from service down. so that service can always the send the request to healthy pod whereas the liveness probe restarts the pod in case of failure. It does not do anything for the service. Service continues to send the request to the pods as usual if it is in ‘available’ status.
It is recommended to use both probes!!
Check here for detailed explanation with code samples.
The Kubernetes platform has capabilities for validating container applications, called healthchecks. Liveness is proof of availability and readness is proof of pod readiness is ready to use.
The features are designed to prevent service downtime and inconsistent images by enabling restarts when needed. Kubernetes uses liveness to know when to restart the container, so it can solve most problems. Kubernetes uses readness to know when the container is available to accept requests. The pod is considered ready when all containers are ready. Therefore, when the pod takes too long to initialize (by cache mount, DB schema, etc.) it is recommended to increase initialDelaySeconds.
I'd post it as a comment but it's too long, So let's make it a full answer.
Is my above config correct for the given requirement?
IMHO no, you are missing initialDelaySeconds for both probes and liveness and rediness probably should not call the same endpoint. I'd use the suggestionss form #fgul
Does liveness probe start working only after the pod becomes ready ?
In other words, I assume readiness probe job is complete once the POD
is ready. After that livenessProbe takes care of health check. In this
case, I can ignore the initialDelaySeconds for livenessProbe. If they
are independent, what is the point of doing livenessProbe check when
the pod itself is not ready! ?
I think you were thinking about startupProbe, again #fgul described what does what so there is no point in me repeating.
I was assuming, the running pod will take itself down only if the
livenessProbe fails. not the readinessProbe. The doc says other way.
The pod can be restarted only based on livenessProbe, not the redinessProbe.
I'd think twice before binding a rediness probe with external services (being alive as #randy advised), especially in high load services:
Let's assume you have define a deployment with lots of pods, that are connecting to a database and are processing lots of requests.
Now the database goes down.
The rediness probe is checking also db connection and it marks all of the pods as "out of service".
Now the db goes up.
Pods rediness probe will start to pass but not instantly and on all pods right away - the pods will be marked as "Ready" one after an other.
But it might be too slow - the second the first pod will be marked as ready, ALL of the traffic will be sent to this one pod alone. It might end in a situation that the "waking up" pods will be killed by the traffic one after an other.
For that kind of situation I'd say the rediness pod should check only pod internal stuff and don't care about the externall services. The kubernetes endpoint will return an error and either the clients might support failing service (it's called "designed for failure") or the loadbalancer/ingress can cover it.
I think the below image describes the use-cases for each.
Liveness probes are a relatively specialized tool, and you probably don't want one at all. However they run totally independently AFAIK.

Kubernetes: readinessProbes failing but the livelinessProbe is succeeding with the same settings

I have a livelinessProbe configured for my pod which does a http-get on path on the same pod and a particular port. It works perfectly. But, if I use the same settings and configure a readinessProbe it fails with the below error.
Readiness probe failed: wsarecv: read tcp :50578->:80: An existing connection was forcibly closed by the remote host
Actually after certain point I even see the liveliness probes failing. not sure why . Liveliness probe succeeding should indicate that the kube-dns is working fine and we're able to reach the pod from the node. Here's the readinessProbe for my pod's spec
readinessProbe:
httpGet:
path: /<path> # -> this works for livelinessProbe
port: 80
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 10
Does anyone have an idea what might be going on here.
I don't think it has anything to do with kube-dns or coredns. The most likely cause here is that your pod/container/application is crashing or stop serving requests.
Seems like this timeline:
Pod/container comes up.
Liveliness probe passes ok.
Some time passes.
Probably app crash or error.
Readiness fails.
Liveliness probe fails too.
More information about what that error means here:
An existing connection was forcibly closed by the remote host