Why does Kubernetes report "readiness probe failed" along with "liveness probe failed"?

I have a working Kubernetes deployment of my application.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: my-app
        image: my-image
        ...
        readinessProbe:
          httpGet:
            port: 3000
            path: /
        livenessProbe:
          httpGet:
            port: 3000
            path: /
When I apply my deployment I can see it runs correctly and the application responds to my requests.
$ kubectl describe pod -l app=my-app
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m7s default-scheduler Successfully assigned XXX
Normal Pulled 4m5s kubelet, pool-standard-4gb-2cpu-b9vc Container image "my-app" already present on machine
Normal Created 4m5s kubelet, pool-standard-4gb-2cpu-b9vc Created container my-app
Normal Started 4m5s kubelet, pool-standard-4gb-2cpu-b9vc Started container my-app
The application has a defect and crashes under certain circumstances. I "invoke" such a condition and then I see the following in pod events:
$ kubectl describe pod -l app=my-app
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m45s default-scheduler Successfully assigned XXX
Normal Pulled 6m43s kubelet, pool-standard-4gb-2cpu-b9vc Container image "my-app" already present on machine
Normal Created 6m43s kubelet, pool-standard-4gb-2cpu-b9vc Created container my-app
Normal Started 6m43s kubelet, pool-standard-4gb-2cpu-b9vc Started container my-app
Warning Unhealthy 9s kubelet, pool-standard-4gb-2cpu-b9vc Readiness probe failed: Get http://10.244.2.14:3000/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4s (x3 over 14s) kubelet, pool-standard-4gb-2cpu-b9vc Liveness probe failed: Get http://10.244.2.14:3000/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Normal Killing 4s kubelet, pool-standard-4gb-2cpu-b9vc Container crawler failed liveness probe, will be restarted
It is expected that the liveness probe fails and the container is restarted. But why do I see a Readiness probe failed event?

As @suren wrote in the comments, the readiness probe is not only run at startup; it keeps executing for as long as the container runs. Thus if both liveness and readiness probes are defined (and, as here, they are identical), both the readiness and the liveness probe can fail.
Here is a similar question with a clear in-depth answer.

The readiness probe is used to determine if the container is ready to serve requests. Your container can be running but not passing the probe; if it doesn't pass the check, no Service will route traffic to this container.
By default the readiness probe is executed every 10 seconds (periodSeconds: 10).
You can read more here: https://docs.openshift.com/container-platform/3.9/dev_guide/application_health.html
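For reference, the tunable knobs on a probe like the one in the question look as follows (the values shown are the Kubernetes defaults):
readinessProbe:
  httpGet:
    port: 3000
    path: /
  periodSeconds: 10      # how often to probe (default)
  timeoutSeconds: 1      # per-attempt timeout (default); exceeding it yields the Client.Timeout error above
  successThreshold: 1    # consecutive successes required to be marked Ready (default)
  failureThreshold: 3    # consecutive failures before being marked NotReady (default)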

You configured the same check for the readiness and the liveness probe - therefore when the liveness check fails, the readiness check can be expected to fail as well.
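If you want the two probes to behave differently, give them separate endpoints. A sketch (the /ready and /live paths are hypothetical; the application would have to implement them):
readinessProbe:
  httpGet:
    port: 3000
    path: /ready    # hypothetical: "can I accept traffic right now?"
livenessProbe:
  httpGet:
    port: 3000
    path: /live     # hypothetical: "am I deadlocked and in need of a restart?"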

Provide an implementation for the probe endpoint in your backend: you can expose a URI named /health and put your liveness logic there (readiness can use the same endpoint or its own).
The function behind the /health URI should return a 200 status code if everything is fine, and a failure status code otherwise.
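A minimal sketch of such a handler in Go (healthy() is a placeholder for your own logic; the port and path are assumptions matching the probes above):
package main

import "net/http"

// healthy is a placeholder for your real check (DB ping, dependency status, ...).
func healthy() bool { return true }

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		if healthy() {
			w.WriteHeader(http.StatusOK) // the kubelet treats any status >= 200 and < 400 as success
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable) // anything else counts as a probe failure
	})
	http.ListenAndServe(":3000", nil)
}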

Related

Ingress-nginx is in CrashLoopBackOff after K8s upgrade

After upgrading Kubernetes node pool from 1.21 to 1.22, ingress-nginx-controller pods started crashing. The same deployment has been working fine in EKS. I'm just having this issue in GKE. Does anyone have any ideas about the root cause?
$ kubectl logs ingress-nginx-controller-5744fc449d-8t2rq -c controller
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v1.3.1
Build: 92534fa2ae799b502882c8684db13a25cde68155
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.10
-------------------------------------------------------------------------------
W0219 21:23:08.194770 8 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0219 21:23:08.194995 8 main.go:209] "Creating API client" host="https://10.1.48.1:443"
Ingress pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned infra/ingress-nginx-controller-5744fc449d-8t2rq to gke-infra-nodep-ffe54a41-s7qx
Normal Pulling 27m kubelet Pulling image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974"
Normal Started 27m kubelet Started container controller
Normal Pulled 27m kubelet Successfully pulled image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974" in 6.443361484s
Warning Unhealthy 26m (x6 over 26m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 502
Normal Killing 26m kubelet Container controller failed liveness probe, will be restarted
Normal Created 26m (x2 over 27m) kubelet Created container controller
Warning FailedPreStopHook 26m kubelet Exec lifecycle hook ([/wait-shutdown]) for Container "controller" in Pod "ingress-nginx-controller-5744fc449d-8t2rq_infra(c4c166ff-1d86-4385-a22c-227084d569d6)" failed - error: command '/wait-shutdown' exited with 137: , message: ""
Normal Pulled 26m kubelet Container image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974" already present on machine
Warning BackOff 7m7s (x52 over 21m) kubelet Back-off restarting failed container
Warning Unhealthy 2m9s (x55 over 26m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 502
The Beta API versions (extensions/v1beta1 and networking.k8s.io/v1beta1) of Ingress are no longer served (removed) for GKE clusters created on versions 1.22 and later. Please refer to the official GKE ingress documentation for changes in the GA API version.
Also refer to Official Kubernetes documentation for API removals for Kubernetes v1.22 for more information.
Before upgrading your Ingress API as a client, make sure that every ingress controller that you use is compatible with the v1 Ingress API. See Ingress Prerequisites for more context about Ingress and ingress controllers.
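For reference, a minimal Ingress in the GA networking.k8s.io/v1 schema (the resource and service names are placeholders):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress       # placeholder
spec:
  ingressClassName: nginx     # replaces the old kubernetes.io/ingress.class annotation
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix      # required in v1
        backend:
          service:            # v1beta1's serviceName/servicePort became backend.service
            name: my-service  # placeholder
            port:
              number: 80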
Also check the possible causes for CrashLoopBackOff below:
Increasing the initialDelaySeconds value for the livenessProbe setting may help to alleviate the issue, as it gives the container more time to start up and perform its initial work before the liveness probe checks its health (see the sketch after this list).
Check the container restart policy: the spec of a Pod has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always.
Out of memory or resources: try increasing the VM size. Containers may crash due to memory limits; new ones are then spun up, the health check fails, and Ingress serves up 502s.
Check whether externalTrafficPolicy=Local is set on the NodePort service; it prevents nodes from forwarding traffic to other nodes.
Refer to the Github issue Document how to avoid 502s #34 for more information.
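A sketch of the initialDelaySeconds suggestion above (the path, port and delay are assumptions to adapt to the controller's actual probe):
livenessProbe:
  httpGet:
    path: /healthz
    port: 10254              # assumption: ingress-nginx's default health-check port
  initialDelaySeconds: 30    # extra startup headroom before the first liveness check
  periodSeconds: 10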

Why am I seeing LoadBalancerNegNotReady when this Deployment has no services or ingresses connected to it?

The Pod in my Deployment takes 20+ minutes to be marked as healthy every time there is a new revision.
The logs show the ready and live endpoints responding with 200.
The application is not throwing errors.
Doing a describe on the pod I see this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal LoadBalancerNegNotReady 52s neg-readiness-reflector Waiting for pod to become healthy in at least one of the NEG(s): [k8s1-mypod]
Normal Scheduled 52s default-scheduler Successfully assigned mypod to mycluster
Normal Pulling 51s kubelet Pulling image "gcr.io/myproject/myimage"
Normal Pulled 51s kubelet Successfully pulled image "gcr.io/myproject/myimage" in 196.170923ms
Normal Created 51s kubelet Created container custom-metrics
Normal Started 51s kubelet Started container custom-metrics
Warning Unhealthy 50s kubelet Readiness probe failed: Get "http://10.123.123.123:80/api/ready": dial tcp 10.123.123.123:80: connect: connection refused
I know there is an initial fail of the probe, but it starts responding healthy very quickly after that as I see in the logs.
I'm at a loss. Do I need to set initialDelaySeconds or something? I don't see how that would change anything.
These are my healthcheck settings:
livenessProbe:
  timeoutSeconds: 30
  httpGet:
    port: 80
    path: /api/live
readinessProbe:
  timeoutSeconds: 30
  httpGet:
    port: 80
    path: /api/ready
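For reference, this is roughly what adding an initial delay would look like (the value is a placeholder, not a confirmed fix):
readinessProbe:
  timeoutSeconds: 30
  initialDelaySeconds: 10   # placeholder: give the app time to bind port 80
  httpGet:
    port: 80
    path: /api/ready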

kubernetes cannot pull a public image

Kubernetes cannot pull a public image. Standard images like nginx download successfully, but my pet project's image does not. I'm using minikube to launch the Kubernetes cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway-deploumnet
  labels:
    app: api-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
      - name: api-gateway
        image: creatorsprodhouse/api-gateway:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 80
When I try to create the deployment, I get an error that Kubernetes cannot download my public image.
$ kubectl get pods
result:
NAME READY STATUS RESTARTS AGE
api-gateway-deploumnet-599c784984-j9mf2 0/1 ImagePullBackOff 0 13m
api-gateway-deploumnet-599c784984-qzklt 0/1 ImagePullBackOff 0 13m
api-gateway-deploumnet-599c784984-csxln 0/1 ImagePullBackOff 0 13m
$ kubectl logs api-gateway-deploumnet-599c784984-csxln
result
Error from server (BadRequest): container "api-gateway" in pod "api-gateway-deploumnet-86f6cc5b65-xdx85" is waiting to start: trying and failing to pull image
What could be the problem? The standard images are downloading but my public one is not. Any help would be appreciated.
EDIT 1
$ kubectl describe pod api-gateway-deploumnet-599c784984-csxln
result:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m22s default-scheduler Successfully assigned default/api-gateway-deploumnet-849899786d-mq4td to minikube
Warning Failed 3m8s kubelet Failed to pull image "creatorsprodhouse/api-gateway:latest": rpc error: code = Unknown desc = context deadline exceeded
Warning Failed 3m8s kubelet Error: ErrImagePull
Normal BackOff 3m7s kubelet Back-off pulling image "creatorsprodhouse/api-gateway:latest"
Warning Failed 3m7s kubelet Error: ImagePullBackOff
Normal Pulling 2m53s (x2 over 8m21s) kubelet Pulling image "creatorsprodhouse/api-gateway:latest"
EDIT 2
If I pull the Docker image directly, it downloads fine:
$ docker pull creatorsprodhouse/api-gateway:latest
result:
Digest: sha256:e664a9dd9025f80a3dd60d157ce1464d4df7d0f8a00538e6a137d44f9f9f12aa
Status: Downloaded newer image for creatorsprodhouse/api-gateway:latest
docker.io/creatorsprodhouse/api-gateway:latest
EDIT 3
After advice to restart minikube
$ minikube stop
$ minikube delete --purge
$ minikube start --cni=calico
I started the pods.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m28s default-scheduler Successfully assigned default/api-gateway-deploumnet-849899786d-bkr28 to minikube
Warning FailedCreatePodSandBox 4m27s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "7e112c92e24199f268ec9c6f3a6db69c2572c0751db9fd57a852d1b9b412e0a1" network for pod "api-gateway-deploumnet-849899786d-bkr28": networkPlugin cni failed to set up pod "api-gateway-deploumnet-849899786d-bkr28_default" network: failed to set bridge addr: could not add IP address to "cni0": permission denied, failed to clean up sandbox container "7e112c92e24199f268ec9c6f3a6db69c2572c0751db9fd57a852d1b9b412e0a1" network for pod "api-gateway-deploumnet-849899786d-bkr28": networkPlugin cni failed to teardown pod "api-gateway-deploumnet-849899786d-bkr28_default" network: running [/usr/sbin/iptables -t nat -D POSTROUTING -s 10.85.0.34 -j CNI-57e7da7379b524635074e6d0 -m comment --comment name: "crio" id: "7e112c92e24199f268ec9c6f3a6db69c2572c0751db9fd57a852d1b9b412e0a1" --wait]: exit status 2: iptables v1.8.4 (legacy): Couldn't load target `CNI-57e7da7379b524635074e6d0':No such file or directory
Try `iptables -h' or 'iptables --help' for more information.
I could not solve the problem in the ways that were suggested to me. However, it worked when I ran minikube with a different driver:
$ minikube start --driver=none
--driver=none means that the cluster will run on your host instead of the default --driver=docker, which runs the cluster in Docker.
It is better to run minikube with --driver=docker as it is safer and easier, but that didn't work for me because I could not download my images. For me personally it is OK to use --driver=none, although it is a bit dangerous.
In general, if anyone knows what the problem is, please answer my question. In the meantime you can try running the minikube cluster on your host with the command mentioned above.
In any case, thank you very much for your attention!

Are healthchecks defined per container or per pod in Kubernetes?

The Google Cloud blog says that if a readiness probe fails, traffic will not be routed to the pod, and if a liveness probe fails, the pod will be restarted.
The Kubernetes docs say that the kubelet uses liveness probes to know if a container needs to be restarted, and readiness probes to check if a container is ready to start accepting requests from clients.
My current understanding is that a pod is considered Ready and Alive when all of its containers are ready. This in turn implies that if 1 out of 3 containers in a pod fails, the entire pod will be considered failed (not Ready / not Alive). And if 1 out of 3 containers was restarted, then it means that the entire pod was restarted. Is this correct?
A Pod is ready only when all of its containers are ready.
When a Pod is ready, it should be added to the load balancing pools of all matching Services because it means that this Pod is able to serve requests.
As you can see in the Readiness Probe documentation:
The kubelet uses readiness probes to know when a container is ready to start accepting traffic.
Using a readiness probe ensures that traffic does not reach a container that is not ready for it.
Using a liveness probe ensures that a container is restarted when it fails (the kubelet kills and restarts only that specific container).
Additionally, to answer your last question, I will use an example:
And if 1 out of 3 containers was restarted, then it means that the entire pod was restarted. Is this correct?
Let's take a simple Pod manifest with a livenessProbe for one container that always fails:
---
# web-app.yml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: web-app
  name: web-app
spec:
  containers:
  - image: nginx
    name: web
  - image: redis
    name: failed-container
    livenessProbe:
      httpGet:
        path: /healthz # I don't have this endpoint configured, so it will always fail.
        port: 8080
After creating the web-app Pod and waiting some time, we can check how the livenessProbe works:
$ kubectl describe pod web-app
Name:         web-app
Namespace:    default
Containers:
  web:
    ...
    State:          Running
      Started:      Tue, 09 Mar 2021 09:56:59 +0000
    Ready:          True
    Restart Count:  0
    ...
  failed-container:
    ...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
    Ready:          False
    Restart Count:  7
    ...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Normal Killing 9m40s (x2 over 10m) kubelet Container failed-container failed liveness probe, will be restarted
...
As you can see, only the failed-container container was restarted (Restart Count: 7).
More information can be found in the Liveness, Readiness and Startup Probes documentation.
For Pods with multiple containers, there is also an option to restart only a single container (with some caveats, and provided you have the required access):
Command:
kubectl exec POD_NAME -c CONTAINER_NAME "Command used for restarting the container"
That way the Pod is not deleted and k8s doesn't need to recreate it.
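For example, assuming the container's main process runs as PID 1 and the image ships a shell (both assumptions about the image), killing PID 1 makes the kubelet restart just that container:
$ kubectl exec web-app -c failed-container -- /bin/sh -c "kill 1"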

K8s liveness probe behavior when the pod contains more than one container?

Scenario: a K8s pod has more than one container, and liveness/readiness probes are configured for each of the containers. Now if the liveness probe succeeds on some containers and fails on a few others, what will k8s do?
will it restart only the failing containers
OR
will it restart the entire pod.
if the liveness probe succeeds on some containers and fails on a few others, what will k8s do?
It will restart only the failing containers.
In Pod Lifecycle - Container Probes you have all 3 probes listed: liveness, readiness and startup.
livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a Container does not provide a liveness probe, the default state is Success.
In Configure Liveness, Readiness and Startup Probes - Define a liveness command there is an example, and it is mentioned that:
If the command succeeds, it returns 0, and the kubelet considers the container to be alive and healthy. If the command returns a non-zero value, the kubelet kills the container and restarts it.
The same applies to the HTTP request liveness probe:
If the handler for the server's /healthz path returns a success code, the kubelet considers the container to be alive and healthy. If the handler returns a failure code, the kubelet kills the container and restarts it.
And with the TCP liveness probe:
The kubelet will run the first liveness probe 15 seconds after the container starts. Just like the readiness probe, this will attempt to connect to the goproxy container on port 8080. If the liveness probe fails, the container will be restarted.
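The corresponding probe stanza from the docs' goproxy example looks roughly like this:
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15   # first liveness probe 15 seconds after the container starts
  periodSeconds: 20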
Tests
If you would like to create your own test, you can use this example of an HTTP liveness probe:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http-probe
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 0
      periodSeconds: 5
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      failureThreshold: 3
  - name: nginx
    image: nginx
After a while you will be able to see that the container was restarted and the restart count increased, but the Pod still exists, since its Age keeps counting.
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
liveness-http-probe 2/2 Running 0 20s
liveness-http-probe 1/2 Running 0 23s
liveness-http-probe 1/2 Running 1 42s
liveness-http-probe 2/2 Running 1 43s
liveness-http-probe 1/2 Running 1 63s
...
liveness-http-probe 1/2 Running 5 3m23s
liveness-http-probe 2/2 Running 5 3m23s
liveness-http-probe 1/2 Running 5 3m43s
liveness-http-probe 1/2 CrashLoopBackOff 5 4m1s
liveness-http-probe 1/2 Running 6 5m25s
liveness-http-probe 2/2 Running 6 5m28s
liveness-http-probe 1/2 Running 6 5m48s
liveness-http-probe 1/2 CrashLoopBackOff 6 6m2s
liveness-http-probe 1/2 Running 7 8m46s
liveness-http-probe 2/2 Running 7 8m48s
...
liveness-http-probe 2/2 Running 11 21m
liveness-http-probe 1/2 Running 11 21m
liveness-http-probe 1/2 CrashLoopBackOff 11 22m
liveness-http-probe 1/2 Running 12 27m
...
liveness-http-probe 1/2 Running 13 28m
liveness-http-probe 1/2 CrashLoopBackOff 13 28m
And in the pod description you will see aggregated warnings with counters like (x8 over 28m), (x84 over 24m) or (x2 over 28m).
Normal Pulling 28m (x2 over 28m) kubelet Pulling image "k8s.gcr.io/liveness"
Normal Killing 28m kubelet Container liveness failed liveness probe, will be restarted
Normal Started 28m (x2 over 28m) kubelet Started container liveness
Normal Created 28m (x2 over 28m) kubelet Created container liveness
Normal Pulled 28m kubelet Successfully pulled image "k8s.gcr.io/liveness" in 561.418121ms
Warning Unhealthy 27m (x8 over 28m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 27m (x4 over 28m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 500
Normal Pulled 13m (x2 over 14m) kubelet (combined from similar events): Successfully pulled image "k8s.gcr.io/liveness" in 508.892628ms
Warning BackOff 3m45s (x84 over 24m) kubelet Back-off restarting failed container
Lately I did some tests with liveness and readiness probes in the thread Liveness Probe, Readiness Probe not called in expected duration. It can provide you with additional information.
It will restart only the failing container.
As per the k8s docs:
The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready.
To perform a probe, the kubelet sends an HTTP GET request to the server that is running in the container and listening on port 8080. If the handler for the server's /healthz path returns a success code, the kubelet considers the container to be alive and healthy. If the handler returns a failure code, the kubelet kills the container and restarts it.
Whilst a Pod is running, the kubelet is able to restart containers to handle some kind of faults. Within a Pod, Kubernetes tracks different container states and determines what action to take to make the Pod healthy again.
You can check the pod events to see whether the container was restarted or not.
Ref: k8s doc and probes