LivenessProbe command for a background process - kubernetes

What is an appropriate Kubernetes livenessProbe command for a background process?
We have a NodeJS process that consumes messages off an SQS queue. Since it's a background job we don't expose any HTTP endpoints, so a liveness command seems to be the more appropriate way to do the liveness check. What would a "good enough" command setup look like that actually checks the process is alive and running properly? Should the NodeJS process touch a file to update its modified time, and the liveness check validate that? Examples I've seen online seem disconnected from the actual process, e.g. they only check that a file exists.

You could configure the liveness probe to use an exec command.
Here is an example:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
To perform a probe, the kubelet executes the command cat /tmp/healthy in the target container. If the command succeeds, it returns 0, and the kubelet considers the container to be alive and healthy. If the command returns a non-zero value, the kubelet kills the container and restarts it.
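To tie the probe to the actual work the process does, the heartbeat-file pattern you describe is a reasonable fit: have the NodeJS worker touch a file every time it completes a poll/processing cycle, and let the probe fail when the file goes stale. A minimal sketch, assuming a hypothetical /tmp/heartbeat path, a one-minute staleness budget, and a find binary that supports -mmin:
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # succeeds only if /tmp/heartbeat exists and was modified within the last minute
    - test "$(find /tmp/heartbeat -mmin -1)"
  initialDelaySeconds: 30
  periodSeconds: 30
On the NodeJS side the worker would simply write to (or call fs.utimes on) the heartbeat file at the end of each successful iteration; if the event loop hangs or the consumer stops polling, the file stops being updated and the probe fails.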

Related

Init container not restarting on pod restart

I have an init container that does some setup the main container needs in order to run correctly, like creating some directories, and a liveness probe that may fail if one of these directories is deleted. When the pod is restarted due to a liveness probe failure I expect the init container to be restarted as well, but it isn't.
This is what kubernetes documentation says about this:
If the Pod restarts, or is restarted, all init containers must execute again.
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
The easiest way to demonstrate this behavior was to take the example pod from the k8s documentation, add a liveness probe that always fails, and expect the init container to be restarted; but again, it does not behave as expected.
This is the example I'm working with:
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  restartPolicy: Always
  containers:
  - name: myapp-container
    image: busybox:1.28
    command: ['sh', '-c', 'echo "App started at $(date)" && tail -f /dev/null']
    livenessProbe:
      exec:
        command:
        - sh
        - -c
        - exit 1
      initialDelaySeconds: 1
      periodSeconds: 1
  initContainers:
  - name: myapp-init
    image: busybox:1.28
    command: ['/bin/sh', '-c', 'sleep 5 && echo "Init container started at $(date)"']
The sleep and date commands are there to confirm whether the init container was restarted.
The pod is being restarted:
NAME READY STATUS RESTARTS AGE
pod/myapp-pod 1/1 Running 4 2m57s
But from the logs it's clear that the init container doesn't restart:
$ k logs pod/myapp-pod myapp-init
Init container started at Thu Jun 16 12:12:03 UTC 2022
$ k logs pod/myapp-pod myapp-container
App started at Thu Jun 16 12:14:20 UTC 2022
I checked it on both v1.19.5 and v1.24.0 kubernetes servers.
The question is how to force the init container to restart on pod restart.
The restart number refers to container restarts, not pod restarts.
Init containers need to run only once in a pod's lifetime, and you need to design your containers with that in mind. You can read this PR, and especially this comment.
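If the setup really must happen again after every container restart, one way to "design around" it (a sketch under that assumption; the directory path is illustrative) is to move the setup into the main container's own command, since the container entrypoint does re-run on every restart:
containers:
- name: myapp-container
  image: busybox:1.28
  # the mkdir re-runs on every container restart, unlike an init container
  command: ['sh', '-c', 'mkdir -p /data/required-dir && echo "App started at $(date)" && tail -f /dev/null']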

External dependency in Pod Readiness and Liveness

I am new to pod health checks with readiness and liveness probes. Recently I have been working on readiness. The scenario is as follows:
The pod is a REST API service that needs to connect to a database and store information in it. So if the REST API service wants to offer its service, it needs to make sure the database connection is successful.
So in our pod's readiness logic we use HTTP GET and check whether the DB connection is established; if it is okay, the HTTP GET returns OK, otherwise readiness fails.
Is the above logic reasonable, or is there another approach for handling this?
Apart from readiness, what about liveness? Do I need to check the DB connection for the liveness check as well?
Any ideas and suggestions are appreciated.
Readiness and liveness probes are mostly for the service you are running inside the container. There could be a scenario where your DB is up but there is an issue with the application; in that case your readiness would still pass just because the DB is running, whereas ideally, if the application is not working, it should stop accepting traffic.
I would recommend using an init container or a lifecycle hook to check the condition of the database first; if it's up, the process moves ahead and your application or deployment comes into the picture.
If the application works well, your readiness and liveness checks will return HTTP OK and the service will start accepting traffic.
init container example
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: busybox
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
  initContainers:
  - name: init-myservice
    image: busybox
    command: ['sh', '-c', 'until nslookup myservice; do echo waiting for myservice; sleep 2; done;']
  - name: init-mydb
    image: busybox
    command: ['sh', '-c', 'until nslookup mydb; do echo waiting for mydb; sleep 2; done;']
Extra Notes
There is actually no need to check the DB's readiness separately.
Since your application will be trying to connect to the database, if the DB is not up your application won't respond with HTTP OK, so it won't become ready and its readiness check will keep failing.
As soon as your database comes up, your application will establish a successful connection, return a 200 response, and readiness will mark the pod ready.
So there is no extra requirement to set up a readiness check for the DB itself and gate the pod start on that.
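For the application's readiness probe itself, a minimal sketch (assuming the application exposes a hypothetical /healthz endpoint on port 8080 that only returns 200 once its own DB connection succeeds) would look like:
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5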

How can a failed Kubernetes Ceph node be deleted automatically?

In an environment with more than one node, using Ceph block volumes in RWO mode, if a node fails (is unreachable and will not come back soon) and the pod is rescheduled to another node, the pod can't start if it has a Ceph block PVC. The reason is that the volume is 'still being used' by the other pod (because the node failed, its resources can't be removed properly).
If I remove the node from the cluster using kubectl delete node dead-node, the pod can start because the resources get removed.
How can I do this automatically? Some possibilities I have thought about are:
Can I set a force detach timeout for the volume?
Set a delete node timeout?
Automatically delete a node with given taints?
I can use the ReadWriteMany mode with other volume types to be able to let the PV be used by more than one pod, but it is not ideal.
You can probably add a sidecar container and tweak the readiness and liveness probes in your pod so that the pod doesn't restart if a Ceph block volume is unreachable for some time by the container that is using it. (There may be other implications for your application, though.)
Something like this:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: ceph
  name: ceph-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
  - name: cephclient
    image: ceph
    volumeMounts:
    - name: ceph
      mountPath: /cephmountpoint
    livenessProbe:
      ... 👈 something
      initialDelaySeconds: 5
      periodSeconds: 3600 👈 make this real long

How to define k8s liveness probe and readiness probe for worker pod

I have a k8s cluster. Our service is queue based. Our pods subscribe to an event queue, fetch events and do tasks. So for this kind of service, how should the k8s liveness probe and readiness probe be defined?
Following is a very brief introduction to these probes:
The liveness probe is how Kubernetes knows whether a workload is healthy. It can be a shell command executed in your container or a simple TCP/HTTP request which should respond positively.
If a liveness check fails after the timeout period specified in the pod config, Kubernetes restarts the workload.
So, if your workload is doing time-consuming processing, you might need to give your liveness probe enough time to make sure your pod is not restarted unduly.
The readiness probe is how the Kubernetes proxy decides whether your workload is ready to receive traffic. Traffic is sent to your pod only if the readiness probe responds positively. So, if your workload needs more time to process a single request and needs other requests to be diverted to other replicas during this time, you might want to give the workload a slightly higher readiness interval.
These probe parameters, combined with the number of replicas, can ensure fast and healthy functioning of your application. It is very important to understand the area each probe covers and the parameters you can tune them with.
Here are some reads:
https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Assuming your problem is that a processing worker consuming queue messages doesn't expose any port to check:
In that case, you can define custom commands for the livenessProbe and readinessProbe. The following is an example from the documentation:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
    readinessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
Also, take into account the time your process needs to become live and ready, and adjust initialDelaySeconds and periodSeconds so the pod is not killed before it is fully loaded.

Pod failure and recovery events

We are listening to multiple mailboxes on a single pod, but if this pod goes down for some reason, we need another pod that is up to take over listening to these mailboxes, in order to keep receiving emails.
I would like to know if it is possible to detect that a pod has gone down, like an event, and trigger a script to perform the above action on the fly.
Approach 1:
Kubernetes lifecycle handler hook
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
      preStop:
        exec:
          command: ["/bin/sh","-c","nginx -s quit; while killall -0 nginx; do sleep 1; done"]
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/
Approach 2:
Write a script which monitors the health check every, say, x seconds; when 3 consecutive health checks fail, Kubernetes deletes the pod. So in your script, if 3 consecutive REST calls for the health check fail, the pod is about to be deleted and you can trigger your event, as sketched below.
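A rough sketch of how Approaches 1 and 2 could be combined inside the pod spec (the image name and the healthcheck/handover scripts are hypothetical placeholders): a liveness probe with failureThreshold: 3 fails the container after 3 consecutive bad checks, and a preStop hook runs a handover step before the container is terminated:
containers:
- name: mailbox-listener
  image: mailbox-listener:latest   # hypothetical image
  livenessProbe:
    exec:
      command: ["/bin/sh", "-c", "/app/healthcheck.sh"]   # hypothetical health script
    periodSeconds: 10
    failureThreshold: 3   # 3 consecutive failures -> container is terminated and restarted
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "/app/handover-mailboxes.sh"]   # hypothetical handover script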
Approach 3:
Maintain 2 replicas => the problem could be two pods processing the same mail; you can avoid this if you use Kafka.