How can a failed Kubernetes Ceph node be deleted automatically? - kubernetes

On an environment with more than one node and using Ceph block volumes in RWO mode, if a node fails (is unreachable and will not come back soon) and the pod is rescheduled to another node, the pod can't start if it has a Ceph block PVC. The reason is that the volume is 'still being used' by the other pod (because as the node failed, its resources can't be removed properly).
If I remove the node from the cluster using kubectl delete node dead-node, the pod can start because the resources get removed.
How can I do this automatically? Some possibilities I have thought about are:
Can I set a force detach timeout for the volume?
Set a delete node timeout?
Automatically delete a node with given taints?
I can use the ReadWriteMany mode with other volume types to be able to let the PV be used by more than one pod, but it is not ideal.

You can probably have a sidecar container and tweak the Readiness and Liveness probes in your pod so that the pod doesn't restart if a Ceph block volume is unreachable for some time by the container that it's using it. (There may be other implications to your application though)
Something like this:
apiVersion: v1
kind: Pod
metadata:
labels:
test: ceph
name: ceph-exec
spec:
containers:
- name: liveness
image: k8s.gcr.io/busybox
args:
- /bin/sh
- -c
- touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
livenessProbe:
exec:
command:
- cat
- /tmp/healthy
initialDelaySeconds: 5
periodSeconds: 5
- name: cephclient
image: ceph
volumeMounts:
- name: ceph
mountPath: /cephmountpoint
livenessProbe:
... 👈 something
initialDelaySeconds: 5
periodSeconds: 3600 👈 make this real long
✌️☮️

Related

Init container not restarting on pod restart

I have an init container that do some stuff that needs for the main container to run correctly, like creating some directories and a liveness probe that may fail if one of these directories were deleted. When the pod is restarted due to fail of liveness probe I expect that init container is also being restarted, but it won't.
This is what kubernetes documentation says about this:
If the Pod restarts, or is restarted, all init containers must execute again.
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
Easiest way to prove this behavior was to use the example of the pod from k8s documentation, add a liveness probe that always fails and expect that init container to be restarted, but again, it is not behaving as expected.
This is the example I'm working with:
apiVersion: v1
kind: Pod
metadata:
name: myapp-pod
labels:
app: myapp
spec:
restartPolicy: Always
containers:
- name: myapp-container
image: busybox:1.28
command: ['sh', '-c', 'echo "App started at $(date)" && tail -f /dev/null']
livenessProbe:
exec:
command:
- sh
- -c
- exit 1
initialDelaySeconds: 1
periodSeconds: 1
initContainers:
- name: myapp-init
image: busybox:1.28
command: ['/bin/sh', '-c', 'sleep 5 && echo "Init container started at $(date)"']
Sleep and date command are there to confirm that init container was restarted.
The pod is being restarted:
NAME READY STATUS RESTARTS AGE
pod/myapp-pod 1/1 Running 4 2m57s
But from logs it's clear that init container don't:
$ k logs pod/myapp-pod myapp-init
Init container started at Thu Jun 16 12:12:03 UTC 2022
$ k logs pod/myapp-pod myapp-container
App started at Thu Jun 16 12:14:20 UTC 2022
I checked it on both v1.19.5 and v1.24.0 kubernetes servers.
The question is how to force the init container to restart on pod restart.
The restart number refers to container restarts, not pod restarts.
init container need to run only once in a pos lifetime, and you need to design your containers like that, you can read this PR, and especially this comment

LivenessProbe command for a background process

What is an appropriate Kubernetes livenessProbe command for a background process?
We have a NodeJS process that consumes messages off an SQS queue. Since it's a background job we don't expose any HTTP endpoints and so a liveness command seems to be the more appropriate way to do the liveness check. What would a "good enough" command setup look like that actually checks the process is alive and running properly? Should the NodeJS process touch a file to update its editted time and the liveness check validate that? Examples I've seen online seem disconnected to the actual process, e.g. they check a file exists.
You could use liveness using exec command.
Here is an example:
apiVersion: v1
kind: Pod
metadata:
labels:
test: liveness
name: liveness-exec
spec:
containers:
- name: liveness
image: k8s.gcr.io/busybox
args:
- /bin/sh
- -c
- touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
livenessProbe:
exec:
command:
- cat
- /tmp/healthy
initialDelaySeconds: 5
periodSeconds: 5
To perform a probe, the kubelet executes the command cat /tmp/healthy in the target container. If the command succeeds, it returns 0, and the kubelet considers the container to be alive and healthy. If the command returns a non-zero value, the kubelet kills the container and restarts it.

How can I ignore failure of a container in multi-container pod?

I have a multi-container application: app + sidecar. Both containers suppose to be alive all the time but sidecar is not really that important.
Sidecar depends on external resource, if this resource is not available - sidecar crashes. And it takes entire pod down. Kubernetes tries to recreate pod and fails because sidecar now won't start.
But from my business logic perspective - crash of sidecar is absolutely normal. Having that sidecar is nice but not mandatory.
I don't want sidecar to take main app with it when it crashes.
What would be best Kubernetes-native way to achieve that?
Is it possible to tell kubernetes ignore failure of sidecar as a "false positive" event which is absolutely fine?
I can't find anything in pod specification what controls that behaviour.
My yaml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: myapp
spec:
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
template:
metadata:
labels:
app: myapp
spec:
volumes:
- name: logs-dir
emptyDir: {}
containers:
- name: myapp
image: ${IMAGE}
ports:
- containerPort: 9009
volumeMounts:
- name: logs-dir
mountPath: /usr/src/app/logs
resources:
limits:
cpu: "1"
memory: "512Mi"
readinessProbe:
initialDelaySeconds: 60
failureThreshold: 8
timeoutSeconds: 1
periodSeconds: 8
httpGet:
scheme: HTTP
path: /myapp/v1/admin-service/git-info
port: 9009
- name: graylog-sidecar
image: digiapulssi/graylog-sidecar:latest
volumeMounts:
- name: logs-dir
mountPath: /log
env:
- name: GS_TAGS
value: "[\"myapp\"]"
- name: GS_NODE_ID
value: "nodeid"
- name: GS_SERVER_URL
value: "${GRAYLOG_URL}"
- name: GS_LIST_LOG_FILES
value: "[\"/ctwf\"]"
- name: GS_UPDATE_INTERVAL
value: "10"
resources:
limits:
memory: "128Mi"
cpu: "0.1"
Warning: the answer that was flagged as "correct" does not appear to work.
Adding a Liveness Probe to the application container and setting Restart Policy to "Never", will lead to the Pod being stopped and never restarted in a scenario where the sidecar container has stopped and the application container has failed its Liveness Probe. This is a problem, since you DO want the restarts for the application container.
The problem should be solved as follows:
Tweak your sidecar container in the startup command to keep the main process running on failure of the application process. This could be done with an extra piece of scripting, e.g. by appending | tail -f /dev/null to the startup command.
Adding a Liveness Probe to the application container is in general a good idea. Keep in mind though that it only protects you against a scenario where your application process keeps running without your application being in a correct state. It will certainly not overwrite the restartPolicy:
livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a Container does not provide a liveness probe, the default state is Success.
Container Probes
A custom livenessProbe should help but for your scenario I would use the liveness for your main app container which is the myapp. Considering the fact that you don't care about the sidecare (as mentioned). I would set the pod restartPolicy to Never and then define a custom livelinessProbe for your main myapp. In this way the Pod will never restart doesn't matter which container is failed but when your myapp container's liveliness fails kubelet will restart the container! Ref below, link
Pod is running and has two Containers. Container 1 exits with failure.
Log failure event. If restartPolicy is: Always: Restart Container; Pod
phase stays Running. OnFailure: Restart Container; Pod phase stays
Running. Never: Do not restart Container; Pod phase stays Running.
so the updated (pseudo) yaml should look like below
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: myapp
spec:
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
template:
...
spec:
...
restartPolicy: Never
containers:
- name: myapp
...
livenessProbe:
exec:
command:
- /bin/sh
- -c
- {{ your custom liveliness check command goes }}
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
...
- name: graylog-sidecar
...
Note: since I don't know your application therefore I cannot write the command but for my jboss server I use this (an example for you)
livenessProbe:
exec:
command:
- /bin/sh
- -c
- /opt/jboss/wildfly/bin/jboss-cli.sh --connect --commands="read-attribute
server-state"
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
The best solution which works for me is not to fail inside a sidecar container, but just log an error and rerun.
#!/usr/bin/env bash
set -e
# do some stuff which can fail on start
set +e # needed to not exit if command fails
while ! command; do
echo "command failed - rerun"
done
This will always rerun the command if it fails, but exit if the command finished successfully.
You can define a custom livenessProbe for your sidecar to have greater failureThreshold / periodSeconds to accommodate what is considered acceptable failure rate in your environment, or simply ignore all failure.
Docs:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.15/#probe-v1-core
kubectl explain deployment.spec.template.spec.containers.livenessProbe

Is there a way to do a load balancing between pod in multiple nodes?

I have a kubernetes cluster deployed with rke witch is composed of 3 nodes in 3 different servers and in those server there is 1 pod which is running yatsukino/healthereum which is a personal modification of ethereum/client-go:stable .
The problem is that I'm not understanding how to add an external ip to send request to the pods witch are
My pods could be in 3 states:
they syncing the ethereum blockchain
they restarted because of a sync problem
they are sync and everything is fine
I don't want my load balancer to transfer requests to the 2 first states, only the third point consider my pod as up to date.
I've been searching in the kubernetes doc but (maybe because a miss understanding) I only find load balancing for pods inside a unique node.
Here is my deployment file:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: goerli
name: goerli-deploy
spec:
replicas: 3
selector:
matchLabels:
app: goerli
template:
metadata:
labels:
app: goerli
spec:
containers:
- image: yatsukino/healthereum
name: goerli-geth
args: ["--goerli", "--datadir", "/app", "--ipcpath", "/root/.ethereum/geth.ipc"]
env:
- name: LASTBLOCK
value: "0"
- name: FAILCOUNTER
value: "0"
ports:
- containerPort: 30303
name: geth
- containerPort: 8545
name: console
livenessProbe:
exec:
command:
- /bin/sh
- /app/health.sh
initialDelaySeconds: 20
periodSeconds: 60
volumeMounts:
- name: app
mountPath: /app
initContainers:
- name: healthcheck
image: ethereum/client-go:stable
command: ["/bin/sh", "-c", "wget -O /app/health.sh http://my-bash-script && chmod 544 /app/health.sh"]
volumeMounts:
- name: app
mountPath: "/app"
restartPolicy: Always
volumes:
- name: app
hostPath:
path: /app/
The answers above explains the concepts, but about your questions anout services and external ip; you must declare the service, example;
apiVersion: v1
kind: Service
metadata:
name: goerli
spec:
selector:
app: goerli
ports:
- port: 8545
type: LoadBalancer
The type: LoadBalancer will assign an external address for in public cloud or if you use something like metallb. Check your address with kubectl get svc goerli. If the external address is "pending" you have a problem...
If this is your own setup you can use externalIPs to assign your own external ip;
apiVersion: v1
kind: Service
metadata:
name: goerli
spec:
selector:
app: goerli
ports:
- port: 8545
externalIPs:
- 222.0.0.30
The externalIPs can be used from outside the cluster but you must route traffic to any node yourself, for example;
ip route add 222.0.0.30/32 \
nexthop via 192.168.0.1 \
nexthop via 192.168.0.2 \
nexthop via 192.168.0.3
Assuming yous k8s nodes have ip 192.168.0.x. This will setup ECMP routes to your nodes. When you make a request from outside the cluster to 222.0.0.30:8545 k8s will load-balance between your ready PODs.
For loadbalancing and exposing your pods, you can use https://kubernetes.io/docs/concepts/services-networking/service/
and for checking when a pod is ready, you can use tweak your liveness and readiness probes as explained https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
for probes you might want to consider exec actions like execution a script that checks what is required and returning 0 or 1 dependent on status.
When a container is started, Kubernetes can be configured to wait for a configurable
amount of time to pass before performing the first readiness check. After that, it
invokes the probe periodically and acts based on the result of the readiness probe. If a
pod reports that it’s not ready, it’s removed from the service. If the pod then becomes
ready again, it’s re-added.
Unlike liveness probes, if a container fails the readiness check, it won’t be killed or
restarted. This is an important distinction between liveness and readiness probes.
Liveness probes keep pods healthy by killing off unhealthy containers and replacing
them with new, healthy ones, whereas readiness probes make sure that only pods that
are ready to serve requests receive them. This is mostly necessary during container
start up, but it’s also useful after the container has been running for a while.
I think you can use probe for your goal

Monitor that service to pod iptables mappings are current

The problem occurred on kubernetes 1.2.3 but we are running 1.3.3 now.
We have had 2 situations where kube-proxy was running but was wedged and not updating iptables with the current state of services to pods. This led to a situation where traffic destined for serviceA got routed to pods that are part of serviceB. So we have improved our monitoring after the fact to query /healthz on the kube-proxy. I'm wondering if I should be monitoring anything beyond the existence of the kube-proxy process and that it's returning 200 from /healthz.
Are you monitoring anything additional to ensure that service to pod mappings are current. I realize that as the service landscape is changing we can have a period of time where all hosts may not be accurate but i'm only interested in catching the scenario where say 3+ minutes have gone by and iptables is not current on every node in the cluster which would seem to indicate to me that something is broken somewhere.
I had thought about doing something like having a canary service where the backing deployment get's redeployed every 5 minutes and then i verify from each node that I can get to all of the backing pods via the service cluster ip.
I'm not sure if this is the right approach. It would seem like it could catch the problem we had earlier but I'm also thinking some other simpler way may exist like just checking the time stamp on when iptables was last updated?
Thanks!
You could run kube-proxy inside a pod (by dropping a manifest inside /etc/kubernetes/manifests on each node), benefit from the health checking / liveness probes offered by Kubernetes, and let it take care of restarting the service for you in case of trouble.
Setting a very low threshold on the liveness probe will trigger a restart as soon as the /healthz endpoint takes too long to respond. It won't guarantee you that IPtables rules are always up-to-date, but will ensure that the kube-proxy is always healthy (which in turn will ensure IPtables rules are consistent)
Example:
Check the healthz endpoint of kube-proxy every 10s. Restart the pod if it doesn't respond in less than 1s:
apiVersion: v1
kind: Pod
metadata:
name: kube-proxy
namespace: kube-system
spec:
hostNetwork: true
containers:
- name: kube-proxy
image: gcr.io/google_containers/hyperkube:v1.3.4
command:
- /hyperkube
- proxy
- --master=https://master.kubernetes.io:6443
- --kubeconfig=/conf/kubeconfig
- --proxy-mode=iptables
livenessProbe:
httpGet:
path: /healthz
port: 10249
timeoutSeconds: 1
periodSeconds: 10
failureThreshold: 1
securityContext:
privileged: true
volumeMounts:
- mountPath: /conf/kubeconfig
name: kubeconfig
readOnly: true
- mountPath: /ssl/kubernetes
name: ssl-certs-kubernetes
readOnly: true
- mountPath: /etc/ssl/certs
name: ssl-certs-host
readOnly: true
volumes:
- hostPath:
path: /etc/kubernetes/proxy-kubeconfig.yml
name: kubeconfig
- hostPath:
path: /etc/kubernetes/ssl
name: ssl-certs-kubernetes
- hostPath:
path: /usr/share/ca-certificates
name: ssl-certs-host