Is it possible to fake a container to always be ready/live in Kubernetes, so that Kubernetes thinks the container is live and doesn't try to kill/recreate it? I am looking for a quick and hacky solution, preferably.
Liveness and readiness probes are not required by Kubernetes controllers; you can simply remove them and your containers will always be considered live/ready.
If you want the hacky approach anyway, use an exec probe (instead of httpGet) with a dummy command that always returns 0 as its exit code. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        livenessProbe:
          exec:
            command:
            - touch
            - /tmp/healthy
        readinessProbe:
          exec:
            command:
            - touch
            - /tmp/healthy
I'd like to add some background context about why and how this can be useful for real-world applications. Pointing out why this question is useful also lets me give a better answer.
First off, why might you want to implement a fake startup / readiness / liveness probe?
Let's say you have a custom containerized application. You're in a rush, so you go live without any liveness or readiness probes.
Scenario 1:
You have a deployment with 1 replica, but you notice that whenever you update your app (push a new version via a rolling update), your monitoring platform occasionally reports 400, 500, and timeout errors during the rolling update. After the update you're back at 1 replica and the errors go away.
Scenario 2:
You have enough traffic to warrant autoscaling and multiple replicas. You consistently get 1-3% errors, and 97% success.
Why are you getting errors in both scenarios?
Let's say it takes 1 minute for your app to finish booting up and become ready to receive traffic. If you don't have readiness probes, newly spawned instances of your container will receive traffic before they've finished booting, and those new instances are probably what's causing the temporary 400, 500, and timeout errors.
How to fix:
You can fix the occasional errors in Scenarios 1 and 2 by adding a readiness probe with an initialDelaySeconds (or a startup probe), basically something that waits long enough for your containerized app to finish booting up.
The correct, best-practice thing to do is to write a /health endpoint that properly reflects the health of your app. But writing an accurate health-check endpoint can take time. In many cases you can get the same end result (make the errors go away) without the effort of creating a /health endpoint by faking it: just add a wait period before traffic is sent to your app. (Again, /health is the best practice, but for the "ain't nobody got time for that" crowd, faking it can be a good-enough stopgap solution.)
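For illustration, a minimal sketch of the initialDelaySeconds approach could look like the following (the 60-second delay and port 80 are placeholders; tune them to your app's actual boot time and listening port):

readinessProbe:
  tcpSocket:
    port: 80                # placeholder: whatever port your app listens on
  initialDelaySeconds: 60   # assumed boot time; the pod receives no traffic for the first 60s
  periodSeconds: 10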
Below is a better version of a fake readiness probe, and here's why it's better:
exec-based probes don't work in 100% of cases: they assume the probe command (and, in many examples, a shell) exists in the container. There are scenarios where hardened containers don't have a shell or even the touch command.
httpGet, tcpSocket, and grpc probes are performed from the perspective of the node running kubelet (the Kubernetes agent), so they don't depend on the software installed in the container and should work on hardened containers that are missing things like the touch command, or even on a scratch container. (In other words, this solution should work in 100% of cases vs. 99% of the time.)
An alternative to a startup probe is to use initialDelaySeconds with a readiness probe, but that creates unnecessary traffic compared to a startup probe that runs only once. (Again, this isn't the best solution in terms of accuracy or fastest possible startup time, but it's often a good-enough solution that's very practical.)
Run my example in a cluster and you'll see that the pod isn't ready for 60 seconds and then becomes ready.
Since this is a fake probe, there's no point in using readiness/liveness probes; just go with a startup probe, as that cuts down on unnecessary traffic.
In the absence of a readiness probe, the startup probe has the effect of a readiness probe (it blocks the pod from becoming ready until the probe passes, but only during initial startup).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: useful-hack
  labels:
    app: always-true-tcp-probe
spec:
  replicas: 1
  strategy:
    type: Recreate    # dev env fast feedback loop optimized value, don't use in prod
  selector:
    matchLabels:
      app: always-true-tcp-probe
  template:
    metadata:
      labels:
        app: always-true-tcp-probe
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        startupProbe:
          tcpSocket:
            host: 127.0.0.1    # since kubelet does the probes, this is the node's localhost, not the pod's localhost
            port: 10250        # worker node kubelet listening port
          successThreshold: 1
          failureThreshold: 2
          initialDelaySeconds: 60    # wait 60 sec before starting the probe
Additional Notes:
The above example keeps traffic within the LAN, which has several benefits:
It'll work in internet-disconnected environments.
It won't incur egress network charges.
The example below will only work in internet-connected environments. It isn't too bad for a startup probe, but it would be a bad idea for a readiness/liveness probe as it could clog the NAT gateway's bandwidth; I'm only including it to point out something of interest.
startupProbe:
  httpGet:
    host: google.com    # defaults to pod IP
    path: /
    port: 80
    scheme: HTTP
  successThreshold: 1
  failureThreshold: 2
  initialDelaySeconds: 60
---
startupProbe:
  tcpSocket:
    host: 1.1.1.1    # Cloudflare
    port: 53         # DNS
  successThreshold: 1
  failureThreshold: 2
  initialDelaySeconds: 60
The interesting bit:
Remember I said "httpGet, tcpSocket, and grpc probes are done from the perspective of the node running kubelet (the Kubernetes agent)." Kubelet runs on the worker node's host OS, which is configured for upstream DNS; in other words, it doesn't have access to the in-cluster DNS entries that kube-dns is aware of. So you can't specify Kubernetes Service names in these probes.
Additionally, Kubernetes Service IPs won't work for the probes either, since they're VIPs (virtual IPs) that only* exist in iptables (*in most cases).
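For example, a probe like the following (my-service.default.svc.cluster.local is a hypothetical in-cluster name) would fail, because kubelet resolves the host through the node's upstream DNS, not through cluster DNS:

startupProbe:
  httpGet:
    host: my-service.default.svc.cluster.local   # hypothetical Service DNS name: kubelet can't resolve this, so the probe fails
    path: /
    port: 80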
Related
I am trying to achieve zero-downtime deployment on k8s. My deployment has one replica. The pod probes look like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: ${KUBE_NAMESPACE}
spec:
  selector:
    matchLabels:
      app: app
  replicas: 1
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - name: app-container
        imagePullPolicy: IfNotPresent
        image: ${DOCKER_IMAGE}:${IMAGE_TAG}
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
      terminationGracePeriodSeconds: 130
However, every time after kubectl rollout status returns and reports that the rollout has finished, I experience a short period of Bad Gateway errors.
Then I added a test: I let /health return 500 during preStop and wait at least 20 seconds before actually stopping the pod.
# If the app detects that the /tmp/prestop file exists, it will return 500.
lifecycle:
  preStop:
    exec:
      command: ["/bin/bash", "-c", "touch /tmp/prestop && sleep 20"]
Then I found that after Kubernetes starts stopping the pod, traffic can still flow to the old pod (if I visit /health, I get a 500 result).
So it looks like the load balancer decides which pods can be used solely from the probe results. Since the probes run periodically, there is always a small window in which the pod has stopped but the load balancer doesn't know it yet and can still direct traffic to it, so the user experiences downtime.
So my question is: in order to have a zero-downtime deployment, is it necessary to let the probes know the pod is stopping before actually stopping it? Is this right, or am I doing something wrong?
After digging around Google and doing some tests, I found that it's not necessary to manually reply 500 to the probes during preStop.
According to the documentation:
At the same time as the kubelet is starting graceful shutdown, the control plane removes that shutting-down Pod from Endpoints (and, if enabled, EndpointSlice) objects where these represent a Service with a configured selector. ReplicaSets and other workload resources no longer treat the shutting-down Pod as a valid, in-service replica. Pods that shut down slowly cannot continue to serve traffic as load balancers (like the service proxy) remove the Pod from the list of endpoints as soon as the termination grace period begins.
The pod won't get traffic after shutdown starts. But I also found an issue that said there was indeed a delay between starting to shut down a pod and actually removing it from the endpoints.
So instead of returning 500 to the probes in preStop, I simply sleep 60 seconds in preStop, and at the same time let the /health check return 200 with a status field telling whether the node is in "running" or "prestop" state. Then I did a rollout and got the following result:
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"running"}' at 1612717529.114602
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"running"}' at 1612717530.59488
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"running"}' at 1612717532.094305
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"running"}' at 1612717533.5859041
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"running"}' at 1612717535.086944
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"running"}' at 1612717536.757241
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"running"}' at 1612717538.57626
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"prestop"}' at 1612717540.3773062
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"prestop"}' at 1612717543.2204192
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"prestop"}' at 1612717544.7196548
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"prestop"}' at 1612717546.550169
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"prestop"}' at 1612717548.01408
b'{"node_id":"a5c387f5df30","node_start_at":1612706851,"status":"prestop"}' at 1612717549.471266
b'{"node_id":"17733ca118f4","node_start_at":1612717537,"status":"running"}' at 1612717551.387528
b'{"node_id":"17733ca118f4","node_start_at":1612717537,"status":"running"}' at 1612717553.49984
b'{"node_id":"17733ca118f4","node_start_at":1612717537,"status":"running"}' at 1612717555.404394
b'{"node_id":"17733ca118f4","node_start_at":1612717537,"status":"running"}' at 1612717558.1528351
b'{"node_id":"17733ca118f4","node_start_at":1612717537,"status":"running"}' at 1612717559.64011
b'{"node_id":"17733ca118f4","node_start_at":1612717537,"status":"running"}' at 1612717561.294955
b'{"node_id":"17733ca118f4","node_start_at":1612717537,"status":"running"}' at 1612717563.366436
b'{"node_id":"17733ca118f4","node_start_at":1612717537,"status":"running"}' at 1612717564.972768
The a5c387f5df30 node still got traffic after the preStop hook was called. After around 10 seconds it stopped receiving traffic. So it's not related to anything I did in preStop; it's purely a delay.
I did this test on AWS EKS with Fargate. I don't know what the situation is on other Kubernetes platforms.
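For reference, a minimal sketch of the preStop-sleep approach described above (the 60-second sleep mirrors the test; the container name and image come from the question's manifest, and terminationGracePeriodSeconds must exceed the sleep plus your app's shutdown time):

spec:
  terminationGracePeriodSeconds: 90   # must be longer than the preStop sleep plus app shutdown time
  containers:
  - name: app-container
    image: ${DOCKER_IMAGE}:${IMAGE_TAG}
    lifecycle:
      preStop:
        exec:
          # Keep the old pod serving while the control plane removes it from the
          # Service endpoints, absorbing the observed ~10s removal delay.
          command: ["/bin/sh", "-c", "sleep 60"]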
It all depends on what your app does when it receives the SIGTERM signal from Kubernetes.
In order to gracefully shut down your app, you should listen for SIGTERM and drain all your connections; in addition, you should start replying with 500 from your readiness endpoint, which will make Kubernetes stop sending you new requests.
There are many articles on this topic which you can find on Google:
https://www.driftrock.com/blog/kubernetes-zero-downtime-rolling-updates
https://learnk8s.io/graceful-shutdown
I have been trying to debug a very odd delay in my K8S deployments. I have tracked it down to the simple reproduction below. What appears to happen is that if I set an initialDelaySeconds on a startup probe, or leave it at 0 and have a single failure, then the probe doesn't get run again for a while and the pod ends up with at least a 1-1.5 minute delay getting into the Ready:true state.
I am running locally with Ubuntu 18.04 and microk8s v1.19.3 with the following versions:
kubelet: v1.19.3-34+a56971609ff35a
kube-proxy: v1.19.3-34+a56971609ff35a
containerd://1.3.7
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: microbot
  name: microbot
spec:
  replicas: 1
  selector:
    matchLabels:
      app: microbot
  strategy: {}
  template:
    metadata:
      labels:
        app: microbot
    spec:
      containers:
      - image: cdkbot/microbot-amd64
        name: microbot
        command: ["/bin/sh"]
        args: ["-c", "sleep 3; /start_nginx.sh"]
        #args: ["-c", "/start_nginx.sh"]
        ports:
        - containerPort: 80
        startupProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 0    # 5 also has same issue
          periodSeconds: 1
          failureThreshold: 10
          successThreshold: 1
        ##livenessProbe:
        ##  httpGet:
        ##    path: /
        ##    port: 80
        ##  initialDelaySeconds: 0
        ##  periodSeconds: 10
        ##  failureThreshold: 1
        resources: {}
      restartPolicy: Always
      serviceAccountName: ""
status: {}
---
apiVersion: v1
kind: Service
metadata:
  name: microbot
  labels:
    app: microbot
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: microbot
The issue is that if I have any delay in the startupProbe, or if there is an initial failure, the pod gets into the Initialized:true state but has Ready:False and ContainersReady:False. It will not change from this state for 1-1.5 minutes. I haven't found a pattern to the settings.
I left the commented-out settings in as well so you can see what I am trying to get to. What I have is a container starting up that runs a service which takes a few seconds to start. I want to tell the startupProbe to wait a little bit and then check every second to see if we are ready to go. The configuration seems to work, but there is a baked-in delay that I can't track down. Even after the startup probe is passing, the pod does not transition to Ready for more than a minute.
Is there some setting elsewhere in k8s that delays the amount of time before a Pod can move into Ready if it isn't Ready initially?
Any ideas are greatly appreciated.
Actually, I made a mistake in the comments: you can use initialDelaySeconds in a startupProbe, but you should rather use failureThreshold and periodSeconds instead.
As mentioned here
Kubernetes Probes
Kubernetes supports readiness and liveness probes for versions ≤ 1.15. Startup probes were added in 1.16 as an alpha feature and graduated to beta in 1.18 (WARNING: 1.16 deprecated several Kubernetes APIs. Use this migration guide to check for compatibility).
All the probes have the following parameters:
- initialDelaySeconds: number of seconds to wait before initiating liveness or readiness probes
- periodSeconds: how often to check the probe
- timeoutSeconds: number of seconds before marking the probe as timing out (failing the health check)
- successThreshold: minimum number of consecutive successful checks for the probe to pass
- failureThreshold: number of retries before marking the probe as failed. For liveness probes, this will lead to the pod restarting. For readiness probes, it will mark the pod as unready.
So why should you use failureThreshold and periodSeconds?
Consider an application that occasionally needs to download large amounts of data or do an expensive operation at the start of the process. Since initialDelaySeconds is a static number, we are forced to always assume the worst-case scenario (or extend the failureThreshold, which may affect long-running behavior) and wait for a long time even when that application does not need to carry out long-running initialization steps. With startup probes, we can instead configure failureThreshold and periodSeconds to model this uncertainty better. For example, setting failureThreshold to 15 and periodSeconds to 5 means the application will get 15 x 5 = 75s to start up before it fails.
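As a sketch, the 15 x 5 configuration described in that passage might look like this (the path and port are placeholders):

startupProbe:
  httpGet:
    path: /healthz        # placeholder health endpoint
    port: 8080            # placeholder port
  periodSeconds: 5
  failureThreshold: 15    # 15 failures x 5s period = up to 75s allowed for startup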
Additionally, if you need more information, take a look at this article on Medium.
Quoted from the Kubernetes documentation about protecting slow-starting containers with startup probes:
Sometimes, you have to deal with legacy applications that might require additional startup time on their first initialization. In such cases, it can be tricky to set up liveness probe parameters without compromising the fast response to deadlocks that motivated such a probe. The trick is to set up a startup probe with the same command, HTTP or TCP check, with a failureThreshold * periodSeconds long enough to cover the worst-case startup time.
So, the previous example would become:
ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 1
  periodSeconds: 10

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10
Thanks to the startup probe, the application will have a maximum of 5 minutes (30 * 10 = 300s) to finish its startup. Once the startup probe has succeeded once, the liveness probe takes over to provide a fast response to container deadlocks. If the startup probe never succeeds, the container is killed after 300s and subject to the pod's restartPolicy.
I have a k8s cluster. Our service is queue-based: our pods subscribe to an event queue, fetch events, and do tasks. For this kind of service, how should we define the Kubernetes liveness and readiness probes?
Following is a very brief introduction to these probes:
A liveness probe is how Kubernetes knows whether a workload is healthy. It can be a shell command executed in your container or a simple TCP/HTTP request that should respond positively.
If a liveness check fails after the timeout period specified in the pod config, Kubernetes will restart the workload.
So, if your workload runs time-consuming processes, you might need to give your liveness probe enough time to make sure your pod is not restarted unduly.
A readiness probe is how the Kubernetes proxy decides whether your workload is ready to receive traffic. Traffic will be sent to your pod only if the readiness probe responds positively. So, if your workload needs more time to process a single request and needs other requests diverted to other replicas during that time, you might want to give the workload a slightly higher readiness interval.
These probe parameters, combined with the number of replicas, can ensure fast and healthy functioning of your application. It is very important to understand the area each probe covers and the parameters you can tune them by.
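As an illustration for a queue-based worker that exposes no HTTP port, one possible sketch (the heartbeat file path and the 2-minute staleness window are assumptions, and your worker would have to touch the file itself, e.g. after each processed event) is an exec liveness probe that fails when a heartbeat file goes stale:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # Succeed only if /tmp/heartbeat was modified within the last 2 minutes.
    # The worker is expected to `touch /tmp/heartbeat` periodically (assumption).
    - test -n "$(find /tmp/heartbeat -mmin -2 2>/dev/null)"
  initialDelaySeconds: 30
  periodSeconds: 30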
Here are some reads:
https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Assuming your problem is that, because it's a processing worker consuming queue messages, it doesn't expose any port to check:
In that case, you can define a custom command for the livenessProbe and readinessProbe. The following is an example from the documentation:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
    readinessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
Also, keep in mind the time your process takes to become live and ready, and adjust initialDelaySeconds and periodSeconds so the pod isn't killed before it is fully loaded.
I was testing my Kubernetes services recently and found them very unreliable. Here is the situation:
1. The test service 'A', which receives HTTP requests on port 80, has five pods deployed on three nodes.
2. An nginx ingress was set up to route outside traffic to service 'A'.
3. The ingress was configured like this:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-A
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "1s"
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout invalid_header http_502 http_503 http_504"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "2"
spec:
  rules:
  - host: <test-url>
    http:
      paths:
      - path: /
        backend:
          serviceName: A
          servicePort: 80
http_load was started on a client host and kept sending requests to the ingress nginx at a rate of 1000 per second. All the requests were routed to service 'A' in k8s and everything went well.
When I restarted one of the nodes manually, things went wrong:
In the next 3 minutes, about 20% of requests timed out, which is unacceptable in a production environment.
I don't know why k8s reacts so slowly, and is there a way to solve this problem?
You can speed up that fail-over process by configuring liveness and readiness probes in the pods' spec:
Container probes
...
livenessProbe: Indicates whether the Container is running. If the liveness probe fails, the kubelet kills the Container, and the Container is subjected to its restart policy. If a Container does not provide a liveness probe, the default state is Success.
readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod. The default state of readiness before the initial delay is Failure. If a Container does not provide a readiness probe, the default state is Success.
Liveness probe example:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
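Since faster failover in this scenario mostly comes from the readiness probe removing the pod from the Service endpoints, a corresponding readiness probe sketch (the /healthz path is a placeholder for whatever health endpoint service 'A' exposes) might be:

readinessProbe:
  httpGet:
    path: /healthz        # placeholder health endpoint for service 'A'
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2     # marked unready (and dropped from endpoints) after ~10s of failed checks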
Thanks for #VAS's answer!
Liveness probe is a way to solve this problem.
But I finally figured out that what I wanted was passive health checking, which k8s doesn't support.
And I solved this problem by introducing Istio into my cluster.
We are running a workload against a cluster hosting 2 instances of a small (3-container) pod, accessing the pod using a service with NodePort. If we stop a pod and the RC starts a new one, our constant (low-volume) workload has numerous failures (Rational Performance Tester, an HTTP test hitting the service on the master, but likely the same if it were hitting either minion; the master also runs a minion). Likewise, if we just add a pod with kubectl scale, we also get errors. If we then take down that pod (the RC doesn't start a new one because we had one more than needed due to the scale), there are no errors. It seems the service starts sending work to the new pod as soon as the kubelet has done its thing, even though the containers are not up yet. Thus, any time a pod is started, it starts receiving work a little too soon (after the kubelet has done its work, but before all containers are ready). Is there a way to guarantee that the service will not route to this pod until all containers are up? Barring that, is there some way to say "wait n seconds" before sending traffic to this pod? I may be wrong, but the behavior seems to suggest this scenario.
This is precisely what the readinessProbe option is :)
It's documented more here and here, and is part of the container definition in a pod specification.
For example, you might use a pod specification like the one below to ensure that your nginx pod won't be marked as ready (and thus won't have traffic sent to it) until it responds to an HTTP request for /index.html:
apiVersion: v1
kind: ReplicationController
metadata:
  name: my-nginx
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /index.html
            port: 80
          initialDelaySeconds: 10
          timeoutSeconds: 5