Is it possible to specify a delay for pod restart when Kubernetes liveness probe fails? - kubernetes

Got a simple REST API server built with python gunicorn, which runs multiple threads to accept requests. After running for some time, some of these threads crash. Got a script to detect the number of dead threads (using log files). Once this number crosses some threshold, we want to restart gunicorn. This script is configured to be used as liveness probe.
The script works fine and restarts the pod as expected. But there are a few live threads that are still processing requests. Also, gunicorn keeps a backlog queue of accepted requests that it cannot process yet, since other requests are processing. Is there a way to specify a delay for the pod restart so the other running threads and the backlog requests have some time to finish processing?

You can use a preStop hook; the official Kubernetes documentation on container lifecycle hooks describes how to define one.
You can also use terminationGracePeriodSeconds to allow graceful termination of the pod.
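As a rough sketch of how the two settings combine (the pod name, image, and durations below are assumptions, not values from the question): when the liveness probe fails, the kubelet runs the preStop hook before it sends SIGTERM, so a sleep in the hook gives in-flight requests some time to finish.

```yaml
# Sketch only: the name, image, and durations are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gunicorn-api                        # hypothetical name
spec:
  terminationGracePeriodSeconds: 60         # upper bound for preStop plus the app's own shutdown
  containers:
  - name: api
    image: example.org/gunicorn-api:latest  # placeholder image
    lifecycle:
      preStop:
        exec:
          # Runs before the container receives SIGTERM; the image must ship `sh` and `sleep`.
          command: ["sh", "-c", "sleep 20"]
```

The grace period should stay larger than the sleep, because the preStop hook counts against it; otherwise the container is killed before gunicorn gets a chance to shut down on its own.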

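You can configure graceful pod termination with terminationGracePeriodSeconds:
apiVersion: apps/v1   # extensions/v1beta1 is no longer served for Deployments as of Kubernetes 1.16
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:            # required by apps/v1; the label is illustrative
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: test
        image: ...
      terminationGracePeriodSeconds: 60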

Related

Does Kubernetes support green-blue deployment?

I would like to ask about the mechanism for stopping pods in Kubernetes.
I read https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods before asking this question.
Suppose we have an application that supports graceful shutdown
(for example, a simple HTTP server in Go: https://play.golang.org/p/5tmkPPMiSSt).
The server has two endpoints:
/fast, which always returns HTTP status 200.
/slow, which waits 10 seconds and then returns HTTP status 200.
There is a Deployment/Service resource with the following configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app/name: test
  template:
    metadata:
      labels:
        app/name: test
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: service
        image: host.org/images/grace:v0.1
        livenessProbe:
          httpGet:
            path: /health
            port: 10002
          failureThreshold: 1
          initialDelaySeconds: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 10002
          failureThreshold: 1
          initialDelaySeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: test
spec:
  type: NodePort
  ports:
  - name: http
    port: 10002
    targetPort: 10002
  selector:
    app/name: test
To make sure the pods are deleted gracefully, I ran two tests.
First test (slow endpoint) flow:
Create the deployment with replicas set to 1.
Wait for pod readiness.
Send a request to the /slow endpoint (curl http://ip-of-some-node:nodePort/slow) and delete the pod (almost simultaneously, about 1 second apart).
Expected:
The pod must not terminate before the HTTP server has completed my request.
Got:
Yes, the HTTP server processes the request for 10 seconds and returns the response.
(If we pass the --grace-period=1 option to kubectl, curl prints: curl: (52) Empty reply from server)
Everything works as expected.
Second test (fast endpoint) flow:
Create the deployment with replicas set to 10.
Wait for pod readiness.
Start wrk with the "Connection: close" header.
Randomly delete one or two pods (kubectl delete pod/xxx).
Expected:
No socket errors.
Got:
$ wrk -d 2m --header "Connection: Close" http://ip-of-some-node:nodePort/fast
Running 2m test @ http://ip-of-some-node:nodePort/fast
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   122.35ms  177.30ms   1.98s    91.33%
    Req/Sec    66.98     33.93   160.00     65.83%
  15890 requests in 2.00m, 1.83MB read
  Socket errors: connect 0, read 15, write 0, timeout 0
Requests/sec:    132.34
Transfer/sec:     15.64KB
There are 15 socket errors on read; that is, some pods were (maybe) disconnected from the service before all requests were processed.
The problem also appears when a new deployment version is applied, on scale-down, and on rollout undo.
Questions:
What's the reason for this behavior?
How can it be fixed?
Kubernetes version: v1.16.2
Edit 1.
The number of errors changes each time but remains in the range of 10-20 when removing 2-5 pods in two minutes.
P.S. If we don't delete a pod, we don't get any errors.
Does Kubernetes support green-blue deployment?
Yes, it does. You can read about it in Zero-downtime Deployment in Kubernetes with Jenkins:
A blue/green deployment is a change management strategy for releasing software code. Blue/green deployments, which may also be referred to as A/B deployments require two identical hardware environments that are configured exactly the same way. While one environment is active and serving end users, the other environment remains idle.
Container technology offers a stand-alone environment to run the desired service, which makes it super easy to create identical environments as required in the blue/green deployment. The loosely coupled Services - ReplicaSets, and the label/selector-based service routing in Kubernetes make it easy to switch between different backend environments.
I would also recommend reading Kubernetes Infrastructure Blue/Green deployments.
Here is a repository with examples from codefresh.io about blue green deployment.
This repository holds a bash script that allows you to perform blue/green deployments on a Kubernetes cluster. See also the respective blog post
Prerequisites
As a convention the script expects
The name of your deployment to be $APP_NAME-$VERSION
Your deployment should have a label that shows its version
Your service should point to the deployment by using a version selector, pointing to the corresponding label in the deployment
Notice that the new color deployment created by the script will follow the same conventions. This way each subsequent pipeline you run will work in the same manner.
You can see examples of the tags with the sample application:
service
deployment
You might be also interested in Canary deployment:
Another deployment strategy is using Canaries (a.k.a. incremental rollouts). With canaries, the new version of the application is gradually deployed to the Kubernetes cluster while getting a very small amount of live traffic (i.e. a subset of live users are connecting to the new version while the rest are still using the previous version).
...
The small subset of live traffic to the new version acts as an early warning for potential problems that might be present in the new code. As our confidence increases, more canaries are created and more users are now connecting to the updated version. In the end, all live traffic goes to canaries, and thus the canary version becomes the new “production version”.
EDIT
Questions:
What's the reason for this behavior?
When a new deployment is applied, old pods are removed and new ones are scheduled.
This is done by the Control Plane:
For example, when you use the Kubernetes API to create a Deployment, you provide a new desired state for the system. The Kubernetes Control Plane records that object creation, and carries out your instructions by starting the required applications and scheduling them to cluster nodes–thus making the cluster’s actual state match the desired state.
You have set up a readinessProbe, which tells the Service whether it should send traffic to the pod. That alone is not enough: as your example shows, if you have 10 pods and remove one or two, there is a gap during which you receive socket errors.
How to fix it?
You have to understand that this is not broken, so it doesn't need a fix.
It can be mitigated by implementing a check in your application to make sure it is sending requests to a working address, or by using other features such as load balancing through an ingress.
Also, when you are updating a deployment, you can check before deleting a pod whether it has any incoming/outgoing traffic, and roll the update only onto unused pods.
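One commonly used mitigation for this gap, offered here only as a hedged sketch, is a short preStop sleep on the container from the question's Deployment: the pod keeps serving for a few seconds while its removal from the Service endpoints propagates, and only then receives SIGTERM. The 5-second value is an assumption and would need tuning for a given cluster.

```yaml
# Fragment of the Deployment's pod template; the sleep duration is an assumed value.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: service
        image: host.org/images/grace:v0.1
        lifecycle:
          preStop:
            exec:
              # Keep serving briefly so that endpoint removal can propagate
              # before the server receives SIGTERM. Requires `sleep` in the image.
              command: ["sleep", "5"]
```

If the image is built from scratch and has no `sleep` binary, the same delay would have to be implemented inside the application's own shutdown handler.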

Is there a way to downscale pods only when message is processed (the pod finished its task) with the HorizontalPodAutoscaler in Kubernetes?

I've set up a Kubernetes Horizontal Pod Autoscaler with custom metrics using the Prometheus adapter https://github.com/DirectXMan12/k8s-prometheus-adapter. Prometheus is monitoring RabbitMQ, and I'm watching the rabbitmq_queue_messages metric. The messages from the queue are picked up by the pods, which then do some processing that can last for several hours.
Scale-up and scale-down work based on the number of messages in the queue.
The problem:
When a pod finishes processing and acks the message, the number of messages in the queue drops, which can trigger the Autoscaler to terminate a pod. If I have multiple pods doing the processing and one of them finishes, then, if I'm not mistaken, Kubernetes could terminate a pod that is still processing its own message. This wouldn't be desirable, as all the processing that pod has done would be lost.
Is there a way to overcome this, or another way this could be achieved?
Here is the Autoscaler configuration:
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: sample-app-rabbitmq
  namespace: monitoring
spec:
  scaleTargetRef:
    # you created above
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: rabbitmq-cluster
      metricName: rabbitmq_queue_messages_ready
      targetValue: 5
You could consider an approach using a preStop hook.
As per the documentation (Container States; Define postStart and preStop handlers):
Before a container enters into Terminated, preStop hook (if any) is executed.
So you can use this in your deployment:
lifecycle:
  preStop:
    exec:
      command: ["your script"]
Update:
I would like to provide more information after some research.
There is an interesting project:
KEDA allows for fine grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
KEDA can run on both the cloud and the edge, integrates natively with Kubernetes components such as the Horizontal Pod Autoscaler, and has no external dependencies.
Regarding the main question, "Kubernetes could terminate a pod that is still doing the processing of its own message":
As per the documentation:
"Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features"
A Deployment is backed by a ReplicaSet. In the ReplicaSet controller code there is a function "getPodsToDelete". In combination with "filteredPods", it gives the result: "This ensures that we delete pods in the earlier stages whenever possible."
So, as a proof of concept:
You can create a deployment with an init container. The init container should check whether there is a message in the queue and exit when at least one message appears. This allows the main container to start, take that message, and process it. We then have two kinds of pods: those that are processing a message and consuming CPU, and those that are still starting, idle, and waiting for the next message. The starting pods will be deleted first when the HPA decides to decrease the number of replicas in the deployment.
apiVersion: apps/v1   # extensions/v1beta1 is no longer served for Deployments as of Kubernetes 1.16
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
      - name: complete
        command:
        - "bash"
        args:
        - "-c"
        - "wa=$(shuf -i 15-30 -n 1)&& echo $wa && sleep $wa"
        image: ubuntu
        imagePullPolicy: IfNotPresent
        resources: {}
      initContainers:
      - name: wait-for
        image: ubuntu
        command: ['bash', '-c', 'sleep 30']
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
Hope this helps.
The Horizontal Pod Autoscaler is not designed for long-running tasks and will not be a good fit. If you need to spawn one long-running processing task per message, I'd take one of these two approaches:
Use a task queue such as Celery. It is designed to solve your exact problem: have a queue of tasks that needs to be distributed to workers, and ensure that the tasks run to completion. Kubernetes even provides an official example of this setup.
If you don't want to introduce another component such as Celery, you can spawn a Kubernetes job for every incoming message by yourself. Kubernetes will make sure that the job runs to completion at least once - reschedule the pod if it dies, etc. In this case you will need to write a script that reads RabbitMQ messages and creates jobs for them by yourself.
In both cases, make sure you also have Cluster Autoscaler enabled so that new nodes get automatically provisioned if your current nodes are not sufficient to handle the load.
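For the second approach, the object created per message could look roughly like the sketch below. The name prefix, image, retry budget, and the MESSAGE_ID variable are all hypothetical; how the worker actually receives its message is up to the script that creates the Job.

```yaml
# Sketch of a per-message Job; every name and value here is an illustrative assumption.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: process-message-       # hypothetical prefix; needs `kubectl create -f`, not `kubectl apply`
spec:
  backoffLimit: 3                       # assumed retry budget
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: example.org/message-worker:latest   # placeholder image
        env:
        - name: MESSAGE_ID              # hypothetical: tells the worker which message to process
          value: "replace-me"
```

Because the Job controller, not the HPA, owns these pods, a scale-down of the consumer Deployment does not touch a Job that is still running.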

Have kube jobs start on waiting pods

I am working on a scenario where I want to maintain some number X of waiting pods (managed by kube) and then, upon user request (via some external system), have a kube job start on one of those waiting pods. The waiting pod count is then X-1, and kube starts another pod to bring this number back to X.
This way I'll be able to cut down on the time taken to create a pod, start a container, and get it ready to start the actual processing. The processing data can be sent to those pods via some sort of messaging (akka or rabbitmq).
I think ReplicationControllers are the best place to keep idle pods, but when I create a job, how can I specify that I want to use one of the waiting pods managed by the ReplicationController?
I think I got this to work up to a state on top of which I can build this solution.
So what I am doing is starting an RC with replicas: X (X is the number of idle pods I wish to maintain, usually not a very large number). The pods that it starts have a custom label, status: idle or something like that. The RC spec.selector has the same custom label value to match the pods that it manages, so spec.selector.status: idle. When creating this RC, kube ensures that it creates X pods with status=idle. Somewhat like below:
apiVersion: v1
kind: ReplicationController
metadata:
  name: testrc
spec:
  replicas: 3
  selector:
    status: idle
  template:
    metadata:
      name: idlepod
      labels:
        status: idle
    spec:
      containers:
      ...
On the other hand I have a job yaml that has spec.manualSelector: true (and yes I have taken into account that the label set has to be unique). With manualSelector enabled, I can now define selectors on the job like below.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: testjob-
spec:
  manualSelector: true
  selector:
    matchLabels:
      status: active
  ...
So clearly, RC creates pods with status=idle and job expects to use pods with status=active because of the selector.
So whenever I have a request to start a new job, I'll update the label on one of the pods managed by the RC so that its status=active. Because of its selector, the RC releases this pod from its control and, because of replicas: X, starts another one. The released pod is no longer controlled by the RC and is now orphaned. Finally, when I create a job, the selector on the job will match the label of the orphaned pod, and this pod will then be controlled by the new job. I'll send messages to this pod that start the actual processing and finally bring it to completion.
P.S.: Pardon my formatting. I am new here.

How to set a time limit for a Kubernetes job?

I'd like to launch a Kubernetes job and give it a fixed deadline to finish. If the pod is still running when the deadline comes, I'd like the job to automatically be killed.
Does something like this exist? (At first I thought that the Job spec's activeDeadlineSeconds covered this use case, but now I see that activeDeadlineSeconds only places a limit on when a job is re-tried; it doesn't actively kill a slow/runaway job.)
You can self-impose a timeout on the container's entrypoint command by using the GNU timeout utility.
For example, the following Job, which computes the first 4000 digits of pi, will time out after 10 seconds:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["/usr/bin/timeout", "10", "perl", "-Mbignum=bpi", "-wle", "print bpi(4000)"]
      restartPolicy: Never
(Manifest adapted from https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#running-an-example-job)
You can play with the numbers and see it timeout or not. Typically computing 4000 digits of pi takes ~23 seconds on my workstation, so if you set it to 5 seconds it'll probably always fail and if you set it to 120 seconds it will always work.
As I understand the activeDeadlineSeconds section of the documentation, it refers to the active time of a Job; after this time the Job is considered Failed.
Official doc statement:
The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded
https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
You could instead add the activeDeadlineSeconds to the pod spec in the pod template defined as part of the job. This way the pods which are spawned by the job are limited with the timeout.
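To make the two placements concrete, here is a sketch based on the pi example from the earlier answer, without the timeout wrapper. The Job name and the 10-second value are assumptions; the commented-out line marks where the pod-level field would go instead.

```yaml
# Sketch: Job-level deadline; the pod-level variant is shown commented out.
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-deadline        # hypothetical name
spec:
  activeDeadlineSeconds: 10     # Job-level: running Pods are terminated and the Job fails with DeadlineExceeded
  template:
    spec:
      # activeDeadlineSeconds: 10   # Pod-level alternative: limits each Pod the Job spawns
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(4000)"]
      restartPolicy: Never
```

With the Job-level field the deadline covers the whole Job, retries included, whereas the pod-level field restarts the clock for every new pod.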

Make Kubernetes wait for Pod termination before removing from Service endpoints

According to Termination of Pods, step 7 occurs simultaneously with 3. Is there any way I can prevent this from happening and have 7 occur only after the Pod's graceful termination (or expiration of the grace period)?
The reason why I need this is that my Pod's termination routine requires my-service-X.my-namespace.svc.cluster.local to resolve to the Pod's IP during the whole process, but the corresponding Endpoint gets removed as soon as I run kubectl delete on the Pod / Deployment.
Note: In case it helps making this clear, I'm running a bunch of clustered VerneMQ (Erlang) nodes which, on termination, dump their contents to other nodes on the cluster — hence the need for the nodenames to resolve correctly during the whole termination process. Only then should the corresponding Endpoints be removed.
Unfortunately, Kubernetes was designed to remove the Pod from the endpoints at the same time as the preStop hook is started (see the link to the Kubernetes docs in the question):
At the same time as the kubelet is starting graceful shutdown, the control plane removes that shutting-down Pod from Endpoints
This Google Kubernetes documentation says it even more clearly:
Pod is set to the “Terminating” State and removed from the endpoints list of all Services
There was also a feature request for that, which was not accepted.
Solution for helm users
But if you are using helm, you can use hooks (e.g. pre-delete,pre-upgrade,pre-rollback). Unfortunately this helm hook is an extra pod which can not access all pod resources.
This is an example for a hook:
apiVersion: batch/v1
kind: Job
metadata:
  name: graceful-shutdown-hook
  annotations:
    "helm.sh/hook": pre-delete,pre-upgrade,pre-rollback
  labels:
    app.kubernetes.io/name: graceful-shutdown-hook
spec:
  template:
    spec:
      containers:
      - name: graceful-shutdown
        image: busybox:1.28.2
        command: ['sh', '-cx', '/bin/sleep 15']
      restartPolicy: Never
  backoffLimit: 0
Maybe you should consider using a headless service instead of a ClusterIP one. That way your apps will discover each other using the actual endpoint IPs, and removal from the endpoint list will not break availability during shutdown; it will only remove the pod from discovery (or from, e.g., ingress controller backends in nginx contrib).
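A headless Service is a normal Service with clusterIP set to None, roughly as sketched below. The lowercase name stands in for the question's my-service-X placeholder, and the selector label and port are assumptions about the VerneMQ setup.

```yaml
# Sketch of a headless Service; the label and port are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: my-service-x         # stands in for the question's placeholder name
  namespace: my-namespace
spec:
  clusterIP: None            # headless: DNS returns the pod IPs directly, with no virtual IP
  selector:
    app: vernemq             # hypothetical label on the VerneMQ pods
  ports:
  - name: mqtt
    port: 1883               # assumed port
```

Whether the DNS name keeps resolving to a pod for the whole of its termination is worth verifying on the target cluster, since headless DNS records are also derived from the endpoints list.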