How can I have Kubernetes automatically restart a container which purposefully exits in order to get new data from environment variables?
I have a container running on a Kubernetes cluster which operates in the following fashion:
Container starts, polls for work
If it receives a task, it does some work
It polls for work again, until ...
.. the container has been running for over a certain period of time, after which it exits instead of polling for more work.
It needs to be continually restarted, as it uses environment variables which are populated by Kubernetes secrets which are periodically refreshed by another process.
I've tried a Deployment, but it doesn't seem like the right fit as I get CrashLoopBackOff status, which means the worker is scheduled less and less often.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fonky-worker
  labels:
    app: my-fonky-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-fonky-worker
  template:
    metadata:
      labels:
        app: my-fonky-worker
    spec:
      containers:
      - name: my-fonky-worker-container
        image: my-fonky-worker:latest
        env:
        - name: NOTSOSECRETSTUFF
          value: cats_are_great
        - name: SECRETSTUFF
          valueFrom:
            secretKeyRef:
              name: secret-name
              key: secret-key
I've also tried a CronJob, but that seems a bit hacky as it could mean that the container is left in the stopped state for several seconds.
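As @Josh said, the container needs to exit with code 0, otherwise it will be treated as a failed container. The Pod lifecycle documentation covers this case in its first example: "Pod is running and has one Container. Container exits with success." If your restartPolicy is set to Always (which is the default, by the way), the container is restarted. The Pod status keeps showing Running, but if you look at the pod's logs and restart count you can see that the container was restarted. Note that the kubelet still applies an exponential back-off delay between restarts; the back-off is reset once a container has run for 10 minutes without problems.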
"It needs to be continually restarted, as it uses environment variables which are populated by Kubernetes secrets which are periodically refreshed by another process."
I would take a different approach to this. I would mount the ConfigMap as a volume, as explained in the documentation; the mounted ConfigMap data is then refreshed automatically (the same applies to Secrets mounted as volumes). Note: the refresh delay can be as long as the kubelet sync period (1 minute by default) plus the TTL of the ConfigMap cache in the kubelet (1 minute by default), so take that into account when deciding how quickly the Pod needs to see new data.
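As a rough illustration of the volume approach applied to the worker from the question, the pod template could mount the Secret as files instead of injecting it through env. This is only a sketch: the volume name worker-secrets and the mount path /etc/worker-secrets are assumptions, and the application would have to re-read the file rather than read an environment variable.

```yaml
# Fragment of the Deployment's pod template (spec.template.spec).
# Assumed names: volume "worker-secrets", mount path /etc/worker-secrets.
containers:
- name: my-fonky-worker-container
  image: my-fonky-worker:latest
  env:
  - name: NOTSOSECRETSTUFF
    value: cats_are_great
  volumeMounts:
  - name: worker-secrets
    mountPath: /etc/worker-secrets   # the key shows up as /etc/worker-secrets/secret-key
    readOnly: true
volumes:
- name: worker-secrets
  secret:
    secretName: secret-name
```

With this layout the worker no longer needs to exit just to pick up a new secret value, so the restart problem may go away entirely.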
What I see as a solution for this would be to run your container as a CronJob, but don't use startingDeadlineSeconds as your container killer.
It runs on its schedule.
In your container you can have it poll for work N times.
After N times it exits 0.
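A minimal sketch of what such a CronJob could look like for the worker in the question. The schedule is an assumption and should roughly match how long N polls take; the container and secret names are taken from the question's Deployment.

```yaml
apiVersion: batch/v1          # batch/v1beta1 on clusters older than 1.21
kind: CronJob
metadata:
  name: my-fonky-worker
spec:
  schedule: "*/5 * * * *"     # assumed: start a new run every five minutes
  concurrencyPolicy: Allow    # a new run may start while the previous one is still polling
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: my-fonky-worker-container
            image: my-fonky-worker:latest
            env:
            - name: SECRETSTUFF
              valueFrom:
                secretKeyRef:
                  name: secret-name
                  key: secret-key
```

Each run starts a fresh pod, so the environment variable is read from the current value of the Secret at that moment.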
If I understood correctly in your example there are 2 problems:
Restarting container
Updating secret values
To keep your secrets up to date, you should consider mounting the secret as a volume instead of exposing it as an environment variable, as described in Amit Kumar Gupta's comment.
As for the second problem, whether the container is restarted depends on its exit code, as described by garlicFrancium.
Alternatively, you can use an init container that waits for new tasks and a main container that processes those tasks according to your requirements, or create a job scheduler.
apiVersion: apps/v1  # extensions/v1beta1 was removed in Kubernetes 1.16
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 1
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
      - name: complete
        command:
        - "bash"
        args:
        - "-c"
        - "wa=$(shuf -i 15-30 -n 1)&& echo $wa && sleep $wa"
        image: ubuntu
        imagePullPolicy: IfNotPresent
        resources: {}
      initContainers:
      - name: wait-for
        image: ubuntu
        command: ['bash', '-c', 'sleep 30']
      restartPolicy: Always
Please note:
When a secret that is already consumed in a volume is updated, projected keys are eventually updated as well. The kubelet checks whether the mounted secret is fresh on every periodic sync. However, it uses its local cache to get the current value of the Secret.
The type of the cache is configurable using the ConfigMapAndSecretChangeDetectionStrategy field in the KubeletConfiguration struct. It can be either propagated via watch (default), TTL-based, or simply redirecting all requests directly to kube-apiserver. As a result, the total delay from the moment when the Secret is updated to the moment when new keys are projected to the Pod can be as long as kubelet sync period + cache propagation delay, where cache propagation delay depends on the chosen cache type (it equals the watch propagation delay, the TTL of the cache, or zero, respectively).
A container using a Secret as a subPath volume mount will not receive Secret updates.
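For orientation, the two kubelet settings mentioned in the note above live in the kubelet configuration file. A sketch with what I understand to be the default values; on managed clusters you may not be able to change these at all.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
syncFrequency: 1m                                 # the kubelet sync period
configMapAndSecretChangeDetectionStrategy: Watch  # alternatives: Cache (TTL-based), Get (always ask the API server)
```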
Please refer also to:
Fine Parallel Processing Using a Work Queue
Related
I have a microservice for handling retention policy.
This application has default configuration for retention, e.g.: size for retention, files location etc.
But we also want to create an API that lets the user change this configuration with customized values at runtime.
I created a ConfigMap with the default values, and in the application I used the k8s client library to get/update/watch the ConfigMap.
My question is: is it correct to use a ConfigMap for dynamic business configuration, or is it meant for static configuration that the user is not supposed to touch at runtime?
Thanks in advance
There are no rules against it. A lot of software leverages the kube API to implement some kind of logic or state, e.g. leader election. All of those require the app to apply changes to a kube resource. With that in mind, do remember that it always puts some additional load on your API server, and if you're unlucky that might become an issue. About two years ago we experienced API limit exhaustion on one of the managed k8s services because we were running a lot of deployments that had rather intensive leader election logic (2 requests per pod every 5 seconds). That issue is long gone, but it shows what you have to take into account when designing interactions like this (retries, backoffs, etc.).
Using ConfigMaps is perfectly fine for such use cases. You can use a client library to watch for updates on the given ConfigMap; however, a cleaner solution would be to mount the ConfigMap as a file into the pod and have your configuration set up from that file. Since you're mounting the ConfigMap as a volume, changes become visible within the pod without a pod restart (unlike env variables, which only "refresh" once the pod gets recreated).
Let's say you have this configMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  SPECIAL_LEVEL: very
  SPECIAL_TYPE: charm
And then you mount this configMap as a Volume into your Pod:
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
  - name: test-container
    image: registry.k8s.io/busybox
    command: [ "/bin/sh", "-c", "ls /etc/config/" ]
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config
  volumes:
  - name: config-volume
    configMap:
      # Provide the name of the ConfigMap containing the files you want
      # to add to the container
      name: special-config
  restartPolicy: Never
When the pod runs, the command ls /etc/config/ produces the output below:
SPECIAL_LEVEL
SPECIAL_TYPE
This way you would also reduce "noise" to the API-Server as you can simply query the given files for updates to any configuration.
I have a specific scenario where I'd like to have a deployment controlled by horizontal pod autoscaling. To handle database migrations in pods when pushing a new deployment, I followed this excellent tutorial by Andrew Lock here.
In short, you must define an initContainer that waits for a Kubernetes Job to complete a process (like running db migrations) before the new pods can run.
This works well, however, I'm not sure how to handle HPA after the initial deployment because if the system detects the need to add another Pod in my node, the initContainer defined in my deployment requires a Job to be deployed and run, but since Jobs are one-off processes, the pod can not initialize and run properly (a ttlSecondsAfterFinished attribute removes the Job anyways).
How can I define an initContainer to run when I deploy my app so I can push my database migrations in a Job, but also allow HPA to control dynamically adding a Pod without needing an initContainer?
Here's what my deployment looks like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: graphql-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: graphql-pod
  template:
    metadata:
      labels:
        app: graphql-pod
    spec:
      initContainers:
      - name: wait-for-graphql-migration-job
        image: groundnuty/k8s-wait-for:v1.4 # This is an image that waits for a process to complete
        args:
        - job
        - graphql-migration-job # this job is defined next
      containers:
      - name: graphql-container
        image: image(graphql):tag(graphql)
The following Job is also deployed
apiVersion: batch/v1
kind: Job
metadata:
  name: graphql-migration-job
spec:
  ttlSecondsAfterFinished: 30
  template:
    spec:
      containers:
      - name: graphql-migration-container
        image: image(graphql):tag(graphql)
        command: ["npm", "run", "migrate:reset"]
      restartPolicy: Never
So basically what happens is:
I deploy these two resources to my node
Job is initialized
initContainer on Pod waits for Job to complete using an image called groundnuty/k8s-wait-for:v1.4
Job completes
initContainer completes
Pod initializes
(after 30 TTL seconds) Job is removed from node
(LOTS OF TRAFFIC)
HPA realizes a need for another pod
initContainer for NEW pod is started, but can't run because Job doesn't exist
...crashLoopBackOff
Would love any insight on the proper way to handle this scenario!
There is, unfortunately, no simple Kubernetes feature to resolve your issue.
I recommend extending your deployment tooling/scripts to separate the migration job and your deployment. During the deploy process, you first execute the migration job and then deploy your deployment. Without the job attached, the HPA can nicely scale your pods.
There is a multitude of ways to achieve this:
Have a script (bash or similar) that first executes the job, waits for it to complete, and then updates your deployment
Leverage more complex deployment tooling like Helm, which allows you to add a 'pre-install hook' to your job so that it is executed when you deploy your application
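To make the Helm option a bit more concrete, here is a sketch of the migration Job from the question turned into a chart hook. The file location inside the chart and the extra pre-upgrade hook are my assumptions; the annotation keys are the standard Helm hook annotations.

```yaml
# templates/migration-job.yaml inside a Helm chart (file name is an assumption)
apiVersion: batch/v1
kind: Job
metadata:
  name: graphql-migration-job
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      containers:
      - name: graphql-migration-container
        image: image(graphql):tag(graphql)   # placeholder kept from the question
        command: ["npm", "run", "migrate:reset"]
      restartPolicy: Never
```

Helm waits for the hook Job to finish before it creates or updates the Deployment, so the initContainer can be dropped from the pod template and pods added later by the HPA start without looking for the Job.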
My yaml file
apiVersion: batch/v1
kind: Job
metadata:
  name: auto
  labels:
    app: auto
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 100
  template:
    metadata:
      labels:
        app: auto
    spec:
      containers:
      - name: auto
        image: busybox
        imagePullPolicy: Always
        ports:
        - containerPort: 9080
      imagePullSecrets:
      - name: imageregistery
      restartPolicy: Never
The pods are killed appropriately, but the Job itself is not removed after the 100 seconds.
Is there anything we could do to remove the Job once the container/pod has completed its work?
kubectl version --short
Client Version: v1.6.1
Server Version: v1.13.10+IKS
kubectl get jobs --namespace abc
NAME   DESIRED   SUCCESSFUL   AGE
auto   1         1            26m
Thank you,
The default way to delete jobs after they are done is to use the kubectl delete command.
As mentioned by @Erez:
"Kubernetes is keeping pods around so you can get the logs, configuration etc. from it."
If you don't want to do that manually, you could write a script running in your cluster that checks for jobs with a completed status and then deletes them.
Another way would be to use the TTL feature, which deletes jobs automatically a specified number of seconds after they finish. If you set it to zero, they are cleaned up immediately. For more details on how to set it up, see the Kubernetes documentation on the TTL mechanism for finished Jobs.
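As a sketch, the TTL option amounts to one extra field on the Job from the question. The value of 100 seconds is an assumption. On a cluster as old as the 1.13 server shown above, the feature was still alpha behind the TTLAfterFinished feature gate, so it may simply not be available there.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: auto
spec:
  ttlSecondsAfterFinished: 100   # assumed value; 0 deletes the Job as soon as it finishes
  backoffLimit: 5
  activeDeadlineSeconds: 100
  template:
    spec:
      containers:
      - name: auto
        image: busybox
      restartPolicy: Never
```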
Please let me know if that helped.
I've set up the Kubernetes Horizontal Pod Autoscaler with custom metrics using the Prometheus adapter https://github.com/DirectXMan12/k8s-prometheus-adapter. Prometheus is monitoring RabbitMQ, and I'm watching the rabbitmq_queue_messages metric. The messages from the queue are picked up by the pods, which then do some processing that can last for several hours.
The scale-up and scale-down is working based on the number of messages in the queue.
The problem:
When a pod finishes processing and acks the message, the number of messages in the queue drops, and that triggers the Autoscaler to terminate a pod. If I have multiple pods doing the processing and one of them finishes, then, if I'm not mistaken, Kubernetes could terminate a pod that is still processing its own message. This wouldn't be desirable, as all the processing that the pod has done would be lost.
Is there a way to overcome this, or another way this could be achieved?
Here is the Autoscaler configuration:
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: sample-app-rabbitmq
  namespace: monitoring
spec:
  scaleTargetRef:
    # you created above
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: rabbitmq-cluster
      metricName: rabbitmq_queue_messages_ready
      targetValue: 5
You could consider an approach using a preStop hook.
As per the documentation (Container States; Define postStart and preStop handlers):
Before a container enters into Terminated, preStop hook (if any) is executed.
So you can use in your deployment:
lifecycle:
  preStop:
    exec:
      command: ["your script"]
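A slightly fuller sketch of how the hook might sit in the pod template. The image, the script path, and the one-hour grace period are all assumptions. The important detail is that the preStop hook runs inside the termination grace period, so terminationGracePeriodSeconds has to be at least as long as the longest processing time you want to protect.

```yaml
# Fragment of a Deployment (spec.template.spec); names and values are hypothetical.
terminationGracePeriodSeconds: 3600   # upper bound on preStop plus shutdown time
containers:
- name: sample-app
  image: sample-app:latest            # hypothetical image
  lifecycle:
    preStop:
      exec:
        # hypothetical script that blocks until the current message has been acked
        command: ["/bin/sh", "-c", "/app/wait-until-idle.sh"]
```

For jobs that run for several hours this only works if you are comfortable with a grace period of that length; otherwise the container is killed when the grace period runs out, regardless of the hook.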
### update:
I would like to provide more information due to some research:
There is an interesting project:
KEDA allows for fine grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
KEDA can run on both the cloud and the edge, integrates natively with Kubernetes components such as the Horizontal Pod Autoscaler, and has no external dependencies.
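For reference, a KEDA setup equivalent to the HPA from the question might look roughly like this. It is a sketch based on my understanding of the KEDA 2.x RabbitMQ scaler; the queue name and the environment variable holding the connection string are hypothetical, and the field names should be checked against the KEDA version you install.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sample-app-rabbitmq
  namespace: monitoring
spec:
  scaleTargetRef:
    name: sample-app              # the Deployment from the question
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: rabbitmq
    metadata:
      queueName: tasks            # hypothetical queue name
      mode: QueueLength
      value: "5"
      hostFromEnv: RABBITMQ_URL   # hypothetical env var on the target container holding the AMQP URL
```

KEDA drives a regular HPA under the hood, so on its own this does not change which pod gets removed on scale-down.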
For the main question, "Kubernetes could terminate a pod that is still doing the processing of its own message":
As per documentation:
"Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features"
A Deployment is backed by a ReplicaSet. In the ReplicaSet controller code there is a function "getPodsToDelete". In combination with "filteredPods" it gives the result: "This ensures that we delete pods in the earlier stages whenever possible."
So as a proof of concept:
You can create a deployment with an init container. The init container should check whether there is a message in the queue and exit when at least one message appears. This allows the main container to start, take that message, and process it. In this case we will have two kinds of pods: those which are processing a message and consuming CPU, and those which are in the starting state, idle and waiting for the next message. The starting pods will be deleted first when the HPA decides to decrease the number of replicas in the deployment.
apiVersion: apps/v1  # extensions/v1beta1 was removed in Kubernetes 1.16
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
      - name: complete
        command:
        - "bash"
        args:
        - "-c"
        - "wa=$(shuf -i 15-30 -n 1)&& echo $wa && sleep $wa"
        image: ubuntu
        imagePullPolicy: IfNotPresent
        resources: {}
      initContainers:
      - name: wait-for
        image: ubuntu
        command: ['bash', '-c', 'sleep 30']
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
Hope this helps.
Horizontal Pod Autoscaler is not designed for long-running tasks and will not be a good fit. If you need to spawn one long-running processing task per message, I'd take one of these two approaches:
Use a task queue such as Celery. It is designed to solve your exact problem: have a queue of tasks that needs to be distributed to workers, and ensure that the tasks run to completion. Kubernetes even provides an official example of this setup.
If you don't want to introduce another component such as Celery, you can spawn a Kubernetes job for every incoming message by yourself. Kubernetes will make sure that the job runs to completion at least once - reschedule the pod if it dies, etc. In this case you will need to write a script that reads RabbitMQ messages and creates jobs for them by yourself.
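For the second approach, the script would create something like the following Job for each message it reads. This is only a sketch: the image, the MESSAGE_ID variable, and the way the message is handed to the container are all hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: process-message-   # create with `kubectl create -f`; `kubectl apply` rejects generateName
spec:
  backoffLimit: 3                  # assumed retry budget
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: processor
        image: sample-app:latest   # hypothetical image
        env:
        - name: MESSAGE_ID         # hypothetical: identifies the message to process
          value: "12345"
```

Because a Job is not scaled down by an autoscaler, a pod that is halfway through a message is left alone until it finishes or fails.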
In both cases, make sure you also have Cluster Autoscaler enabled so that new nodes get automatically provisioned if your current nodes are not sufficient to handle the load.
Running Kubernetes v1.2.2 on CoreOS on VMware:
I have a pod with the restart policy set to Never. Is it possible to manually start the same pod back up?
In my use case we will have a postgres instance in this pod. If it were to crash, I would like to leave the pod in a failed state until we can look at it more closely to see why it failed, and then start it manually, rather than have it restart automatically with a restartPolicy of Always.
Looking through kubectl, it doesn't seem like there is a manual start option. I could delete and recreate the pod, but I think this would remove the data from my container. Maybe I should be mounting a local volume on my host, so that I don't need to worry about losing data?
This is my sample pod YAML. I don't seem to be able to restart the 'health' pod.
apiVersion: v1
kind: Pod
metadata:
  name: health
  labels:
    environment: dev
    app: health
spec:
  containers:
  - image: busybox
    command:
    - sleep
    - "3600"
    imagePullPolicy: IfNotPresent
    name: busybox
  restartPolicy: Never
One simple method that might address your needs is to add a unique instance label, maybe a simple counter. If each pod is labelled differently you can start as many as you like and keep around as many failed instances as you like. Pod names must also be unique within a namespace, so the counter is appended to the name as well, and label values must be strings, so the counter is quoted.
e.g. first pod
apiVersion: v1
kind: Pod
metadata:
  name: health-0
  labels:
    environment: dev
    app: health
    instance: "0"
spec:
  containers: ...
second pod
apiVersion: v1
kind: Pod
metadata:
  name: health-1
  labels:
    environment: dev
    app: health
    instance: "1"
spec:
  containers: ...
Based on your question and comments, it sounds like you want to restart a failed container while retaining its state and data. In fact, application containers and pods are considered to be relatively ephemeral (rather than durable) entities. When a container crashes, its files are lost and the kubelet restarts it with a clean state.
To retain your data and logs, use persistent volume types in your deployment. This lets you preserve data across container restarts.
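A minimal sketch of that idea for the postgres use case from the question. The claim name, size, and password are placeholders, and it assumes the cluster has a PersistentVolume (or a default storage class) that can satisfy the claim.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data              # hypothetical claim name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi                 # assumed size
---
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  restartPolicy: Never             # stays failed for inspection, as asked in the question
  containers:
  - name: postgres
    image: postgres                # pin a version in practice
    env:
    - name: POSTGRES_PASSWORD
      value: change-me             # placeholder; use a Secret in practice
    - name: PGDATA
      value: /var/lib/postgresql/data/pgdata
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: postgres-data
```

With the data on the claim, deleting the failed pod and creating a new one that references the same claim brings postgres back up on the existing data directory.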