Horizontal Pod Autoscaling (HPA) with an initContainer that requires a Job - kubernetes

I have a specific scenario where I'd like to have a deployment controlled by horizontal pod autoscaling. To handle database migrations in pods when pushing a new deployment, I followed this excellent tutorial by Andrew Lock here.
In short, you must define an initContainer that waits for a Kubernetes Job to complete a process (like running db migrations) before the new pods can run.
This works well. However, I'm not sure how to handle HPA after the initial deployment: if the system detects the need to add another Pod to my node, the initContainer defined in my deployment requires a Job to be deployed and run first, but since Jobs are one-off processes, the new pod cannot initialize and run properly (and the ttlSecondsAfterFinished attribute removes the Job anyway).
How can I define an initContainer that runs when I deploy my app, so I can apply my database migrations in a Job, but also allow HPA to add Pods dynamically without the initContainer blocking them?
Here's what my deployment looks like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: graphql-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: graphql-pod
  template:
    metadata:
      labels:
        app: graphql-pod
    spec:
      initContainers:
        - name: wait-for-graphql-migration-job
          image: groundnuty/k8s-wait-for:v1.4 # This is an image that waits for a process to complete
          args:
            - job
            - graphql-migration-job # this job is defined next
      containers:
        - name: graphql-container
          image: image(graphql):tag(graphql)
The following Job is also deployed
apiVersion: batch/v1
kind: Job
metadata:
  name: graphql-migration-job
spec:
  ttlSecondsAfterFinished: 30
  template:
    spec:
      containers:
        - name: graphql-migration-container
          image: image(graphql):tag(graphql)
          command: ["npm", "run", "migrate:reset"]
      restartPolicy: Never
So basically what happens is:
I deploy these two resources to my node
Job is initialized
initContainer on Pod waits for Job to complete using an image called groundnuty/k8s-wait-for:v1.4
Job completes
initContainer completes
Pod initializes
(after 30 TTL seconds) Job is removed from node
(LOTS OF TRAFFIC)
HPA realizes a need for another pod
initContainer for the NEW pod starts, but can't complete because the Job no longer exists
...CrashLoopBackOff
Would love any insight on the proper way to handle this scenario!

There is, unfortunately, no simple Kubernetes feature to resolve your issue.
I recommend extending your deployment tooling/scripts to separate the migration job from your deployment. During the deploy process, you first execute the migration job and then roll out your deployment. Without the job attached, the HPA can scale your pods freely.
There is a multitude of ways to achieve this:
Have a bash (or similar) script that first executes the job, waits for it to complete, and then updates your deployment (see the sketch after this list)
Leverage more complex deployment tooling like Helm, which lets you add a 'pre-install hook' to your Job so it is executed when you deploy your application
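For example, a minimal deploy script along those lines, assuming the resource names from the question, could be:
kubectl apply -f migration-job.yaml
kubectl wait --for=condition=complete --timeout=300s job/graphql-migration-job
kubectl apply -f deployment.yaml
With Helm, the same sequencing can be expressed by annotating the Job with Helm's hook annotations (hook-succeeded additionally cleans up the finished Job):
metadata:
  name: graphql-migration-job
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
Either way, the initContainer can then be dropped from the deployment, so pods added later by the HPA start without waiting on a Job that no longer exists.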

Related

AKS with Keda: pod are removed during execution

I tried KEDA with AKS and I really appreciate that pods are automatically instantiated based on the Azure DevOps job queue for releases and builds.
However, I noticed something strange: AKS/KEDA often removes a pod while it is still processing, which makes the workflow fail.
The message reads:
We stopped hearing from agent aks-linux-768d6647cc-ntmh4. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Expected behavior: a pod must complete its tasks before KEDA/AKS removes it.
Here is my KEDA YAML file:
# deployment.yaml
apiVersion: apps/v1 # The API resource where this workload resides
kind: Deployment # The kind of workload we're creating
metadata:
  name: aks-linux # This will be the name of the deployment
spec:
  selector: # Define the wrapping strategy
    matchLabels: # Match all pods with the defined labels
      app: aks-linux # Labels follow the `name: value` template
  replicas: 1
  template: # This is the template of the pod inside the deployment
    metadata: # Metadata for the pod
      labels:
        app: aks-linux
    spec:
      nodeSelector:
        agentpool: linux
      containers: # Here we define all containers
        - image: <My image here>
          name: aks-linux
          env:
            - name: "AZP_URL"
              value: "<myURL>"
            - name: "AZP_TOKEN"
              value: "<MyToken>"
            - name: "AZP_POOL"
              value: "<MyPool>"
          resources:
            requests: # Minimum amount of resources requested
              cpu: 2
              memory: 4096Mi
            limits: # Maximum amount of resources allowed
              cpu: 4
              memory: 8192Mi
I am using the latest versions of AKS and KEDA. Any ideas?
Check the official KEDA docs:
When running your agents as a deployment you have no control on which pod gets killed when scaling down.
So, to solve it you need to use ScaledJob:
If you run your agents as a Job, KEDA will start a Kubernetes job for each job that is in the agent pool queue. The agents will accept one job when they are started and terminate afterwards. Since an agent is always created for every pipeline job, you can achieve fully isolated build environments by using Kubernetes jobs.
See the docs there for how to implement it. A rough sketch follows.
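For illustration, a ScaledJob for Azure Pipelines agents could look roughly like the following (a sketch based on the KEDA azure-pipelines scaler documentation; verify the trigger field names against the KEDA version you run, and substitute your own image, URL, token, and pool):
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: aks-linux-scaledjob
spec:
  jobTargetRef:
    template:
      spec:
        nodeSelector:
          agentpool: linux
        containers:
          - name: aks-linux
            image: <My image here>
            env:
              - name: AZP_URL
                value: "<myURL>"
              - name: AZP_TOKEN
                value: "<MyToken>"
              - name: AZP_POOL
                value: "<MyPool>"
        restartPolicy: Never
  triggers:
    - type: azure-pipelines
      metadata:
        poolName: "<MyPool>"
        organizationURLFromEnv: "AZP_URL"
        personalAccessTokenFromEnv: "AZP_TOKEN"
Because each agent accepts exactly one pipeline job and then terminates, KEDA never scales down an agent that is still working.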

How to automatically restart pods when a new image ready

I'm using K8s on GCP.
Here is my deployment file
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simpleapp-direct
  labels:
    app: simpleapp-direct
    role: backend
    stage: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simpleapp-direct
      version: v0.0.1
  template:
    metadata:
      labels:
        app: simpleapp-direct
        version: v0.0.1
    spec:
      containers:
        - name: simpleapp-direct
          image: gcr.io/applive/simpleapp-direct:latest
          imagePullPolicy: Always
I first apply the deployment file with the kubectl apply command:
kubectl apply -f deployment.yaml
The pods were created properly.
I was expecting that every time I pushed a new image with the tag latest, the pods would automatically be killed and restarted using the new image.
I tried the rollout command:
kubectl rollout restart deploy simpleapp-direct
The pods restart as I wanted.
However, I don't want to run this command every time there is a new latest build.
How can I handle this situation?
Thanks a lot
Try using the image digest instead of a tag in your Pod definition.
Generally, there is no way to automatically restart pods when a new image is ready. It is generally advisable not to use image:latest (or just image_name) in Kubernetes, as it can cause difficulties with rolling back your deployment. You also need to make sure that imagePullPolicy is set to Always. Normally, when you use CI/CD or GitOps, your deployment is updated automatically by these tools once the new image is ready and has passed the tests.
When your Docker image is updated, you need to set up a trigger on this update within your CI/CD pipeline to re-run the deployment. Not sure about the base system/image where you build your Docker image, but you can add Kubernetes credentials there and run the same commands you run on your local computer.
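As a sketch of the digest approach, a CI step could resolve the digest of the image it just pushed and point the deployment at it (the commands below reuse the image and deployment names from the question):
DIGEST=$(gcloud container images describe gcr.io/applive/simpleapp-direct:latest --format='value(image_summary.digest)')
kubectl set image deployment/simpleapp-direct simpleapp-direct=gcr.io/applive/simpleapp-direct@${DIGEST}
Because the digest changes with every build, the pod template changes too, so Kubernetes rolls the pods automatically with no manual restart.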

Run containers which intentionally exit periodically

How can I have Kubernetes automatically restart a container which purposefully exits in order to pick up new data from environment variables?
I have a container running on a Kubernetes cluster which operates in the following fashion:
Container starts, polls for work
If it receives a task, it does some work
It polls for work again, until ...
...the container has been running for over a certain period of time, after which it exits instead of polling for more work.
It needs to be continually restarted, as it uses environment variables which are populated by Kubernetes secrets that are periodically refreshed by another process.
I've tried a Deployment, but it doesn't seem like the right fit as I get CrashLoopBackOff status, which means the worker is scheduled less and less often.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fonky-worker
  labels:
    app: my-fonky-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-fonky-worker
  template:
    metadata:
      labels:
        app: my-fonky-worker
    spec:
      containers:
        - name: my-fonky-worker-container
          image: my-fonky-worker:latest
          env:
            - name: NOTSOSECRETSTUFF
              value: cats_are_great
            - name: SECRETSTUFF
              valueFrom:
                secretKeyRef:
                  name: secret-name
                  key: secret-key
I've also tried a CronJob, but that seems a bit hacky as it could mean that the container is left in the stopped state for several seconds.
As @Josh said, you need to exit with exit 0, or it will be treated as a failed container! Here is the reference. According to the first example there ("Pod is running and has one Container. Container exits with success."), if your restartPolicy is set to Always (which is the default, by the way), the container will be restarted. The Pod status still shows Running, but if you inspect the pod you can see the container's restart count increasing.
It needs to be continually restarted, as it uses environment variables which are populated by Kubernates secrets which are periodically refreshed by another process.
I would take a different approach to this. I would mount the ConfigMap as explained here; this automatically refreshes the mounted ConfigMap data (Ref). A sketch follows below. Note: please take the "kubelet sync period (1 minute by default) + TTL of ConfigMaps cache (1 minute by default) in kubelet" into account when reasoning about the refresh rate of ConfigMap data in the Pod.
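For illustration, mounting the question's secret as a volume instead of an environment variable could look like this (a sketch; the mount path is my choice, and note that subPath mounts do not receive updates, as quoted later in this thread):
spec:
  containers:
    - name: my-fonky-worker-container
      image: my-fonky-worker:latest
      volumeMounts:
        - name: secret-volume
          mountPath: /etc/secrets # the app reads /etc/secrets/secret-key instead of $SECRETSTUFF
          readOnly: true
  volumes:
    - name: secret-volume
      secret:
        secretName: secret-name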
What I see as a solution for this would be to run your container as a CronJob, but don't use startingDeadlineSeconds as your container killer (see the sketch after this list):
It runs on its schedule.
In your container you can have it poll for work N times.
After N times it exits 0.
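A minimal sketch of that CronJob, reusing the names from the question (the schedule and concurrency policy are assumptions; use batch/v1beta1 on clusters older than 1.21):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-fonky-worker
spec:
  schedule: "*/15 * * * *" # start a fresh worker every 15 minutes (assumption)
  concurrencyPolicy: Forbid # don't start a new run while the previous one is still polling
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: my-fonky-worker-container
              image: my-fonky-worker:latest
              envFrom:
                - secretRef:
                    name: secret-name # fresh secret values are resolved at each start
          restartPolicy: Never
Each run starts with freshly resolved secret values, which also sidesteps the CrashLoopBackOff backoff entirely.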
If I understood correctly, there are two problems in your example:
Restarting container
Updating secret values
In order to keep your secrets up to date, you should consider using secrets as described in Amit Kumar Gupta's comment and mount them as a volume instead of an environment variable; here is an example.
As for the second problem with restarting the container, it depends on the exit code, as described by garlicFrancium.
From another point of view, you can use an init container that waits for new tasks and a main container that processes those tasks according to your requirements, or create a job scheduler.
apiVersion: apps/v1 # updated from the deprecated extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 1
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
        - name: complete
          command:
            - "bash"
          args:
            - "-c"
            - "wa=$(shuf -i 15-30 -n 1) && echo $wa && sleep $wa"
          image: ubuntu
          imagePullPolicy: IfNotPresent
          resources: {}
      initContainers:
        - name: wait-for
          image: ubuntu
          command: ['bash', '-c', 'sleep 30']
      restartPolicy: Always
Please note:
When a secret being already consumed in a volume is updated, projected keys are eventually updated as well. Kubelet is checking whether the mounted secret is fresh on every periodic sync. However, it is using its local cache for getting the current value of the Secret.
The type of the cache is configurable using the (ConfigMapAndSecretChangeDetectionStrategy field in KubeletConfiguration struct). It can be either propagated via watch (default), ttl-based, or simply redirecting all requests to directly kube-apiserver. As a result, the total delay from the moment when the Secret is updated to the moment when new keys are projected to the Pod can be as long as kubelet sync period + cache propagation delay, where cache propagation delay depends on the chosen cache type (it equals to watch propagation delay, ttl of cache, or zero correspondingly).
A container using a Secret as a subPath volume mount will not receive Secret updates.
Please refer also to:
Fine Parallel Processing Using a Work Queue

Is there a way to downscale pods only when message is processed (the pod finished its task) with the HorizontalPodAutoscaler in Kubernetes?

I've set up the Kubernetes Horizontal Pod Autoscaler with custom metrics, using the Prometheus adapter https://github.com/DirectXMan12/k8s-prometheus-adapter. Prometheus is monitoring RabbitMQ, and I'm watching the rabbitmq_queue_messages metric. The messages from the queue are picked up by the pods, which then do some processing that can last for several hours.
Scale-up and scale-down work based on the number of messages in the queue.
The problem:
When a pod finishes the processing and acks the message, that lowers the number of messages in the queue, which can trigger the autoscaler to terminate a pod. If I have multiple pods doing the processing and one of them finishes, then, if I'm not mistaken, Kubernetes could terminate a pod that is still processing its own message. This wouldn't be desirable, as all the processing that pod is doing would be lost.
Is there a way to overcome this, or another way this could be achieved?
Here is the autoscaler configuration:
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: sample-app-rabbitmq
  namespace: monitoring
spec:
  scaleTargetRef:
    # the deployment you created above
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Object
      object:
        target:
          kind: Service
          name: rabbitmq-cluster
        metricName: rabbitmq_queue_messages_ready
        targetValue: 5
You could consider an approach using a preStop hook.
As per the documentation on Container States and defining postStart and preStop handlers:
Before a container enters into Terminated, preStop hook (if any) is executed.
So you can use in your deployment:
lifecycle:
  preStop:
    exec:
      command: ["your script"]
Update:
I would like to provide more information based on some research:
There is an interesting project:
KEDA allows for fine grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
KEDA can run on both the cloud and the edge, integrates natively with Kubernetes components such as the Horizontal Pod Autoscaler, and has no external dependencies.
For the main question "Kubernetes could terminate a pod that is still doing the processing of its own message".
As per documentation:
"Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features"
Deployment is backed by a ReplicaSet. As per this controller code, there is a function getPodsToDelete. In combination with filteredPods, it gives the result: "This ensures that we delete pods in the earlier stages whenever possible."
So, as a proof of concept:
You can create a deployment with an init container. The init container should check whether there is a message in the queue and exit as soon as at least one message appears. This allows the main container to start, take, and process that message. In this case we will have two kinds of pods: those which are processing a message and consuming CPU, and those which are in the starting state, idle and waiting for the next message. The starting pods will be deleted first when the HPA decides to decrease the number of replicas in the deployment.
apiVersion: apps/v1 # updated from the deprecated extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
        - name: complete
          command:
            - "bash"
          args:
            - "-c"
            - "wa=$(shuf -i 15-30 -n 1) && echo $wa && sleep $wa"
          image: ubuntu
          imagePullPolicy: IfNotPresent
          resources: {}
      initContainers:
        - name: wait-for
          image: ubuntu
          command: ['bash', '-c', 'sleep 30']
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
Hope this helps.
The Horizontal Pod Autoscaler is not designed for long-running tasks and will not be a good fit. If you need to spawn one long-running processing task per message, I'd take one of these two approaches (a sketch of the work-queue Job follows this list):
Use a task queue such as Celery. It is designed to solve your exact problem: have a queue of tasks that needs to be distributed to workers, and ensure that the tasks run to completion. Kubernetes even provides an official example of this setup.
If you don't want to introduce another component such as Celery, you can spawn a Kubernetes Job for every incoming message yourself. Kubernetes will make sure that the Job runs to completion at least once, rescheduling the pod if it dies, etc. In this case you will need to write a script that reads RabbitMQ messages and creates Jobs for them yourself.
In both cases, make sure you also have Cluster Autoscaler enabled so that new nodes get automatically provisioned if your current nodes are not sufficient to handle the load.
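For reference, the work-queue pattern from the official example boils down to a Job roughly like this sketch (the name, image, and parallelism are placeholders; the image is assumed to consume messages until the queue is empty and then exit):
apiVersion: batch/v1
kind: Job
metadata:
  name: rabbitmq-worker # hypothetical name
spec:
  parallelism: 4 # number of concurrent workers (assumption)
  template:
    spec:
      containers:
        - name: worker
          image: my-queue-worker:latest # hypothetical image that drains the queue and exits
      restartPolicy: OnFailure # rerun a worker pod if it crashes mid-message
A finished worker exits on its own instead of being chosen for scale-down, so no in-flight message is lost.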

How to create deployments with more than one pod?

I'm hosting an application on the Google Cloud Platform via Kubernetes, and I've managed to set up this continuous deployment pipeline:
Application code is updated
New Docker image is automatically generated
K8s Deployment is automatically updated to use the new image
This works great, except for one issue - the deployment always seems to have only one pod. Because of this, when the next update cycle comes around, the entire application goes down, which is unacceptable.
I've tried modifying the YAML of the deployment to increase the number of replicas, and it works... until the next image update, where it gets reset back to one pod again.
This is the command I use to update the image deployment:
kubectl set image deployment foo-server gcp-cd-foo-server-sha256=gcr.io/project-name/gcp-cd-foo-server:$REVISION_ID
You can use this command if you don't want to edit the deployment YAML file:
kubectl scale deployment foo-server --replicas=2
Also, look at the update strategy with the maxUnavailable and maxSurge properties.
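For example, a rolling-update strategy that keeps the app available during image updates could look like this (the values are illustrative):
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # never take a pod down before its replacement is ready
      maxSurge: 1 # allow one extra pod during the rollout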
In your original deployment.yml file, keep replicas at 2 or more; otherwise you can't avoid downtime if only one pod is running when you re-deploy/upgrade, etc.
Deployment with 3 replicas (example):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.7.9
          ports:
            - containerPort: 80
Deployment can ensure that only a certain number of Pods may be down while they are being updated. By default, it ensures that at least 25% less than the desired number of Pods are up (25% max unavailable). Deployment can also ensure that only a certain number of Pods may be created above the desired number of Pods. By default, it ensures that at most 25% more than the desired number of Pods are up (25% max surge).
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
Never mind, I had just set up my deployments wrong; it had something to do with using the GCP user interface to create the deployments rather than console commands. I created the deployments with kubectl run app --image ... instead, and it works now.