AKS with KEDA: pods are removed during execution - azure-devops

I tried KEDA with AKS, and I really appreciate that pods are automatically instantiated based on the Azure DevOps job queue for releases and builds.
However, I noticed something strange: AKS/KEDA often removes a pod while it is still processing a job, which makes the workflow fail.
The message reads:
We stopped hearing from agent aks-linux-768d6647cc-ntmh4. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Expected behavior: a pod must complete its task before KEDA/AKS removes it.
Here is my KEDA YAML file:
# deployment.yaml
apiVersion: apps/v1 # The API resource where this workload resides
kind: Deployment # The kind of workload we're creating
metadata:
  name: aks-linux # This will be the name of the deployment
spec:
  selector: # Define the wrapping strategy
    matchLabels: # Match all pods with the defined labels
      app: aks-linux # Labels follow the `name: value` template
  replicas: 1
  template: # This is the template of the pod inside the deployment
    metadata: # Metadata for the pod
      labels:
        app: aks-linux
    spec:
      nodeSelector:
        agentpool: linux
      containers: # Here we define all containers
        - image: <My image here>
          name: aks-linux
          env:
            - name: "AZP_URL"
              value: "<myURL>"
            - name: "AZP_TOKEN"
              value: "<MyToken>"
            - name: "AZP_POOL"
              value: "<MyPool>"
          resources:
            requests: # Minimum amount of resources requested
              cpu: 2
              memory: 4096Mi
            limits: # Maximum amount of resources requested
              cpu: 4
              memory: 8192Mi
I am using the latest versions of AKS and KEDA. Any ideas?

Check the official KEDA docs:
When running your agents as a deployment you have no control on which pod gets killed when scaling down.
So, to solve it you need to use ScaledJob:
If you run your agents as a Job, KEDA will start a Kubernetes job for each job that is in the agent pool queue. The agents will accept one job when they are started and terminate afterwards. Since an agent is always created for every pipeline job, you can achieve fully isolated build environments by using Kubernetes jobs.
See the KEDA documentation for how to implement it.
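As a rough sketch of what that could look like for the Deployment in the question, the agent can be wrapped in a ScaledJob with the azure-pipelines trigger. The resource name and the poolID value are hypothetical, the placeholders are the ones from the question, and the sketch assumes the agent image exits after a single pipeline job (for example by starting the agent with --once):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: aks-linux-scaledjob          # hypothetical name
spec:
  jobTargetRef:
    template:
      spec:
        nodeSelector:
          agentpool: linux
        restartPolicy: Never         # a finished agent pod must not be restarted
        containers:
          - name: aks-linux
            image: <My image here>   # assumed to run exactly one pipeline job, then exit
            env:
              - name: AZP_URL
                value: "<myURL>"
              - name: AZP_TOKEN
                value: "<MyToken>"
              - name: AZP_POOL
                value: "<MyPool>"
  pollingInterval: 30                # seconds between queue checks
  maxReplicaCount: 10                # upper bound on concurrent agent jobs
  triggers:
    - type: azure-pipelines
      metadata:
        poolID: "1"                  # hypothetical; use the numeric ID of your agent pool
        organizationURLFromEnv: "AZP_URL"
        personalAccessTokenFromEnv: "AZP_TOKEN"
```

Because each queued pipeline job gets its own Kubernetes Job, scaling down never has to pick a running agent to kill; a pod simply ends when its job ends.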

Related

What is the relationship between scheduling policies and scheduling Configuration in k8s

The k8s scheduling implementation comes in two forms: Scheduling Policies and Scheduling Profiles.
What is the relationship between the two? They seem to overlap but have some differences. For example, NodeUnschedulable exists in the profiles but not in the policies, while CheckNodePIDPressure exists in the policies but not in the profiles.
In addition, there is a default startup option in the scheduling configuration, but it is not specified in the scheduling policy. How can I know about the default startup policy?
I really appreciate any help with this.
The difference is simple: scheduling policies are the legacy mechanism and scheduling profiles are their replacement. Policies were superseded starting around v1.19 (the Kubernetes documentation lists them as unsupported since v1.23). Kubernetes v1.19 supports configuring multiple scheduling profiles with a single scheduler, and we use this to define a bin-packing scheduling profile in all v1.19 clusters by default.
To use that scheduling policy, all that is required is to specify the scheduler name bin-packing-scheduler in the Pod spec. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      schedulerName: bin-packing-scheduler
      containers:
        - name: nginx
          image: nginx:1.17.8
          resources:
            requests:
              cpu: 200m
The pods of this deployment will be scheduled onto the nodes which already have the highest resource utilisation, to optimise for autoscaling or ensuring efficient pod placement when mixing large and small pods in the same cluster.
If a scheduler name is not specified then the default spreading algorithm will be used to distribute pods across all nodes.
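For context, a profile with that name might be declared in the scheduler's configuration file roughly as follows. This is only a sketch: the apiVersion assumes a recent cluster (older releases used beta API versions whose plugin layout may differ), and the resource weights are illustrative rather than taken from the answer.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  # Default profile: used by pods that do not set spec.schedulerName.
  - schedulerName: default-scheduler
  # Additional profile: pods opt in with schedulerName: bin-packing-scheduler.
  - schedulerName: bin-packing-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated      # prefer nodes that are already highly utilised
            resources:
              - name: cpu
                weight: 1            # illustrative weights
              - name: memory
                weight: 1
```

Both profiles run inside the same kube-scheduler process; a profile that lists no plugins of its own keeps the default plugin set.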

Horizontal Pod Autoscaling (HPA) with an initContainer that requires a Job

I have a specific scenario where I'd like to have a deployment controlled by horizontal pod autoscaling. To handle database migrations in pods when pushing a new deployment, I followed this excellent tutorial by Andrew Lock here.
In short, you must define an initContainer that waits for a Kubernetes Job to complete a process (like running db migrations) before the new pods can run.
This works well; however, I'm not sure how to handle HPA after the initial deployment. If the system detects the need to add another Pod, the initContainer defined in my deployment requires the Job to be deployed and run, but since Jobs are one-off processes, the new pod cannot initialize and run properly (a ttlSecondsAfterFinished attribute removes the Job anyway).
How can I define an initContainer to run when I deploy my app so I can push my database migrations in a Job, but also allow HPA to control dynamically adding a Pod without needing an initContainer?
Here's what my deployment looks like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: graphql-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: graphql-pod
  template:
    metadata:
      labels:
        app: graphql-pod
    spec:
      initContainers:
        - name: wait-for-graphql-migration-job
          image: groundnuty/k8s-wait-for:v1.4 # This is an image that waits for a process to complete
          args:
            - job
            - graphql-migration-job # this job is defined next
      containers:
        - name: graphql-container
          image: image(graphql):tag(graphql)
The following Job is also deployed:
apiVersion: batch/v1
kind: Job
metadata:
  name: graphql-migration-job
spec:
  ttlSecondsAfterFinished: 30
  template:
    spec:
      containers:
        - name: graphql-migration-container
          image: image(graphql):tag(graphql)
          command: ["npm", "run", "migrate:reset"]
      restartPolicy: Never
So basically what happens is:
1. I deploy these two resources to my node
2. Job is initialized
3. initContainer on Pod waits for Job to complete using an image called groundnuty/k8s-wait-for:v1.4
4. Job completes
5. initContainer completes
6. Pod initializes
7. (after 30 TTL seconds) Job is removed from node
8. (LOTS OF TRAFFIC)
9. HPA realizes a need for another pod
10. initContainer for NEW pod is started, but can't run because the Job doesn't exist
11. ...CrashLoopBackOff
Would love any insight on the proper way to handle this scenario!
There is, unfortunately, no simple Kubernetes feature to resolve your issue.
I recommend extending your deployment tooling/scripts to separate the migration job and your deployment. During the deploy process, you first execute the migration job and then deploy your deployment. Without the job attached, the HPA can nicely scale your pods.
There is a multitude of ways to achieve this:
Have a script (bash, etc.) that first executes the job, waits for it to finish, and then updates your deployment
Leverage more complex deployment tooling like Helm, which allows you to add a 'pre-install hook' to your job so that it is executed when you deploy your application
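For the Helm route, a minimal sketch of the migration Job from the question turned into a hook could look like this. The annotations are standard Helm hook annotations; adding pre-upgrade next to pre-install and the chosen delete policy are assumptions about what fits this setup, not something the answer prescribes.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: graphql-migration-job
  annotations:
    # Run before the chart's other resources on install and (assumption) on upgrade.
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"
    # Remove the previous hook Job before a new run and after a successful one.
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      containers:
        - name: graphql-migration-container
          image: image(graphql):tag(graphql)   # placeholder from the question
          command: ["npm", "run", "migrate:reset"]
      restartPolicy: Never
```

With the migration handled by the hook, the wait-for-graphql-migration-job initContainer can be dropped from the Deployment, so pods added later by the HPA start without depending on a Job that no longer exists.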

How to do controlled rollout using Kubernetes deployment

We have 1000 store nodes and need to deploy an application image on every kubernetes node by rolling out in the below order and would like to specify the deployment node details during the deployment. Is there a way to specify node details in the command line when we execute kubectl create or apply deployment commands?
This application image would be configured to store/node specific details during container/POD creation.
1 node on day 1,
10 nodes on day 2,
100 nodes on day 3, etc.
Answering the question from the title:
How to do controlled rollout using Kubernetes deployment
You can create a Deployment whose manifest contains specific fields that configure how Kubernetes handles it.
With fields such as podAntiAffinity and requiredDuringSchedulingIgnoredDuringExecution, you can ensure that Kubernetes spreads the Pods across the cluster Nodes. You can read more about it in the documentation below:
Kubernetes.io: Docs: Concepts: Scheduling eviction: Assign Pod to Node
Keeping in mind the following rollout schedule:
DAY | REPLICAS_COUNT
1   | 1
2   | 10
3   | 100
4   | 1000
You could use CI/CD tools (for example, Jenkins) to roll out (change) the number of replicas of your Deployment on a specific schedule.
You could create a Jenkins pipeline with a deploy stage that runs your own command with its own schedule (or delay).
An example of a Deployment that could be used with Jenkins is the following:
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: ${REPLICAS_COUNT}
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: nginx
          image: nginx
EOF
This Deployment will assign Pods only to Nodes that do not already run a replica of this Deployment (i.e. 1 Pod = 1 Node). If the number of Pods exceeds the number of Nodes, the extra Pods will remain in the Pending state.
Additional resources:
Jenkins.io: Doc: Pipeline: Tour: Environment
Kubernetes.io: Docs: Concepts: Workloads: Controllers: Deployment

kubernetes imagePullPolicy:Always is not pulling image automatically

I want Kubernetes to automatically pull the new image every time I create a new image with the tag latest. I added imagePullPolicy: Always to the pod spec, but it doesn't replace the old image with the new one.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node
  namespace: dev
  labels:
    app: my-node-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-node-app
  template:
    metadata:
      labels:
        app: my-node-app
    spec:
      hostNetwork: true
      securityContext:
        fsGroup: 1000
      containers:
        - name: node
          imagePullPolicy: Always
          image: gcr.io/my-repo/my-node-app:latest
          ports:
            - containerPort: 3000
          envFrom:
            - configMapRef:
                name: my-configmap
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 2
              memory: 8Gi
      restartPolicy: Always
imagePullPolicy is only taken into account by Kubernetes when a POD is created or re-started. It is NOT taken into account while a POD is running, which means it does NOT check for image updates at any time while a POD is running.
Even if another POD with the same image would be scheduled onto the same Kubernetes node, the already running POD is not affected, even though Kubernetes does a pull and then uses the new image for the new POD.
If you want the desired functionality, you will have to implement your own solution. You could do this by implementing a sidecar that regularly checks the Docker Repository for changes to the given tag. When it detects such a change, it could trigger a restart of the POD, which would then force the image to be re-pulled.
A restart of the POD can be triggered either by simply exiting the sidecar or by using the Kubernetes API inside the sidecar. The latter solution, however, gets more complicated, as you will also need service accounts and RBAC rules to get proper permissions inside the sidecar container. It also has security implications, because you would have to give the whole POD escalated permissions.
Setting imagePullPolicy: Always does not mean an image will be pulled automatically without any trigger.
I would recommend using image tags that follow semver. Since you are using a Deployment, you can perform a rolling update, which pulls the new image and rolls the change out across the replica pods one by one, gracefully and without downtime.
Let's say the image is initially gcr.io/my-repo/my-node-app:v1 and you want to update it to v2:
kubectl set image deployment/node node=gcr.io/my-repo/my-node-app:v2 --record
Check the rollout history
kubectl rollout history deployment.v1.apps/node
In case of any issue rollback to previous version
kubectl rollout undo deployment.v1.apps/node
Also if you want to be more advanced you could do GitOps using FluxCD which supports triggering a rollout automatically whenever a new version of an image is pushed to a container registry.
Kubernetes pulls an image only upon Pod creation, which means it does not check for image updates while a POD is running.
I would recommend using Semantic Versioning for the image tag and a CI/CD pipeline that builds, tags, and pushes the image to your registry. Then use a CD tool such as Keel to re-create your pods in the last step of the pipeline.

Is there a way to downscale pods only when message is processed (the pod finished its task) with the HorizontalPodAutoscaler in Kubernetes?

I've set up the Kubernetes Horizontal Pod Autoscaler with custom metrics using the prometheus adapter https://github.com/DirectXMan12/k8s-prometheus-adapter. Prometheus is monitoring rabbitmq, and I'm watching the rabbitmq_queue_messages metric. The messages from the queue are picked up by the pods, which then do some processing that can last for several hours.
The scale-up and scale-down is working based on the number of messages in the queue.
The problem:
When a pod finishes the processing and acks the message, the number of messages in the queue drops, and that triggers the Autoscaler to terminate a pod. If I have multiple pods doing the processing and one of them finishes, then, if I'm not mistaken, Kubernetes could terminate a pod that is still processing its own message. This wouldn't be desirable, as all the processing that the pod has done would be lost.
Is there a way to overcome this, or another way this could be achieved?
Here is the Autoscaler configuration:
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: sample-app-rabbitmq
  namespace: monitoring
spec:
  scaleTargetRef:
    # you created above
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Object
      object:
        target:
          kind: Service
          name: rabbitmq-cluster
        metricName: rabbitmq_queue_messages_ready
        targetValue: 5
You could consider an approach using a preStop hook.
As per the documentation (Container States; Define postStart and preStop handlers):
Before a container enters into Terminated, preStop hook (if any) is executed.
So you can use this in your deployment:
lifecycle:
  preStop:
    exec:
      command: ["your script"]
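One caveat is that a preStop hook only delays termination for as long as terminationGracePeriodSeconds allows; once that period expires, the container is killed regardless. A sketch for a worker whose tasks can run for hours might therefore look like the fragment below. The image name, the /tmp/busy marker file, and the four-hour value are all hypothetical, and the sketch assumes the worker creates the marker while it processes a message and removes it afterwards.

```yaml
spec:
  template:
    spec:
      # Assumption: 4 hours is longer than the slowest task.
      terminationGracePeriodSeconds: 14400
      containers:
        - name: worker
          image: example/worker:1.0      # hypothetical image
          lifecycle:
            preStop:
              exec:
                # Block termination while the (hypothetical) busy marker exists.
                command:
                  - /bin/sh
                  - -c
                  - while [ -f /tmp/busy ]; do sleep 5; done
```

This does not stop the autoscaler from selecting a busy pod; it only gives that pod time to finish its current message before it is removed.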
Update:
I would like to provide more information after some research.
There is an interesting project, KEDA:
KEDA allows for fine grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
KEDA can run on both the cloud and the edge, integrates natively with Kubernetes components such as the Horizontal Pod Autoscaler, and has no external dependencies.
For the main question "Kubernetes could terminate a pod that is still doing the processing of its own message".
As per documentation:
"Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features"
A Deployment is backed by a ReplicaSet. In the ReplicaSet controller code there is a function "getPodsToDelete". In combination with "filteredPods" it gives the result: "This ensures that we delete pods in the earlier stages whenever possible."
So as proof of concept:
You can create a deployment with an init container. The init container should check whether there is a message in the queue and exit when at least one message appears. This allows the main container to start, take that message, and process it. We then have two kinds of pods: those that are processing a message and consuming CPU, and those that are still in the starting state, idle and waiting for the next message. The starting pods will be deleted first when the HPA decides to decrease the number of replicas in the deployment.
apiVersion: apps/v1 # extensions/v1beta1 is no longer served for Deployments on current clusters
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
        - name: complete
          command:
            - "bash"
          args:
            - "-c"
            - "wa=$(shuf -i 15-30 -n 1)&& echo $wa && sleep $wa"
          image: ubuntu
          imagePullPolicy: IfNotPresent
          resources: {}
      initContainers:
        - name: wait-for
          image: ubuntu
          command: ['bash', '-c', 'sleep 30']
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
Hope this helps.
The Horizontal Pod Autoscaler is not designed for long-running tasks and will not be a good fit. If you need to spawn one long-running processing task per message, I'd take one of these two approaches:
Use a task queue such as Celery. It is designed to solve your exact problem: have a queue of tasks that needs to be distributed to workers, and ensure that the tasks run to completion. Kubernetes even provides an official example of this setup.
If you don't want to introduce another component such as Celery, you can spawn a Kubernetes job for every incoming message by yourself. Kubernetes will make sure that the job runs to completion at least once - reschedule the pod if it dies, etc. In this case you will need to write a script that reads RabbitMQ messages and creates jobs for them by yourself.
In both cases, make sure you also have Cluster Autoscaler enabled so that new nodes get automatically provisioned if your current nodes are not sufficient to handle the load.
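If writing the job-spawning script yourself is not appealing, the KEDA project mentioned in the other answer offers a ScaledJob resource with a rabbitmq trigger that creates Jobs from the queue length. The following is only a sketch of that alternative: the resource name, image, queue name, and connection string are hypothetical, and it assumes the worker image consumes exactly one message and then exits.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: sample-app-worker            # hypothetical name
  namespace: monitoring
spec:
  jobTargetRef:
    backoffLimit: 2                  # retry a failed task a couple of times
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker
            image: example/sample-app-worker:1.0   # hypothetical; handles one message, then exits
            env:
              - name: RABBITMQ_URL
                value: amqp://user:password@rabbitmq-cluster:5672/   # placeholder credentials
  pollingInterval: 30                # seconds between queue checks
  maxReplicaCount: 10                # matches maxReplicas in the question's HPA
  triggers:
    - type: rabbitmq
      metadata:
        hostFromEnv: RABBITMQ_URL
        queueName: tasks             # hypothetical queue name
        mode: QueueLength
        value: "1"                   # aim for one Job per waiting message
```

Since each Job runs to completion on its own, a drop in queue length never causes a running worker to be terminated.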