Is there a way to downscale pods only when message is processed (the pod finished its task) with the HorizontalPodAutoscaler in Kubernetes? - kubernetes

I'v setup Kubernetes Horizontal Pod Autoscaler with custom metrics using the prometheus adapter https://github.com/DirectXMan12/k8s-prometheus-adapter. Prometheus is monitoring rabbitmq, and Im watching the rabbitmq_queue_messages metric. The messages from the queue are picked up by the pods, that then do some processing, which can last for several hours.
The scale-up and scale-down is working based on the number of messages in the queue.
The problem:
When a pod finishes the processing and acks the message, that will lower the num. of messages in the queue, and that would trigger the Autoscaler terminate a pod. If I have multipe pods doing the processing and one of them finishes, if Im not mistaking, Kubernetes could terminate a pod that is still doing the processing of its own message. This wouldnt be desirable as all the processing that the pod is doing would be lost.
Is there a way to overcome this, or another way how this could be acheveed?
here is the Autoscaler configuration:
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
name: sample-app-rabbitmq
namespace: monitoring
spec:
scaleTargetRef:
# you created above
apiVersion: apps/v1
kind: Deployment
name: sample-app
minReplicas: 1
maxReplicas: 10
metrics:
- type: Object
object:
target:
kind: Service
name: rabbitmq-cluster
metricName: rabbitmq_queue_messages_ready
targetValue: 5

You could consider approach using preStop hook.
As per documentation Container States, Define postStart and preStop handlers:
Before a container enters into Terminated, preStop hook (if any) is executed.
So you can use in your deployment:
lifecycle:
preStop:
exec:
command: ["your script"]
### update:
I would like to provide more information due to some research:
There is an interesting project:
KEDA allows for fine grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
KEDA can run on both the cloud and the edge, integrates natively with Kubernetes components such as the Horizontal Pod Autoscaler, and has no external dependencies.
For the main question "Kubernetes could terminate a pod that is still doing the processing of its own message".
As per documentation:
"Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features"
Deployment is backed by Replicaset. As per this controller code there exist function "getPodsToDelete". In combination with "filteredPods" it gives the result: "This ensures that we delete pods in the earlier stages whenever possible."
So as proof of concept:
You can create deployment with init container. Init container should check if there is a message in the queue and exit when at least one message appears. This will allow main container to start, take and process that message. In this case we will have two kinds of pods - those which process the message and consume CPU and those who are in the starting state, idle and waiting for the next message. In this case starting containers will be deleted at the first place when HPA decide to decrease number of replicas in the deployment.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
labels:
app: complete
name: complete
spec:
replicas: 5
revisionHistoryLimit: 10
selector:
matchLabels:
app: complete
template:
metadata:
creationTimestamp: null
labels:
app: complete
spec:
hostname: c1
containers:
- name: complete
command:
- "bash"
args:
- "-c"
- "wa=$(shuf -i 15-30 -n 1)&& echo $wa && sleep $wa"
image: ubuntu
imagePullPolicy: IfNotPresent
resources: {}
initContainers:
- name: wait-for
image: ubuntu
command: ['bash', '-c', 'sleep 30']
dnsPolicy: ClusterFirst
restartPolicy: Always
terminationGracePeriodSeconds: 30
Hope this help.

Horizontal Pod Autoscaler is not designed for long-running tasks, and will not be a good fit. If you need to spawn one long-running processing tasks per message, I'd take one of these two approaches:
Use a task queue such as Celery. It is designed to solve your exact problem: have a queue of tasks that needs to be distributed to workers, and ensure that the tasks run to completion. Kubernetes even provides an official example of this setup.
If you don't want to introduce another component such as Celery, you can spawn a Kubernetes job for every incoming message by yourself. Kubernetes will make sure that the job runs to completion at least once - reschedule the pod if it dies, etc. In this case you will need to write a script that reads RabbitMQ messages and creates jobs for them by yourself.
In both cases, make sure you also have Cluster Autoscaler enabled so that new nodes get automatically provisioned if your current nodes are not sufficient to handle the load.

Related

AKS with Keda: pod are removed during execution

I tried Keda with AKS and I really appreciate when pod are automatically instanciate based on Azure Dev Ops queue job for release & build.
However I noticed something strange and often AKS/Keda remove pod while processing which makes workflow failed.
Message reads
We stopped hearing from agent aks-linux-768d6647cc-ntmh4. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Expected behavior: pods must complete the tasks then Keda/AKS can remove this pod.
I share with you my keda yml file
# deployment.yaml
apiVersion: apps/v1 # The API resource where this workload resides
kind: Deployment # The kind of workload we're creating
metadata:
name: aks-linux # This will be the name of the deployment
spec:
selector: # Define the wrapping strategy
matchLabels: # Match all pods with the defined labels
app: aks-linux # Labels follow the `name: value` template
replicas: 1
template: # This is the template of the pod inside the deployment
metadata: # Metadata for the pod
labels:
app: aks-linux
spec:
nodeSelector:
agentpool: linux
containers: # Here we define all containers
- image: <My image here>
name: aks-linux
env:
- name: "AZP_URL"
value: "<myURL>"
- name: "AZP_TOKEN"
value: "<MyToken>"
- name: "AZP_POOL"
value: "<MyPool>"
resources:
requests: # Minimum amount of resources requested
cpu: 2
memory: 4096Mi
limits: # Maximum amount of resources requested
cpu: 4
memory: 8192Mi
I used latest version of AKS and Keda. Any idea ?
Check the official Keda docs:
When running your agents as a deployment you have no control on which pod gets killed when scaling down.
So, to solve it you need to use ScaledJob:
If you run your agents as a Job, KEDA will start a Kubernetes job for each job that is in the agent pool queue. The agents will accept one job when they are started and terminate afterwards. Since an agent is always created for every pipeline job, you can achieve fully isolated build environments by using Kubernetes jobs.
See there how to implement it.

Kubernetes start pods in batches for a node

For some application,start or restart need more resources than running。for exapmle:es/flink。if a node have network jitter,all pods would restart at the same time in this node。When this happens,cpu usage becomes very high in this node。it would increase resource competition in this node。
now i want to start pods in batches for only one node。how to realize the function now?
Kubernetes have auto-healing
You can let the POD crash and Kubernetes will auto re-start them soon as get the sufficient memory or resource requirement
Or else if you want to put the wait somehow so that deployment wait and gradually start one by one
you can use the sidecar and use the POD lifecycle hooks to start the main container, this process is not best but can resolve your issue.
Basic example :
apiVersion: v1
kind: Pod
metadata:
name: sidecar-starts-first
spec:
containers:
- name: sidecar
image: my-sidecar
lifecycle:
postStart:
exec:
command:
- /bin/wait-until-ready.sh
- name: application
image: my-application
OR
You can also use the Init container to check the other container's health and start the main container POD once one POD is of another service.
Init container
i would also recommend to check the Priority class : https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/

Horizontal Pod Autoscaling (HPA) with an initContainer that requires a Job

I have a specific scenario where I'd like to have a deployment controlled by horizontal pod autoscaling. To handle database migrations in pods when pushing a new deployment, I followed this excellent tutorial by Andrew Lock here.
In short, you must define an initContainer that waits for a Kubernetes Job to complete a process (like running db migrations) before the new pods can run.
This works well, however, I'm not sure how to handle HPA after the initial deployment because if the system detects the need to add another Pod in my node, the initContainer defined in my deployment requires a Job to be deployed and run, but since Jobs are one-off processes, the pod can not initialize and run properly (a ttlSecondsAfterFinished attribute removes the Job anyways).
How can I define an initContainer to run when I deploy my app so I can push my database migrations in a Job, but also allow HPA to control dynamically adding a Pod without needing an initContainer?
Here's what my deployment looks like:
apiVersion: apps/v1
kind: Deployment
metadata:
name: graphql-deployment
spec:
replicas: 1
selector:
matchLabels:
app: graphql-pod
template:
metadata:
labels:
app: graphql-pod
spec:
initContainers:
- name: wait-for-graphql-migration-job
image: groundnuty/k8s-wait-for:v1.4 # This is an image that waits for a process to complete
args:
- job
- graphql-migration-job # this job is defined next
containers:
- name: graphql-container
image: image(graphql):tag(graphql)
The following Job is also deployed
apiVersion: batch/v1
kind: Job
metadata:
name: graphql-migration-job
spec:
ttlSecondsAfterFinished: 30
template:
spec:
containers:
- name: graphql-migration-container
image: image(graphql):tag(graphql)
command: ["npm", "run", "migrate:reset"]
restartPolicy: Never
So basically what happens is:
I deploy these two resources to my node
Job is initialized
initContainer on Pod waits for Job to complete using an image called groundnuty/k8s-wait-for:v1.4
Job completes
initContainer completes
Pod initializes
(after 30 TTL seconds) Job is removed from node
(LOTS OF TRAFFIC)
HPA realizes a need for another pod
initContainer for NEW pod is started, but cant run because Job doesn't exist
...crashLoopBackOff
Would love any insight on the proper way to handle this scenario!
There is, unfortunately, no simple Kubernetes feature to resolve your issue.
I recommend extending your deployment tooling/scripts to separate the migration job and your deployment. During the deploy process, you first execute the migration job and then deploy your deployment. Without the job attached, the HPA can nicely scale your pods.
There is a multitude of ways to achieve this:
Have a bash, etc. script first to execute the job, wait and then update your deployment
Leverage more complex deployment tooling like Helm, which allows you to add a 'pre-install hook' to your job to execute them when you deploy your application

kubernetes - container busy with request, then request should route to another container

I am very new to k8s and docker. But I have task on k8s. Now I stuck with a use case. That is:
If a container is busy with requests. Then incoming request should redirect to another container.
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: twopoddeploy
namespace: twopodns
spec:
selector:
matchLabels:
app: twopod
replicas: 1
template:
metadata:
labels:
app: twopod
spec:
containers:
- name: secondcontainer
image: "docker.io/tamilpugal/angmanualbuild:latest"
env:
- name: "PORT"
value: "24244"
- name: firstcontainer
image: "docker.io/tamilpugal/angmanualbuild:latest"
env:
- name: "PORT"
value: "24243"
service.yaml
apiVersion: v1
kind: Service
metadata:
name: twopodservice
spec:
type: NodePort
selector:
app: twopod
ports:
- nodePort: 31024
protocol: TCP
port: 82
targetPort: 24243
From deployment.yaml, I created a pod with two containers with same image. Because, firstcontainer is not reachable/ busy, then secondcontainer should handles the incoming requests. This our idea and use case. So (only for checking our use case) I delete firstcontainer using docker container rm -f id_of_firstcontainer. Now site is not reachable until, docker recreates the firstcontainer. But I need k8s should redirects the requests to secondcontainer instead of waiting for firstcontainer.
Then, I googled about the solution, I found Ingress and Liveness - Readiness. But Ingress does route the request based on the path instead of container status. Liveness - Readiness also recreates the container. SO also have some question and some are using ngix server. But no luck. That why I create a new question.
So my question is how to configure the two containers to reduce the downtime?
what is keyword for get the solution from the google to try myself?
Thanks,
Pugal.
A service can load-balance between multiple pods. You should delete the second copy of the container inside the deployment spec, but also change it to have replicas: 2. Now your deployment will launch two identical pods, but the service will match both of them, and requests will go to both.
apiVersion: apps/v1
kind: Deployment
metadata:
name: onepoddeploy
namespace: twopodns
spec:
selector: { ... }
replicas: 2 # not 1
template:
metadata: { ... }
spec:
containers:
- name: firstcontainer
image: "docker.io/tamilpugal/angmanualbuild:latest"
env:
- name: "PORT"
value: "24243"
# no secondcontainer
This means if two pods aren't enough to handle your load, you can kubectl scale deployment -n twopodns onepoddeploy --replicas=3 to increase the replica count. If you can tell from CPU utilization or another metric when you're getting to "not enough", you can configure the horizontal pod autoscaler to adjust the replica count for you.
(You will usually want only one container per pod. There's no way to share traffic between containers in the way you're suggesting here, and it's helpful to be able to independently scale components. This is doubly true if you're looking at a stateful component like a database and a stateless component like an HTTP service. Typical uses for multiple containers are things like log forwarders and network proxies, that are somewhat secondary to the main operation of the pod, and can scale or be terminated along with the primary container.)
There are two important caveats to running multiple pod replicas behind a service. The service load balancer isn't especially clever, so if one of your replicas winds up working on intensive jobs and the other is more or less idle, they'll still each get about half the requests. Also, if your pods are configure for HTTP health checks (recommended), if a pod is backed up to the point where it can't handle requests, it will also not be able to answer its health checks, and Kubernetes will kill it off.
You can help Kubernetes here by trying hard to answer all HTTP requests "promptly" (aiming for under 1000 ms always is probably a good target). This can mean returning a "not ready yet" response to a request that triggers a large amount of computation. This can also mean rearranging your main request handler so that an HTTP request thread isn't tied up waiting for some task to complete.

Have kube jobs start on waiting pods

I am working on a scenario where I want to be able to maintain some X number of pods in waiting (and managed by kube) and then upon user request (via some external system) have a kube job start on one of those waiting pods. So now the waiting pods count is X-1 and kube starts another pod to bring this number back to X.
This way I'll be able to cut down on the time taken to create a pod, start a container and getting is ready to start actual processing. The processing data can be sent to those pods via some sort of messaging (akka or rabbitmq).
I think the ReplicationControllers are best place to keep idle pods, but when I create a job how can I specify that I want to be able to use one of the pods that are in waiting and are managed by ReplicationController.
I think I got this to work upto a state on top of which I can build this solution.
So what I am doing is starting a RC with replicas: X (X is the number of idle pods I wish to maintain, usually not a very large number). The pods that it starts have custom label status: idle or something like that. The RC spec.selector has the same custom label value to match with the pods that it manages, so spec.selector.status: idle. When creating this RC, kube ensures that it creates X pods with their status=idle. Somewhat like below:
apiVersion: v1
kind: ReplicationController
metadata:
name: testrc
spec:
replicas: 3
selector:
status: idle
template:
metadata:
name: idlepod
labels:
status: idle
spec:
containers:
...
On the other hand I have a job yaml that has spec.manualSelector: true (and yes I have taken into account that the label set has to be unique). With manualSelector enabled, I can now define selectors on the job like below.
apiVersion: batch/v1
kind: Job
metadata:
generateName: testjob-
spec:
manualSelector: true
selector:
matchLabels:
status: active
...
So clearly, RC creates pods with status=idle and job expects to use pods with status=active because of the selector.
So now whenever I have a request to start a new job, I'll update label on one of the pods managed by RC so that its status=active. The selector on RC will effect the release of this pod from its control and start another one because of replicas: X set on it. And the released pod is no longer controller by RC and is now orphan. Finally, when I create a job, the selector on this job template will match the label of the orphaned pod and this pod will then be controlled by the new job. I'll send messages to this pod that will start the actual processing and finally bring it to complete.
P.S.: Pardon my formatting. I am new here.