How to dynamically scale a service in OpenShift? A challenging scenario - kubernetes

I'm currently trying to deploy a backend API service for my application in OpenShift, and it needs to scale so that each request runs in a new pod.
The service takes 5 minutes to serve a single request.
I have to hit the service 700 times.
Is there a way I can create 700 pods to serve the 700 requests and scale back down to 1 after all the requests are completed?
Start of the application:
1 pod <- 700 requests
Serving:
700 pods, each serving one request
End of the application:
1 pod

Autoscaling in Kubernetes relies on metrics. From what I know, OpenShift supports CPU and memory utilization.
But I don't think this is what you are looking for.
I think you should be looking into Jobs - Run to Completion.
Each request would spawn a new Job, which runs until it is completed.
Example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
This will run a job which computes π to 2000 places and prints it out.
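If the 700 requests can be modeled as 700 work items rather than live HTTP calls, a single Job with completions and parallelism set to 700 will fan out to up to 700 pods and end with none left running. A minimal sketch; the image and command are placeholders, and in practice parallelism may need to be capped to what the cluster can actually schedule:
apiVersion: batch/v1
kind: Job
metadata:
  name: backend-batch                         # hypothetical name
spec:
  completions: 700                            # total number of requests to process
  parallelism: 700                            # how many pods may run at the same time
  backoffLimit: 4
  template:
    spec:
      containers:
      - name: worker
        image: myregistry/backend-worker:latest   # placeholder image
        command: ["process-one-request"]           # placeholder command
      restartPolicy: Never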

Related

AKS with Keda: pods are removed during execution

I tried KEDA with AKS and I really appreciate that pods are automatically instantiated based on the Azure DevOps job queue for builds & releases.
However, I noticed something strange: AKS/KEDA often removes a pod while it is still processing, which makes the workflow fail.
The message reads:
We stopped hearing from agent aks-linux-768d6647cc-ntmh4. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Expected behavior: pods must complete their tasks, and only then should KEDA/AKS remove them.
Here is my KEDA YAML file:
# deployment.yaml
apiVersion: apps/v1 # The API resource where this workload resides
kind: Deployment # The kind of workload we're creating
metadata:
  name: aks-linux # This will be the name of the deployment
spec:
  selector: # Define the wrapping strategy
    matchLabels: # Match all pods with the defined labels
      app: aks-linux # Labels follow the `name: value` template
  replicas: 1
  template: # This is the template of the pod inside the deployment
    metadata: # Metadata for the pod
      labels:
        app: aks-linux
    spec:
      nodeSelector:
        agentpool: linux
      containers: # Here we define all containers
      - image: <My image here>
        name: aks-linux
        env:
        - name: "AZP_URL"
          value: "<myURL>"
        - name: "AZP_TOKEN"
          value: "<MyToken>"
        - name: "AZP_POOL"
          value: "<MyPool>"
        resources:
          requests: # Minimum amount of resources requested
            cpu: 2
            memory: 4096Mi
          limits: # Maximum amount of resources requested
            cpu: 4
            memory: 8192Mi
I am using the latest versions of AKS and KEDA. Any ideas?
Check the official Keda docs:
When running your agents as a deployment you have no control on which pod gets killed when scaling down.
So, to solve it you need to use ScaledJob:
If you run your agents as a Job, KEDA will start a Kubernetes job for each job that is in the agent pool queue. The agents will accept one job when they are started and terminate afterwards. Since an agent is always created for every pipeline job, you can achieve fully isolated build environments by using Kubernetes jobs.
See the KEDA docs there for how to implement it.
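For reference, a rough sketch of what such a ScaledJob could look like for an Azure Pipelines agent is below. The trigger metadata field names should be double-checked against the KEDA azure-pipelines scaler docs, the image/URL/token/pool values are the same placeholders as in the deployment above, and the agent image is assumed to be configured to take a single job and exit:
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: aks-linux-agent
spec:
  jobTargetRef:
    template:
      spec:
        nodeSelector:
          agentpool: linux
        containers:
        - name: aks-linux
          image: <My image here>            # agent image, assumed to run one job and exit
          env:
          - name: "AZP_URL"
            value: "<myURL>"
          - name: "AZP_TOKEN"
            value: "<MyToken>"
          - name: "AZP_POOL"
            value: "<MyPool>"
        restartPolicy: Never
  pollingInterval: 30                       # how often KEDA checks the agent pool queue
  maxReplicaCount: 10                       # upper bound on concurrent agent jobs
  triggers:
  - type: azure-pipelines
    metadata:
      poolName: "<MyPool>"
      organizationURLFromEnv: "AZP_URL"
      personalAccessTokenFromEnv: "AZP_TOKEN"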

How to run a container only once that completes after ~10 minutes of execution via a deployment in Kubernetes

I have just started with Kubernetes.
I need to run a Deployment in Kubernetes with a container that completes execution after ~10-15 minutes.
When I tried it, I found that "restartPolicy: Never" does not work with Deployments (they only allow "restartPolicy: Always").
The reason for opting for a Deployment is to use replicas.
Please provide your input on how I can achieve multiple replicas of my Deployment with a container that completes execution and does not keep running.
You can run a Job as below, where the container runs a sleep command for 15m. After 15 minutes the container will exit and the pod will be terminated.
apiVersion: batch/v1
kind: Job
metadata:
  name: job
spec:
  template:
    spec:
      containers:
      - name: sleep              # a container name is required by the API
        image: bash:5.1.0
        command:
        - sh
        - -c
        - sleep 15m
      restartPolicy: Never
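If the reason for wanting a Deployment was replicas, note that a Job can also run several pods: setting completions and parallelism gives you N copies of the container that each run to completion and then stop. A minimal sketch based on the same sleep container (the counts are arbitrary examples):
apiVersion: batch/v1
kind: Job
metadata:
  name: job-parallel
spec:
  completions: 3      # run the container to completion 3 times in total
  parallelism: 3      # run all 3 pods at the same time
  template:
    spec:
      containers:
      - name: sleep
        image: bash:5.1.0
        command: ["sh", "-c", "sleep 15m"]
      restartPolicy: Never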

How to get 1 initContainer when parallelism is set > 1

I have a job in Kubernetes as defined below (with some omissions for brevity). Parallelism and N completions are set. I want 1 init container to delay the start of the parallel containers. When I run it as-is, it appears I get one init container per completion.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  parallelism: 10
  completions: 10
  backoffLimit: 0
  template:
    spec:
      # I want 1 of these to run first
      initContainers:
      - name: init-container
        image: my/container:latest
        imagePullPolicy: Always
        command: ["init_script.sh"]
      # I want 10 of these to run in parallel once init_script.sh exits
      containers:
      - name: container
        image: my/container:latest
        imagePullPolicy: Always
        command: ["run_job.sh"]
      restartPolicy: Never
Kubernetes Jobs are very basic and don't provide any advanced scheduling mechanism with scheduling dependencies, for example the ability to start a job after another job has finished. My advice is to use a more advanced scheduling tool on top of Kubernetes. There are a few open-source options you can use, for example:
Volcano - for example, you can use the TaskCompleted event.
Airflow - you can schedule pods directly from the tool and build your DAG (Directed Acyclic Graph).
Etc.
Two other alternatives:
You can create your own custom scheduler.
You can build your own operator that kicks off your regular jobs based on status.
✌️
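If bringing in an extra scheduler is too heavy, one plain-Kubernetes workaround is to split the init step into its own one-off Job and have every worker pod wait for it in an init container. A sketch, assuming a bitnami/kubectl helper image and a ServiceAccount with RBAC permission to read Job status (both are assumptions, not part of the original question):
# One-off setup Job: runs init_script.sh exactly once
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-init
spec:
  template:
    spec:
      containers:
      - name: init
        image: my/container:latest
        command: ["init_script.sh"]
      restartPolicy: Never
---
# Parallel worker Job: each pod waits for the setup Job before starting
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  parallelism: 10
  completions: 10
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: job-watcher       # assumed ServiceAccount allowed to "get" Jobs
      initContainers:
      - name: wait-for-init
        image: bitnami/kubectl:latest       # assumed helper image
        command: ["kubectl", "wait", "--for=condition=complete", "job/my-job-init", "--timeout=600s"]
      containers:
      - name: container
        image: my/container:latest
        command: ["run_job.sh"]
      restartPolicy: Never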

How to specify two pods in the same job in a kubernetes yaml file?

I'm trying to do something simple, just create two pods within a job. I'm looking at the documentation here: https://kubernetes.io/docs/concepts/workloads/controllers/job/#single-job-starts-controller-pod
While the documentation discusses parallelization, it doesn't give much in the way of examples. The only example given uses one pod:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
To create two pods I attempted effectively this:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi1
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
    spec:
      containers:
      - name: pi2
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
But that didn't get me two pods; instead, it appears that only the second container ran in one pod.
It's not clear to me how I get my job to launch multiple pods. My pods don't need to run on the same machine, but each needs a unique environment variable set so the pods know what part of the work to do. The work is divided in an embarrassingly parallel way with a fixed number of pods (2 in this example).
You cannot have two different pod templates in the same job. From the docs:
Parallel Jobs with a work queue:
- do not specify .spec.completions, default to .spec.parallelism.
- the Pods must coordinate amongst themselves or an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
- each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
- when any Pod from the Job terminates with success, no new Pods are created.
- once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
- once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
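Besides the work-queue pattern, if each pod only needs to know which part of the work it owns, a single Job in Indexed completion mode (supported on reasonably recent Kubernetes versions) may be enough: every pod automatically receives a distinct JOB_COMPLETION_INDEX environment variable that the command can branch on. A sketch:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-indexed
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed     # pods get JOB_COMPLETION_INDEX set to 0 or 1
  backoffLimit: 4
  template:
    spec:
      containers:
      - name: pi
        image: perl
        # branch on the index to decide which part of the work this pod does
        command: ["sh", "-c", "echo working on part $JOB_COMPLETION_INDEX && perl -Mbignum=bpi -wle 'print bpi(2000)'"]
      restartPolicy: Never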

Is there a way to downscale pods only when message is processed (the pod finished its task) with the HorizontalPodAutoscaler in Kubernetes?

I've set up the Kubernetes Horizontal Pod Autoscaler with custom metrics using the Prometheus adapter https://github.com/DirectXMan12/k8s-prometheus-adapter. Prometheus is monitoring RabbitMQ, and I'm watching the rabbitmq_queue_messages metric. The messages from the queue are picked up by the pods, which then do some processing that can last for several hours.
Scale-up and scale-down work based on the number of messages in the queue.
The problem:
When a pod finishes the processing and acks the message, that lowers the number of messages in the queue, which would trigger the autoscaler to terminate a pod. If I have multiple pods doing the processing and one of them finishes, then, if I'm not mistaken, Kubernetes could terminate a pod that is still processing its own message. This wouldn't be desirable, as all the processing that pod has done would be lost.
Is there a way to overcome this, or another way this could be achieved?
Here is the autoscaler configuration:
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: sample-app-rabbitmq
  namespace: monitoring
spec:
  scaleTargetRef:
    # you created above
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: rabbitmq-cluster
      metricName: rabbitmq_queue_messages_ready
      targetValue: 5
You could consider an approach using a preStop hook.
As per the documentation on Container States and defining postStart and preStop handlers:
Before a container enters Terminated, the preStop hook (if any) is executed.
So you can use this in your deployment:
lifecycle:
  preStop:
    exec:
      command: ["your script"]
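Keep in mind that a preStop hook only delays termination for as long as terminationGracePeriodSeconds allows, so for processing that lasts hours the grace period has to be raised as well. A sketch of how this could look in the worker pod spec (the image and wait script are placeholders):
spec:
  terminationGracePeriodSeconds: 21600     # allow up to 6 hours for in-flight work
  containers:
  - name: worker
    image: my/worker:latest                # placeholder image
    lifecycle:
      preStop:
        exec:
          # placeholder script that blocks until the current message is fully processed
          command: ["sh", "-c", "/scripts/wait-for-current-message.sh"]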
Update:
I would like to provide more information based on some research.
There is an interesting project:
KEDA allows for fine grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
KEDA can run on both the cloud and the edge, integrates natively with Kubernetes components such as the Horizontal Pod Autoscaler, and has no external dependencies.
For the main question, "Kubernetes could terminate a pod that is still doing the processing of its own message":
As per documentation:
"Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features"
A Deployment is backed by a ReplicaSet. As per this controller code, there exists a function "getPodsToDelete". In combination with "filteredPods" it gives the result: "This ensures that we delete pods in the earlier stages whenever possible."
So, as a proof of concept:
You can create a deployment with an init container. The init container should check if there is a message in the queue and exit when at least one message appears. This allows the main container to start, take and process that message. In this case we will have two kinds of pods: those which are processing a message and consuming CPU, and those which are in the starting state, idle and waiting for the next message. The starting pods will be deleted first when the HPA decides to decrease the number of replicas in the deployment.
apiVersion: apps/v1    # updated from extensions/v1beta1, which is no longer served
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
      - name: complete
        command:
        - "bash"
        args:
        - "-c"
        - "wa=$(shuf -i 15-30 -n 1) && echo $wa && sleep $wa"
        image: ubuntu
        imagePullPolicy: IfNotPresent
        resources: {}
      initContainers:
      # stands in for "wait until a message is in the queue" in this proof of concept
      - name: wait-for
        image: ubuntu
        command: ['bash', '-c', 'sleep 30']
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
Hope this helps.
Horizontal Pod Autoscaler is not designed for long-running tasks, and will not be a good fit. If you need to spawn one long-running processing task per message, I'd take one of these two approaches:
Use a task queue such as Celery. It is designed to solve your exact problem: have a queue of tasks that needs to be distributed to workers, and ensure that the tasks run to completion. Kubernetes even provides an official example of this setup.
If you don't want to introduce another component such as Celery, you can spawn a Kubernetes job for every incoming message by yourself. Kubernetes will make sure that the job runs to completion at least once - reschedule the pod if it dies, etc. In this case you will need to write a script that reads RabbitMQ messages and creates jobs for them by yourself.
In both cases, make sure you also have Cluster Autoscaler enabled so that new nodes get automatically provisioned if your current nodes are not sufficient to handle the load.
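If you do adopt KEDA (mentioned in the other answer), a ScaledJob with the rabbitmq trigger combines both suggestions: KEDA watches the queue and creates a Kubernetes Job per pending message, and each Job then runs to completion instead of being scaled down mid-processing. A sketch; the trigger metadata and connection details are assumptions to verify against the KEDA RabbitMQ scaler docs:
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: sample-app-consumer
  namespace: monitoring
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: consumer
          image: my/consumer:latest          # placeholder image; consumes one message, then exits
          env:
          - name: RABBITMQ_HOST
            value: "amqp://user:password@rabbitmq-cluster.monitoring:5672"   # placeholder connection string
        restartPolicy: Never
  pollingInterval: 30
  maxReplicaCount: 10
  triggers:
  - type: rabbitmq
    metadata:
      queueName: sample-queue                # placeholder queue name
      mode: QueueLength
      value: "1"                             # aim for one Job per pending message
      hostFromEnv: RABBITMQ_HOST             # reads the AMQP connection string from the container env above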