Creating a queue system with Argo Workflows - kubernetes

I am trying to figure out how to set up a work queue with Argo. The Argo Workflows are computationally expensive. We need to plan for many simultaneous requests. The workflow items are added to the work queue via HTTP requests.
The flow can be demonstrated like this:
client
=> hasura # user authentication
=> redis # work queue
=> argo events # queue listener
=> argo workflows
=> redis + hasura # inform that workflow has finished
=> client
I have never built a K8s cluster that exceeds its resources. Where do I limit the execution of workflows? Or do Argo Events and Workflows limit these according to the resources in the cluster?
The above example could probably be simplified to the following, but the problem is what happens if the processing queue is full?
client
=> argo events # HTTP request listener
=> argo workflows

Argo Workflows has no concept of a queue, so it has no way of knowing when the queue is full. If you need queue control, that should happen before submitting workflows.
Once the workflows are submitted, there are a number of ways to limit resource usage.
Pod resources - each Workflow step is represented by a Kubernetes Pod. You can set resource requests and limits just like you would with a Pod in a Deployment.
Step parallelism limit - within a Workflow, you can limit the number of steps running concurrently. This can help when a step is particularly resource-intensive.
Workflow parallelism limit - you can limit the number of workflows running concurrently by configuring them to use a semaphore (all three limits are illustrated in the sketch after this list).
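A minimal sketch combining the three limits above: container requests/limits, a per-Workflow step parallelism cap, and a ConfigMap-backed semaphore so only a fixed number of these Workflows run at once. Resource names, images, and values are illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config            # hypothetical ConfigMap holding the semaphore size
data:
  workflow: "2"                     # at most 2 Workflows may hold this semaphore at a time
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: expensive-job-
spec:
  entrypoint: main
  parallelism: 3                    # step parallelism limit: at most 3 steps run concurrently
  synchronization:
    semaphore:
      configMapKeyRef:              # workflow parallelism limit via the semaphore above
        name: semaphore-config
        key: workflow
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo processing && sleep 60"]
        resources:
          requests:                 # pod resources: what the scheduler reserves
            cpu: 500m
            memory: 512Mi
          limits:                   # pod resources: hard ceiling for the container
            cpu: "1"
            memory: 1Gi
```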
There are a number of other performance optimizations like setting Workflow and Pod TTLs and offloading YAML for large Workflows to a DB instead of keeping them on the cluster.
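For the TTL side of that, a hedged sketch of the relevant fields on a Workflow spec (the DB offloading mentioned above is configured separately in the workflow-controller's configuration, not on the Workflow itself):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cleanup-demo-
spec:
  entrypoint: main
  ttlStrategy:
    secondsAfterCompletion: 300     # delete the Workflow object 5 minutes after it finishes
  podGC:
    strategy: OnPodCompletion       # delete each Pod as soon as it completes
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, done]
```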
As far as I know, there is no way to set a Workflow limit so that Argo will reject additional Workflow submissions until more resources are available. This is a problem if you're worried about Kubernetes etcd filling up with too many Workflow definitions.
To keep from blowing up etcd, you'll need another app of some kind sitting in front of Argo to queue Workflow submissions until more resources become available.

Related

In Kubernetes, are resource quotas a good way to throttle how much CPU and memory is allowed for running jobs at a given time?

Suppose I have an API that allows users to create jobs (V1Jobs) in Kubernetes. What I would like is for the user to be able to submit as many jobs as they want without a failed response from the API but to have Kubernetes queue/throttle the jobs until there are enough available resources in the given namespace. For example, suppose I create a resource quota and specify a limit of 1cpu and 1Gi memory. Then suppose a user submits 100 1cpu/1Gi jobs. I'd like Kubernetes to process one at a time until they are complete. In other words running 100 jobs one at a time. Is creating the resource quota and letting the job-controller/scheduler handle the throttling the right way to go or would there be benefits to handle tracking the cluster usage externally (in an application) and only submit/create the V1Jobs to the API once there is capacity in the namespace?
ResourceQuotas are a good start, limiting the amount of resources that may be used within a namespace, or by resources matching an expression.
It would indeed prevent the scheduler from creating Pods that would exceed your quota limitations. The API would still accept new Job objects posted by clients. If you have N Jobs requesting 1 CPU/1G RAM, while your quota only allows for less than 2CPU/2G RAM to be used, you should see those jobs running sequentially.
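A minimal ResourceQuota sketch for that scenario (namespace and name are illustrative); the API still accepts the Job objects, and the Job controller keeps retrying Pod creation until the quota frees up:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: job-quota
  namespace: batch
spec:
  hard:
    requests.cpu: "1"               # total CPU requests allowed in the namespace
    requests.memory: 1Gi            # total memory requests allowed in the namespace
    limits.cpu: "1"
    limits.memory: 1Gi
```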
Though it could still make sense to track how many pending/running jobs you have in your namespace, as this could show that there are currently too many jobs to run with your current quota configuration. The kube-state-metrics exporter for Prometheus would gather the metrics you need for this; you'd find sample Grafana dashboards and alerting rules built around it.
If there's a risk that some containers would start without proper CPU or memory requests/limits, you could also look into LimitRanges to force some defaults.
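A hedged LimitRange sketch to go with it, so containers that omit requests/limits still get defaults counted against the quota (name, namespace, and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: batch
spec:
  limits:
    - type: Container
      defaultRequest:               # applied when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:                      # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
```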
Resource quotas are applied at the namespace level. For containers, resource requests and limits are available to specify the minimum and maximum CPU and memory, respectively. Setting requests and limits constrains the Job's Pods to the specified CPU and memory range. This way you can schedule more than one Job in a namespace, combining a ResourceQuota at the namespace level with requests and limits on the Jobs themselves.
Also, there is no harm in letting the scheduler take care of not scheduling Jobs when there are no resources available.
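A sketch of such a Job, with requests and limits declared on its container so the namespace quota can account for it (name, namespace, and image are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: quota-aware-job
  namespace: batch
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: alpine:3.19
          command: [sh, -c, "echo working && sleep 30"]
          resources:
            requests:               # counted against the namespace ResourceQuota
              cpu: 500m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
```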

How do you create a message queue service for the scope of a specific Kubernetes job

I have a parallel Kubernetes job with 1 pod per work item (I set parallelism to a fixed number in the job YAML).
All I really need is an ID per pod to know which work item to do, but Kubernetes doesn't support this yet (if there's a workaround I want to know).
Therefore I need a message queue to coordinate between pods. I've successfully followed the example in the Kubernetes documentation here: https://kubernetes.io/docs/tasks/job/coarse-parallel-processing-work-queue/
However, the example there creates a rabbit-mq service. I typically deploy my tasks as a job. I don't know how the lifecycle of a job compares with the lifecycle of a service.
It seems like that example is creating a permanent message queue service. But I only need the message queue to be in existence for the lifecycle of the job.
It's not clear to me if I need to use a service, or if I should be creating the rabbit-mq container as part of my job (and if so how that works with parallelism).

Delay in Kubernetes Job status update when running many jobs in parallel

I have a bit of a unique use-case where I want to run a large number (thousands to tens of thousands) of Kubernetes Jobs at once. Each job consists of a single container, Parallelism 1 and Completions 1, with no side-car or agent. My cluster has plenty of capacity for the resources I'm requesting.
My problem is that the Job status is not transitioning to Complete for a significant period of time when I run many jobs concurrently.
My application submits Jobs and has a watcher on the namespace - as soon as a Job's status transitions to 'succeeded 1', we delete the Job and send information back to the application. The application needs this to happen as soon as possible in order to define and submit subsequent Jobs.
I'm able to submit new Job requests as fast as I want, and Pod scheduling happens without delay, but beyond about one or two hundred concurrent Jobs I get significant delay between a Job's Pod completing and the Job's status updating to Complete. At only around 1,000 jobs in the cluster, it can easily take 5-10 minutes for a Job status to update.
This tells me there is some process in the Kubernetes Control Plane that needs more resources to process Pod completion events more rapidly, or a configuration option that enables it to process more tasks in parallel. However, my system monitoring tools have not yet been able to identify any Control Plane services that are maxing out their available resources while the cluster processes the backlog, and all other operations on the cluster appear to be normal.
My question is - where should I look for system resource or configuration bottlenecks? I don't know enough about Kubernetes to know exactly what components are responsible for updating a Job's status.

Kubernetes quotas queue

I need to queue Kubernetes resources based on Kubernetes quotas.
Sample expected scenario:
a user creates a Kubernetes resource (let's say a simple pod X)
the quota object's resource count is reached, so pod X goes to the Pending state
resources are released (another pod Y is removed), and our pod X starts creating
For now, this scenario will not work due to the quota behavior, which returns 403 FORBIDDEN when there are no free resources left in the quota:
If creating or updating a resource violates a quota constraint, the request will fail with HTTP status code 403 FORBIDDEN with a message explaining the constraint that would have been violated.
Question:
Is there a way to achieve this via native Kubernetes mechanisms?
I was trying to execute pods via Kubernetes Jobs, but each job starts independently and I'm unable to control the execution order. I would like to execute them in a First In, First Out (FIFO) manner.
IMO, if k8s hasn't accepted the resource, how could it manage its lifecycle or execution order?
If I understood your question correctly and it's the same pod being scheduled each time, then your job should be designed in such a way that the order of execution does not matter, because there could be scenarios where one execution has not completed before the next one comes up, or a previous one failed due to some error or a dependent service being unavailable. So the next execution should be able to start from where the last one left off.
You can also look at the work queue pattern, in case it suits your requirements, as explained at https://kubernetes.io/docs/tasks/job/fine-parallel-processing-work-queue/
That can also work if you just want one job to be in execution at a time.
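A hedged sketch of that pattern, assuming a worker image that keeps pulling items until the queue is empty (image and env names are hypothetical): completions is left unset, so the Job finishes once a Pod exits successfully after draining the queue, and parallelism controls how many workers run at a time.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-workers
spec:
  parallelism: 1                    # set to 1 if only one worker should run at a time
  # completions is intentionally unset: the Job is complete when a Pod succeeds
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: example.com/queue-worker:latest   # hypothetical worker image
          env:
            - name: QUEUE_URL                      # hypothetical queue endpoint
              value: amqp://rabbitmq-service:5672
```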
I think running jobs in a predefined order must be managed by external logic. We use a Jenkins Pipeline for that.

Kubernetes dynamic Job scaling

I’m finally dipping my toes in the kubernetes pool and wanted to get some advice on the best way to approach a problem I have:
Tech we are using:
GCP
GKE
GCP Pub/Sub
We need to do bursts of batch processing spread out across a fleet and have decided on the following approach:
New raw data flows in
A node analyses this and breaks the data up into manageable portions which are pushed onto a queue
We have a cluster with Autoscaling On and Min Size ‘0’
A Kubernetes job spins up a pod for each new message on this cluster
When pods can’t pull anymore messages they terminate successfully
The question is:
What is the standard approach for triggering jobs such as this?
Do you create a new job each time or are jobs meant to be long lived and re-run?
I have only seen examples of using a YAML file; however, we would probably want the node which did the portioning of work to create the job, as it knows how many parallel pods should be run. Would it be recommended to use the Python SDK to create the job spec programmatically? Or, if jobs are long-lived, would you simply hit the k8s API, modify the number of parallel pods required, and then re-run the job?
Jobs in Kubernetes are meant to be short-lived and are not designed to be reused. Jobs are designed for run-once, run-to-completion workloads. Typically they are assigned a specific task, i.e. to process a single queue item.
However, if you want to process multiple items in a work queue with a single instance, then it is generally advisable to instead use a Deployment to scale a pool of workers that continue to process items from the queue, scaling the number of workers based on the number of items in the queue. If there are no work items remaining, you can scale the deployment to 0 replicas, scaling back up when there is work to be done.
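A hedged sketch of that worker-pool approach, assuming a worker image that pulls Pub/Sub messages in a loop; the replica count would be adjusted by an external autoscaler or a small script watching the queue backlog (image, env names, and values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker-pool
spec:
  replicas: 0                       # scaled up/down externally based on queue depth
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      containers:
        - name: worker
          image: example.com/pubsub-worker:latest  # hypothetical worker image
          env:
            - name: SUBSCRIPTION                   # hypothetical Pub/Sub subscription name
              value: batch-work-items
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```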
To create and control your workloads in Kubernetes, the best practice would be to use the Kubernetes SDK. While you can generate YAML files and shell out to another tool like kubectl, using the SDK simplifies configuration and error handling, and also allows for simpler introspection of the resources in the cluster.