How to structure an elastic Azure Batch application? - azure-batch

I am evaluating Batch for a project and, while it seems like it will do what I'm looking for, I'm not sure if what I am assuming is really correct.
I have what is basically a job runner from a queue. The current solution works but when the pool of nodes scales down, it just blindly kills off machines. I am looking for something that, when scaling down, will allow currently-running jobs to complete and then remove the node(s) from the pool. I also want to preemptively increase the pool size if a spike is likely to occur (and not have those nodes shut down). I can adjust the pool size externally if that makes sense (seems like the best option so far).
My current idea is to have one pool with one job & task per node, and that task listens to a queue in a loop for messages and processes them. After an iteration count and/or time limit, it shuts down, removing that node from the pool. If the pool size didn't change, I would like to replace that node with a new one. If the pool was shrunk, it should just go away. If the pool size increases, new nodes should run and start up the task.
I'm not planning on running something that continually add pools, or nodes to the pool, or tasks to a job, though I will probably have something that sets the pool size periodically based on queue length or something similar. What I would rather not do is something like "there are 10 things in the queue, add a pool with x nodes, then delete it".
Is this possible or are my expectations incorrect? So far, from reading the docs, it seems like it should be doable, and I have a simple task working, but I'm not sure about the scaling mechanics or exactly how to structure the tasks/jobs/pools.

Here's one possible way to lean into the strengths of Azure Batch and achieve what you've described.
Create your job with a JobManagerTask that monitors your queue for incoming work and adds a new Batch Task for each item of your workload. Each task will process a single piece of work, then exit.
The Batch Scheduler will then take care of allocating tasks to compute nodes. It can also take care of retrying tasks that fail, and so on.
Configure your pool with an AutoScale formula to dynamically resize your pool to meet your load. Your formula can specify taskcompletion to ensure tasks get to complete before any one compute node is removed.
If your workload peaks are predictable (say, 9am every day) your AutoScale expression could scale up your pool in anticipation. If those spikes are not predicable, your external monitoring (or your JobManager) can change the AutoScale expression at any time to suit.
If appropriate, your job manager can terminate once all the required tasks have been added; set onAllTasksComplete to terminatejob, ensuring your job is completed after all your tasks have finished.
A single pool can process tasks from multiple jobs, so if you have multiple concurrent workloads, they could share the same pool. You can give jobs different values for priority if you want certain jobs to be processed first.

Related

Handle replication lag in data pipeline

Given a pipeline of tasks that run in sequence. Each task consumes data from the database, manipulate it, and produces (write) to the same database.
We are using AWS RDS Aurora, and in order to spread the load, the “reading phase” of each task is done within the read replica.
In some cases of high loads, we reach replication lag of 10-15 seconds. This means that by the time the new task consume data, it gets wrong/missing data points.
We know this is not the “right” way to design such pipeline, and it contradicts the idiom “Do not communicate by sharing memory; instead, share memory by communicating”.
Since it’s too much effort to change the design now, we come up with alternative solution:
Create a service that check replication lag value and expose it to all tasks. If the value is greater than x, task will fallback to read from RDS master node.
This is not optimal, and I would like to hear other solution to bypass this issue.
It is worth mentioning that we are using Celery (& Python) to construct such workflow and each task is unaware of the tasks that ran previously.
There will always be data which is inserted into the database but not yet visible, either because it wasn't committed yet, it was committed after your snapshot was started, or due to replication lag. The only real solution is to make your tasks robust to this inevitability.
Create a service that check replication lag value and expose it to all tasks. If the value is greater than x, task will fallback to read from RDS master node.
You want to shed load from the master until the first sign of trouble, then you want to suddenly dump all the load back onto it?
Create a service that check replication lag value and expose it to all tasks. If the value is greater than x, task will fallback to read from RDS master node
Depending on the cause of your replication lag this might make things worse due to further increasing the load on the master node.
If your pipeline allows it you could wait in Task A, after write, until the data propagated to the read replica.

Druid Cluster going into Restricted Mode

We have a Druid Cluster with the following specs
3X Coordinators & Overlords - m5.2xlarge
6X Middle Managers(Ingest nodes with 5 slots) - m5d.4xlarge
8X Historical - i3.4xlarge
2X Router & Broker - m5.2xlarge
Cluster often goes into Restricted mode
All the calls to the Cluster gets rejected with a 502 error.
Even with 30 available slots for the index-parallel tasks, cluster only runs 10 at time and the other tasks are going into waiting state.
Loader Task submission time has been increasing monotonically from 1s,2s,..,6s,..10s(We submit a job to load the data in S3), after
recycling the cluster submission time decreases and increases again
over a period of time
We submit around 100 jobs per minute but we need to scale it to 300 to catchup with our current incoming load
Cloud someone help us with our queries
Tune the specs of the cluster
What parameters to be optimized to run maximum number of tasks in parallel without increasing the load on the master nodes
Why is the loader task submission time increasing, what are the parameters to be monitored here
At 100 jobs per minute, this is probably why the overlord is being overloaded.
The overlord initiates a job by communicating with the middle managers across the cluster. It defines the tasks that each middle manager will need to complete and it monitors the task progress until completion. The startup of each job has some overhead, so that many jobs would likely keep the overlord busy and prevent it from processing the other jobs you are requesting. This might explain why the time for job submissions increases over time. You could increase the resources on the overlord, but this sounds like there may be a better way to ingest the data.
The recommendation would be to use a lot less jobs and have each job do more work.
If the flow of data is so continuous as you describe, perhaps a kafka queue would be the best target followed up with a Druid kafka ingestion job which is fully scalable.
If you need to do batch, perhaps a single index_parallel job that reads many files would be much more efficient than many jobs.
Also consider that each task in an ingestion job creates a set of segments. By running a lot of really small jobs, you create a lot of very small segments which is not ideal for query performance. Here some info around how to think about segment size optimization which I think might help.

Celery: dynamically allocate concurrency based on worker memory

My celery use case: spin up a cluster of celery workers and send many tasks to that cluster, and then terminate the cluster when all of the tasks have completed (usually ~2 hrs).
I currently have it setup to use the default concurrency, which is not optimal for my use case. I see it is possible to specify a --concurrency argument in celery, which specifies the number of tasks that a worker will run in parallel. This is also not ideal for my use case, because, for example:
cluster A might have very memory intensive tasks and --concurrency=1 makes sense, but
cluster B might be memory light, and --concurrency=50 would optimize my workers.
Because I use these clusters very often for very different types of tasks, I don't want to have to manually profile the task beforehand and manually set the concurrency each time.
My desired behaviour is have memory thresholds. So for example, I can set in a config file:
min_worker_memory = .6
max_worker_memory = .8
Meaning that the worker will increment concurrency by 1 until the worker crosses over the threshold of using more than 80% memory. Then, it will decrement concurrency by 1. It will keep that concurrency for the lifetime of the cluster unless the worker memory falls below 60%, at which point it will increment concurrency by 1 again.
Are there any existing celery settings that I can leverage to do this, or will I have to implement this logic on my own? max memory per child seems somewhat close to what I want, but this ends in killed processes which is not what I want.
Unfortunately Celery does not provide an Autoscaler that scales up/down depending on the memory usage. However, being a well-designed piece of software, it gives you an interface that you may implement up to however you like. I am sure with the help of the psutil package you can easily create your own autoscaler. Documentation reference.

Autoscale Cadence clients to consume millions of Activities or run millions of workflow instance

We have millions of Activities to run or say millions of workflow instance getting created.
Can we create multiple instances of Worker or run worker with multiple thread.
Basically, I want to know, if we have millions of activities to perform or millions of workflow instance getting created. How can we autoscale.
I see two questions here:
Can we create multiple instances of Worker or run worker with multiple thread.
Yes, absolutely. The way you scale out the load is by adding more worker processes.
can we autoscale?
Autoscaling is possible by watching schedule to start latency of activity and workflow tasks. This latency represents the time a task spent in a task queue before being picked up by a worker. Ideally, if there are enough workers it is expected to be zero. But if workers cannot keep up with the load, it is going to grow as tasks being backlogged in the queue.

Kubernetes Orchestration depending upon number of rows/records/Input Files

Requirement is to orchestrate ETL containers depending upon the number of records present at the Source system (SQL/Google Analytics/SAAS/CSV files).
To explain take a Use Case:- ETL Job has to process 50K records present in SQL server, however, it takes good processing time to execute this job by one server/node as this server makes a connection with SQL, fetches the data and process the records.
Now the problem is how to orchestrate in Kubernetes this ETL Job so that it scales up/down the containers depending upon number of records/Input. Like the case discussed above if there are 50K records to process in parallel then it should scale up the containers process the records and scales down.
You would generally use a queue of some kind and Horizontal Pod Autoscaler (HPA) to watch the queue size and adjust the queue consumer replicas automatically. Specifics depend on the exact tools you use.