Django Celery Rate Limit setting doesn't work as expected - celery

I use the command below to create one worker:
celery -A proj worker -l info --concurrency=50 -Q celery,token_1 -n token_1
In my task, I set the rate limit to 4000/m.
However, when I start running the collection, I notice that the average throughput is only around 10-20 tasks/s (with the 4000/m rate limit enabled).
When I remove the rate limit, the task rate goes up to around 60/s.
I am confused: my rate limit is 4000/m, which is roughly 66/s. Why does it end up at just 10-20/s? (I have already set the concurrency to 50 for the worker.)

You're misunderstanding how rate limits operate in Celery. According to the documentation for version 4.2:
The rate limits can be specified in seconds, minutes or hours by appending “/s”, “/m” or “/h” to the value. Tasks will be evenly distributed over the specified time frame.
Example: “100/m” (hundred tasks a minute). This will enforce a minimum delay of 600ms between starting two tasks on the same worker instance.
In essence, Celery was adding a forced delay between your tasks. A limit of 4000/m works out to a minimum gap of 15 ms (60 s / 4000) between task starts on the worker. Since each task was already taking about 16 ms (1/60 of a second) to process, adding a forced delay of similar size between tasks reduced the rate at which they were processed.
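The spacing implied by a rate-limit string can be worked out with a few lines of Python. This is only a sketch of the arithmetic in the quoted documentation; `min_delay_seconds` is a hypothetical helper, not a Celery function.

```python
# Hypothetical helper (not part of Celery): converts a rate-limit string
# such as "100/m" into the implied spacing between task starts.
_SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600}


def min_delay_seconds(rate_limit: str) -> float:
    """Return the minimum delay between two task starts, in seconds."""
    count, _, unit = rate_limit.partition("/")
    # A bare number is read as tasks per second.
    window = _SECONDS_PER_UNIT[unit or "s"]
    return window / float(count)


if __name__ == "__main__":
    print(min_delay_seconds("100/m"))   # the documentation example
    print(min_delay_seconds("4000/m"))  # the limit used in the question
```

For "100/m" this gives the 600 ms quoted above, and for "4000/m" it gives 15 ms.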

Related

How to put a rate limit on a celery queue?

I read this in the celery documentation for Task.rate_limit:
Note that this is a per worker instance rate limit, and not a global rate limit. To enforce a global rate limit (e.g., for an API with a maximum number of requests per second), you must restrict to a given queue.
How do I put a rate limit on a celery queue?
It turns out it can't be done at the queue level for multiple workers.
It can be done at the queue level for one worker, or at the queue level for each worker.
So if you set 10 jobs/minute on 5 workers, your workers will process up to 50 jobs per minute collectively.
So to process only 10 jobs per minute overall, either use one worker, or use 5 workers with a limit of 2/minute each.
Update: How to exactly put the limit in settings/configuration:
task_annotations = {'tasks.<task_name>': {'rate_limit': '10/m'}}
or change the same for all tasks:
task_annotations = {'*': {'rate_limit': '10/m'}}
10/m means 10 tasks per minute; /s would mean per second. More details are in the documentation for the task annotations setting.
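Because the limit applies per worker, a global budget has to be divided by the number of workers before it goes into the configuration. A minimal sketch of that division follows; `per_worker_annotations` and the task name `tasks.fetch` are made up for illustration, and the sketch assumes every worker consumes the same task.

```python
# Hypothetical helper: split a global per-minute budget across N workers,
# because rate_limit is enforced per worker instance.
def per_worker_annotations(task_name: str, global_per_minute: int, workers: int) -> dict:
    if workers <= 0:
        raise ValueError("workers must be positive")
    # Floor division keeps the combined rate at or below the global budget.
    per_worker = global_per_minute // workers
    if per_worker == 0:
        raise ValueError("budget too small to split across this many workers")
    return {task_name: {"rate_limit": f"{per_worker}/m"}}


# 'tasks.fetch' is a placeholder task name; 10/m overall on 5 workers.
task_annotations = per_worker_annotations("tasks.fetch", global_per_minute=10, workers=5)
```

The division only holds while the worker count stays fixed; adding or removing a worker changes the effective global rate.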
I was trying to find a way to rate limit a queue and found that Celery can't do that. However, Celery can control the rate per task; see:
http://docs.celeryproject.org/en/latest/userguide/workers.html#rate-limits
As a workaround, you could set up one task per queue (which makes sense in a lot of situations) and put the limit on the task.
You can set this limit in Flower, in the worker pane.
There is a field there for entering the limit.
The suggested format is as follows:
The rate limits can be specified in seconds, minutes or hours by appending “/s”, “/m” or “/h” to the value. Tasks will be evenly distributed over the specified time frame.
Example: “100/m” (hundred tasks a minute). This will enforce a minimum delay of 600ms between starting two tasks on the same worker instance.

Scheduling jobs in Quartz as a process

Are jobs in Quartz executed as processes or as threads?
If they are executed as threads, will heavy or time-consuming jobs affect the performance of the Quartz scheduler?
If so, please suggest a solution.
If we execute 10 time-consuming jobs simultaneously, what is the effect?
I read the tutorials but didn't find an answer.
Thanks.
Read the documentation on configuring the thread pool, which explains how the Quartz thread pool can be tuned to suit your needs. More specifically, the org.quartz.threadPool.threadCount configuration property controls this, as the documentation explains:
The number of threads available for concurrent execution of jobs. You
can specify any positive integer, although only numbers between 1 and
100 are practical. If you only have a few jobs that fire a few times a
day, then one thread is plenty. If you have tens of thousands of jobs,
with many firing every minute, then you want a thread count more like
50 or 100 (this highly depends on the nature of the work that your
jobs perform, and your systems resources).
In the specific example you mentioned, with 10 jobs firing simultaneously: if you have configured the above property with at least 10 threads, each job will run concurrently on its own thread. If you have configured fewer, some jobs will start first and the others will wait for threads to become available. If no thread becomes available within the configured misfire threshold, the misfire instructions you have set determine what happens; usually the delayed jobs are triggered as soon as possible, but this is also configurable.
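The queuing behaviour described above can be illustrated with a fixed-size thread pool in Python. This is an analogy only, not Quartz code: `run_jobs` is a hypothetical function, the sleep stands in for a time-consuming job, and misfire handling is not modelled.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_jobs(job_count: int, thread_count: int, job_seconds: float = 0.05):
    """Run job_count sleeping jobs on a fixed pool; return (peak, finished)."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0, "finished": 0}

    def job():
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(job_seconds)  # stands in for a time-consuming job
        with lock:
            state["running"] -= 1
            state["finished"] += 1

    # Jobs beyond thread_count wait in the pool's queue until a thread frees up.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        for _ in range(job_count):
            pool.submit(job)
    return state["peak"], state["finished"]


if __name__ == "__main__":
    peak, finished = run_jobs(job_count=10, thread_count=4)
    print(f"peak concurrency: {peak}, finished: {finished}")
```

With 10 jobs and 4 threads, no more than 4 jobs run at once and the rest wait. Raising `thread_count` to 10 lets all of them start immediately.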

Is this an intelligent use case for optaPlanner?

I'm trying to clean up an enterprise BI system that currently is using a prioritized FIFO scheduling algorithm (so a priority 4 report from Tuesday will be executed before priority 4 reports from Thursday and priority 3 reports from Monday.) Additional details:
The queue is never empty, jobs are always being added
Jobs range in execution time from under a minute to upwards of 24 hours
There are 40 some odd identical app servers used to execute jobs
I think I could get optaPlanner up and running for this scenario, with hard rules around priority and some soft rules around average time in the queue. I'm new to scheduling optimization so I guess my question is what should I be looking for in this situation to decide if optaPlanner is going to help me or not?
The problem looks like a form of bin packing (and possibly job shop scheduling), which are NP-complete, so OptaPlanner will do better than a FIFO algorithm.
But is it really NP-complete? If all of these conditions are met, it might not be:
All 40 servers are identical. So running a priority report on server A instead of server B won't deliver a report faster.
All 40 servers are identical. So total duration (for a specific input set) is a constant.
Total makespan doesn't matter. So given 20 small jobs of 1 hour each, 1 big job of 20 hours, and 2 machines, it's fine if all the small jobs finish after 10 hours and only then the big job starts, giving a total makespan of 30 hours. There's no desire to reduce the makespan to 20 hours.
"the average time in the queue" is debatable: do you care about how long the jobs are in the queue until they are started or until they are finished? If the total duration is a constant, this can be done by merely FIFO'ing the small jobs first or last (while still respecting priority of course).
There are no dependencies between jobs.
If all these conditions are met, OptaPlanner won't be able to do better than a correctly written greedy algorithm (which schedules the highest priority job that is the smallest/largest first). If any of these conditions aren't met (for example you buy 10 new servers which are faster), then OptaPlanner can do better. You just have to evaluate if it's worth spending 1 thread to figure that out.
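A greedy scheduler of the kind mentioned above might look like the following sketch. It rests on several assumptions: a higher priority number means more urgent (as the question's example suggests), ties go to the shortest job, and all jobs are known up front, whereas the real queue keeps growing. `greedy_schedule` and the job tuples are illustrative names, not OptaPlanner concepts.

```python
import heapq


def greedy_schedule(jobs, server_count):
    """Assign jobs to identical servers.

    jobs: list of (name, priority, duration_hours).
    Returns {name: (server_index, start_time_hours)}.
    """
    # Most urgent first (higher number = more urgent, an assumption),
    # shortest job first within the same priority.
    ordered = sorted(jobs, key=lambda j: (-j[1], j[2]))
    # Min-heap of (time the server becomes free, server index).
    servers = [(0.0, i) for i in range(server_count)]
    heapq.heapify(servers)
    plan = {}
    for name, _priority, duration in ordered:
        free_at, idx = heapq.heappop(servers)  # earliest-available server
        plan[name] = (idx, free_at)
        heapq.heappush(servers, (free_at + duration, idx))
    return plan


if __name__ == "__main__":
    demo = [("a", 4, 2.0), ("b", 4, 1.0), ("c", 3, 0.5)]
    for name, (server, start) in greedy_schedule(demo, server_count=2).items():
        print(name, "-> server", server, "at t =", start)
```

Swapping the tie-break to longest-first only requires negating the duration in the sort key.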
If you use OptaPlanner, definitely take a look at real-time scheduling and daemon mode, to replan as new reports enter the system.

Scheduling policies in Linux Kernel

Can there be more than two scheduling policies working at the same time in Linux Kernel ?
Can FIFO and Round Robin be working on the same machine ?
Yes, Linux supports no fewer than 4 different scheduling policies for tasks: SCHED_BATCH, SCHED_OTHER (the default time-sharing policy, referred to below as "fair"), SCHED_FIFO and SCHED_RR.
Regardless of scheduling policy, all tasks also have a fixed hard priority (which is 0 for batch and fair, and from 1 to 99 for the RT scheduling policies FIFO and RR). Tasks are first and foremost picked by priority: the highest priority wins.
However, when several tasks with the same priority are available to run, that is where the scheduling policy kicks in: a fair task will only run for its allotted weighted share of the CPU time relative to other fair tasks (the weight comes from a soft priority called the task's nice level); a FIFO task will run until it blocks or yields, disregarding other tasks with the same priority; and an RR task will run for a fixed time slice before yielding to another task of the same priority (higher-priority tasks always win).
Please note that what I wrote above is accurate but not complete, because it does not take into account advanced CPU reservation features, but it gives the details of how the different scheduling policies interact with each other.
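The priority ranges can be inspected from user space. The sketch below uses Python's `os` module, which exposes these policies only on some Unix platforms and names the default policy `SCHED_OTHER`. `priority_ranges` is a hypothetical helper; it skips any policy the platform does not provide.

```python
import os


def priority_ranges():
    """Map each available policy name to its (min, max) static priority."""
    ranges = {}
    if not hasattr(os, "sched_get_priority_min"):
        return ranges  # platform does not expose the scheduler API
    for name in ("SCHED_OTHER", "SCHED_BATCH", "SCHED_FIFO", "SCHED_RR"):
        policy = getattr(os, name, None)
        if policy is None:
            continue  # this policy is not defined on the current platform
        ranges[name] = (
            os.sched_get_priority_min(policy),
            os.sched_get_priority_max(policy),
        )
    return ranges


if __name__ == "__main__":
    for name, (low, high) in priority_ranges().items():
        print(f"{name}: static priority {low}..{high}")
```

Switching a process to SCHED_FIFO or SCHED_RR with `os.sched_setscheduler` normally requires elevated privileges, so the sketch only reads the ranges.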
Yes. Nowadays there are different scheduling policies at different stages in the OS. Round robin is generally applied before a process gets to execute on the core, and FIFO is applied at the start stage of a newly arriving process.

Least load scheduler

I'm working on a system that uses several hundred workers in parallel (physical devices evaluating small tasks). Some workers are faster than others, so I was wondering what the easiest way is to load balance tasks across them without a priori knowledge of their speed.
I was thinking about keeping track of the number of tasks a worker is currently working on with a simple counter, and then sorting the list to get the worker with the lowest active task count. This way slow workers would get some tasks but not slow down the whole system. The reason I'm asking is that the current round-robin method is causing holdups with some really slow workers (100 times slower than others) that keep accumulating tasks and blocking new tasks.
It should be a simple matter of sorting the list according to the current number of active tasks, but since I would be sorting the list several times a second (average work time per task is below 25 ms), I fear that this might be a major bottleneck. So is there a simple way of getting the worker with the lowest task count without having to sort over and over again?
EDIT: The tasks are pushed to the workers via an open TCP connection. Since the dependencies between the tasks are rather complex (exclusive resource usage) let's say that all tasks are assigned to start with. As soon as a task returns from the worker all tasks that are no longer blocked are queued, and a new task is pushed to the worker. The work queue will never be empty.
How about this system:
Worker reaches the end of its task queue
Worker requests more tasks from load balancer
Load balancer assigns N tasks (where N is probably more than 1, perhaps 20 - 50 if these tasks are very small).
In this system, since you are assigning new tasks when the workers are actually done, you don't have to guess at how long the remaining tasks will take.
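The pull model above can be sketched with a small simulation. Everything here is illustrative: `PullBalancer`, `simulate`, and the per-task times are invented, and the real system would exchange these requests over its TCP connections.

```python
import heapq
from collections import deque


class PullBalancer:
    """Hands out batches of tasks only when a worker asks for them."""

    def __init__(self, tasks, batch_size=20):
        self._pending = deque(tasks)
        self._batch_size = batch_size

    def request_batch(self):
        """Called by a worker whose local queue has run empty."""
        batch = []
        while self._pending and len(batch) < self._batch_size:
            batch.append(self._pending.popleft())
        return batch


def simulate(task_count, seconds_per_task, batch_size):
    """Return how many tasks each worker ends up processing."""
    balancer = PullBalancer(range(task_count), batch_size)
    done = {worker: 0 for worker in seconds_per_task}
    # Heap of (time the worker runs dry, worker name); all idle at t = 0.
    idle = [(0.0, worker) for worker in sorted(seconds_per_task)]
    heapq.heapify(idle)
    while idle:
        now, worker = heapq.heappop(idle)
        batch = balancer.request_batch()
        if not batch:
            continue  # nothing left to hand out; this worker stops asking
        done[worker] += len(batch)
        busy_for = len(batch) * seconds_per_task[worker]
        heapq.heappush(idle, (now + busy_for, worker))
    return done


if __name__ == "__main__":
    print(simulate(100, {"fast": 0.01, "slow": 1.0}, batch_size=10))
```

In this run the worker that is 100 times slower holds only its first batch while the fast worker drains the rest, so a slow device never accumulates a backlog.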
I think that you need to provide more information about the system:
How do you get a task to a worker? Does the worker request it or does it get pushed?
How do you know if a worker is out of work, or even how much work is it doing?
How are the physical devices modeled?
What you want to do is avoid tracking anything and find a more passive way to distribute the work.
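If per-worker counters are kept anyway, as the question proposes, finding the least-loaded worker does not need a sort: a single linear scan with `min()` is enough, and it is cheap for a few hundred workers. A minimal sketch with a hypothetical `LeastLoadPicker` class:

```python
class LeastLoadPicker:
    """Tracks active task counts and picks the least-loaded worker."""

    def __init__(self, worker_ids):
        self.active = {worker: 0 for worker in worker_ids}

    def assign(self):
        # O(n) scan, no sorting; ties go to the first worker in insertion order.
        worker = min(self.active, key=self.active.get)
        self.active[worker] += 1
        return worker

    def complete(self, worker):
        # Call when a task returns from the worker.
        self.active[worker] -= 1


if __name__ == "__main__":
    picker = LeastLoadPicker(["w1", "w2", "w3"])
    first = picker.assign()
    second = picker.assign()
    picker.complete(first)
    print(first, second, picker.assign())
```

A slow worker keeps a high count for longer, so it is passed over until its tasks return.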