Celery: limit memory usage (large number of django installations)

We're having a setup with a large number of separate Django installations on a single box. Each of these has its own code base & Linux user.
We're using Celery for some asynchronous tasks.
Each of the installations has its own setup for Celery, i.e. its own celeryd & worker.
The amount of asynchronous tasks per installation is limited, and not time-critical.
When a worker starts it takes about 30 MB of memory. When it has run for a while this amount may grow (presumably due to fragmentation).
The last point (memory growth) has already been (somewhat) solved by setting --maxtasksperchild to a low number (say 10). This ensures a restart after 10 tasks, after which the memory at least goes back to 30 MB.
However, each celeryd is still taking up a lot of memory, since the minimum number of workers appears to be 1 as opposed to 0. I also imagine that running python manage.py celery worker does not lead to the smallest possible footprint for the celeryd, since the full Django stack is loaded even if the only thing that happens is checking for tasks.
In an ideal setup, I'd like to see the following: a process with a very small memory footprint (100 kB or so) watches the queue for new tasks. When such a task arises, it spins up the (heavy) full Django stack in a separate process, and when the worker is done, the heavy process is spun down.
Is such a setup configurable using (somewhat) standard Celery? If not, what points of extension are there?
We're (currently) using Celery 3.0.17 and the associated django-celery.

Just to make sure I understand - you have a lot of different Django codebases, each with their own Celery, and they take up too much memory when running on a single box simultaneously, all waiting for a Celery job to come down the pipe? How many Celery instances are we talking about here?
In my experience, you're using django-celery in a very different way than it was designed for - all of your different Django projects should be condensed into a few projects (or a single one), composed of multiple applications. Then you set up a small number of queues to field Celery tasks from the different apps - this way, you only have as many dormant Celery workers taking up 30 MB as you have queues, and a single queue can handle multiple tasks (from multiple apps if you want). The memory issue should go away.
To reiterate - you only need one celeryd, driving multiple workers. This way your bottleneck is job concurrency, not dormant memory needs.
Why do you need so many Django installations? Please let me know if I'm missing something, or if you need clarification.
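To make the consolidation concrete, here is a minimal sketch of how one consolidated project could route tasks from several apps onto a couple of queues. The setting name follows Celery 3.x (CELERY_ROUTES); the app, task and queue names are only placeholders.

# settings.py of the single consolidated Django project (illustrative names).
# Tasks from different apps are routed to a small number of queues, so only
# one celeryd has to stay resident instead of one per installation.
CELERY_ROUTES = {
    'app_one.tasks.generate_report': {'queue': 'light'},
    'app_two.tasks.resize_images': {'queue': 'heavy'},
}

# A single worker can then consume both queues, e.g.:
#   python manage.py celery worker -Q light,heavy --concurrency=2 --maxtasksperchild=10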

Celery: dynamically allocate concurrency based on worker memory

My celery use case: spin up a cluster of celery workers and send many tasks to that cluster, and then terminate the cluster when all of the tasks have completed (usually ~2 hrs).
I currently have it set up to use the default concurrency, which is not optimal for my use case. I see it is possible to specify a --concurrency argument in Celery, which specifies the number of tasks that a worker will run in parallel. This is also not ideal for my use case, because, for example:
cluster A might have very memory intensive tasks and --concurrency=1 makes sense, but
cluster B might be memory light, and --concurrency=50 would optimize my workers.
Because I use these clusters very often for very different types of tasks, I don't want to have to manually profile the task beforehand and manually set the concurrency each time.
My desired behaviour is have memory thresholds. So for example, I can set in a config file:
min_worker_memory = .6
max_worker_memory = .8
Meaning that the worker will increment concurrency by 1 until the worker crosses over the threshold of using more than 80% memory. Then, it will decrement concurrency by 1. It will keep that concurrency for the lifetime of the cluster unless the worker memory falls below 60%, at which point it will increment concurrency by 1 again.
Are there any existing Celery settings that I can leverage to do this, or will I have to implement this logic on my own? max memory per child seems somewhat close to what I want, but it ends in killed worker processes, which is not what I'm after.
Unfortunately, Celery does not provide an autoscaler that scales up/down depending on memory usage. However, being a well-designed piece of software, it gives you an interface that you may implement however you like. I am sure that with the help of the psutil package you can easily create your own autoscaler. Documentation reference.
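For illustration, here is a rough sketch of what such a memory-based autoscaler could look like. It assumes Celery's celery.worker.autoscale.Autoscaler base class and the psutil package; the class name, the watermark values and the exact hook being overridden are my own and should be checked against the source of your Celery version.

import psutil
from celery.worker.autoscale import Autoscaler

class MemoryAwareAutoscaler(Autoscaler):
    # Grow the pool while memory stays below the high watermark and shrink
    # it once the watermark is crossed (thresholds are examples).
    LOW_WATERMARK = 0.60
    HIGH_WATERMARK = 0.80

    def _maybe_scale(self, req=None):
        used = psutil.virtual_memory().percent / 100.0
        if used > self.HIGH_WATERMARK and self.processes > self.min_concurrency:
            self.scale_down(1)
            return True
        if used < self.LOW_WATERMARK and self.processes < self.max_concurrency:
            self.scale_up(1)
            return True
        return False

You would then point the worker at it (e.g. via the worker_autoscaler setting in recent Celery, or CELERYD_AUTOSCALER in older versions) and start the worker with --autoscale set to generous max/min bounds, e.g. --autoscale=50,1.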

determine ideal number of workers and EC2 sizing for master

I have a requirement to use Locust to simulate 20,000 (and higher) users in a 10-minute test window.
The locustfile is a task sequence of 9 API calls. I am trying to determine the ideal number of workers, and how many workers should be attached to an EC2 instance on AWS. My testing shows that with 20 workers on two EC2 instances, the CPU load is minimal. The master, however, suffers big time: a 4-CPU, 16 GB RAM system as the master ends up thrashing to the point that the workers start printing messages like this:
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.util.exception_handler: Retry failed after 3 times.
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/ERROR/locust.runners: RPCError found when sending heartbeat: ZMQ sent failure
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.runners: Reset connection to master
The master seems memory-exhausted, as each locust master process has grown to 12 GB of virtual RAM. OK - so the EC2 instance has a problem. But if I need to test 20,000 users, is there a machine big enough on the planet to handle this? Or do I need to take a different approach, and if so, what is the recommended direction?
In my specific case, one of the steps is to download a file from CloudFront, which is randomly selected in one of the tasks. This means the more open connections to CloudFront trying to download a file, the more congested the available network becomes.
Because the app client is actually a native app on a mobile, and there are a lot of factors affecting the download speed for each mobile, I decided to switch from a GET request to a HEAD request. This allows me to test the response time from CloudFront, where the distribution is protected by a Lambda@Edge function which authenticates the user using data from earlier in the test.
Doing this dramatically improved the load test results and doesn't artificially skew the other testing happening, since with bandwidth or system resource exhaustion every other test would be negatively impacted.
Using this approach I successfully executed a 10,000-user test in a ten-minute run time. I used 4 EC2 T2.xlarge instances with 4 workers per T2. The 9 tasks in the test plan resulted in almost 750,000 URL calls.
The answer for the question in the title is: "It depends"
Your post is a little confusing. You say you have 10 master processes? Why?
This problem is most likely not related to the master at all, as it does not care about the size of the downloads (which seems to be the only difference between your test case and most other locust tests)
There are some general tips that might help:
Switch to FastHttpUser (https://docs.locust.io/en/stable/increase-performance.html)
Monitor your network usage (if your load gens are already maxing out their bandwidth or CPU then your test is very unrealistic anyway, and adding more users just adds to the noise. In general, start low and work your way up)
Increase the number of loadgens
In general, the number of users is not an issue for locust, but number of requests per second or bandwidth might be.
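Putting the two answers together, a locustfile using FastHttpUser and a HEAD request could look roughly like the sketch below; the host, path and wait times are placeholders rather than values from the original post.

from locust import task, between
from locust.contrib.fasthttp import FastHttpUser

class CloudFrontUser(FastHttpUser):
    # Placeholder CloudFront distribution; replace with the real host.
    host = "https://dxxxxxxxxxxxx.cloudfront.net"
    wait_time = between(1, 3)

    @task
    def check_download(self):
        # HEAD returns only the response headers, so the load generators'
        # bandwidth never becomes the thing being measured.
        self.client.head("/path/to/asset", name="HEAD asset")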

What is happening when Matlab is starting a "parallel pool"?

Running parallel CPU processes in Matlab starts with the command
parpool()
According to the documentation, that function:
[creates] a special job on a pool of workers, and [connects] the MATLAB client to the parallel pool.
This function usually takes a bit of time to execute, on the order of 30 seconds. But in other multi-CPU paradigms like OpenMP, the parallel execution seems totally transparent -- I've never noticed any behavior analogous to what Matlab does (granted I'm not very experienced with parallel programming).
So, what is actually happening between the time that parpool() is called and when it finishes executing? What takes so long?
Parallel Computing Toolbox enables you to run MATLAB code in parallel using several different paradigms (e.g. jobs and tasks, parfor, spmd, parfeval, batch processing), and to run it either locally (parallelised across cores in your local machine) or remotely (parallelised across machines in a cluster - either one that you own, or one in the cloud).
In any of these cases, the code is run on MATLAB workers, which are basically copies of MATLAB without an interactive desktop.
If you're intending to run on a remote cluster, it's likely that these workers will already be started up and ready to run code. If you're intending to run locally, it's possible that you might already have started workers, but maybe you haven't.
Some of the constructs above (e.g. jobs and tasks, batch processing) just run the thing you asked for, and the workers then go back to being available for other things (possibly from a different user).
But some of the constructs (e.g. parfor, spmd) require that the workers on which you intend to run are reserved for you for a period of time - partly because they might lie idle for some time and you don't want them taken over by another user, and partly because (unlike with jobs and tasks, or batch processing) they might need to communicate with each other. This is called creating a worker pool.
When you run parpool, you're telling MATLAB that you want to reserve a pool of workers for yourself, because you're intending to run a construct that requires a worker pool. You can specify as an input argument a cluster profile, which would tell it whether you want to run on a remote cluster or locally.
If you're running on a cluster, parpool will send a message to the cluster to reserve some of its (already running) workers for your use.
If you're running locally, parpool will ensure that there are enough workers running locally, and then connect them into a pool for you.
The thing that takes 30 seconds is the part where it needs to start up workers, if they're not already running. On Windows, if you watch Task Manager while running parpool, you'll see additional copies of MATLAB popping up over those 30 seconds as the workers start (they're actually not MATLAB itself, they're MATLAB workers - you can distinguish them as they'll be using less memory with no desktop).
To compare what MATLAB is doing to OpenMP, note that these MATLAB workers are separate processes, whereas OpenMP creates multiple threads within an existing process.
To be honest, I do not think that we will ever get to know exactly what MATLAB does.
However, to give you some answer: MATLAB basically opens additional instances of itself for it to execute the code on. In order to do this, it first needs to check where the instances should be opened (you can change the cluster from local to whatever else you have access to, e.g. Amazon's EC2 cluster). Once the new instances have been opened, MATLAB then sets up the connection from your main window to the instances.
Notes:
1) It is not recommended to use parpool inside a function or script, since it will throw an error if it is run while a parallel pool is already open. The use of parallel commands, e.g. parfor, will automatically open a pool.
2) parpool only has to be executed "once" (until the pool is shut down), i.e. if you run the code again the instances are already open.
3) If you want to avoid the overhead in your code, you can create a file called startup.m in the search path of MATLAB containing the command parpool; this will automatically start a parallel pool on startup.
4) Vectorising your code will often let MATLAB parallelise it automatically (via its built-in multithreading) without this overhead.
A few further details to follow up on @Nicky's answer. Creating a parallel pool involves:
Submitting a communicating job to the appropriate cluster
This recruits MATLAB worker processes. These processes might be already running (in the case of MJS), or they might need to be started (in the case of 'local', and all other cluster types).
MPI communication is set up among the workers to support spmd (unless you specify "'SpmdEnabled', false" while starting the pool - however, this stage isn't usually a performance bottleneck).
The MATLAB workers are then connected up to the client so they can do its bidding.
The difference in overhead between parpool and something like OpenMP is because parpool generally launches additional MATLAB processes - a relatively heavy-weight operation, whereas OpenMP simply creates additional threads within a single process - comparatively light-weight. Also, as @Nicky points out - MATLAB can intrinsically multi-thread some/most vectorised operations - parpool is useful for cases where this doesn't happen, or where you have a real multi-node cluster available to run on.

Celery - Granular tasks vs. message passing overhead

The Celery docs section Performance and Strategies suggests that tasks with multiple 'steps' should be divided into subtasks for more efficient parallelization. It then mentions that (of course) there will be more message passing overhead, so dividing into subtasks may not be worth the overhead.
In my case, I have an overall task of retrieving a small image (150px x 115px) from a third-party API, then uploading it via HTTP to my site's REST API. I can either implement this as a single task, or divide up the steps of retrieving the image and then uploading it into two separate tasks. If I go with separate tasks, I assume I will have to pass the image as part of the message to the second task.
My question is, which approach should be better in this case, and how can I measure the performance in order to know for sure?
Since your jobs are I/O-constrained, dividing the task may increase the number of operations that can be done in parallel. The message-passing overhead is likely to be tiny since any capable broker should be able to handle lots of messages/second with only a few ms of latency.
In your case, uploading the image will probably take longer than downloading it. With separate tasks, the download jobs needn't wait for uploads to finish (so long as there are available workers). Another advantage of separation is that you can put each job on a different queue and dedicate more workers as backed-up queues reveal themselves.
If I were to try to benchmark this, I would compare execution times using the same number of workers for each of the two strategies. For instance, 2 workers on the combined task vs 2 workers on the divided one. Then do 4 workers on each, and so on. My inclination is that the separated tasks will show themselves to be faster, especially as the worker count is increased.
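As a rough sketch of the two-task variant (the task names, URLs and queue names here are invented for illustration), the download result can be chained into the upload task and each step routed to its own queue:

import requests
from celery import Celery, chain

app = Celery('images', broker='amqp://localhost')

@app.task
def fetch_thumbnail(url):
    # Download the small image; with a JSON task serializer you may need
    # to base64-encode the bytes before returning them.
    return requests.get(url, timeout=10).content

@app.task
def upload_thumbnail(image_bytes):
    # Push the image to the site's REST API (placeholder endpoint).
    requests.post('https://example.com/api/images/',
                  files={'file': ('thumb.png', image_bytes)}, timeout=10)

# Route each step to its own queue so downloads and uploads can be scaled
# independently by pointing dedicated workers at each queue.
chain(fetch_thumbnail.s('https://thirdparty.example/thumb.png').set(queue='downloads'),
      upload_thumbnail.s().set(queue='uploads')).delay()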

Does a modified command invocation tool – which dynamically regulates a job pool according to load – already exist?

Fellow Unix philosophers,
I programmed some tools in Perl that have a part which can run in parallel. I outfitted them with a -j (jobs) option like make and prove have because that's sensible. However, soon I became unhappy with that for two reasons.
I specify --jobs=2 because I have two CPU cores, but I should not need to tell the computer information that it can figure out by itself.
Runs of the tool rarely occupy more than 20% CPU (I/O load is also low), wasting time by not utilising the CPU to a better extent.
I hacked some more to add load measuring: additional jobs are spawned while there is still »capacity«, until a load threshold is reached; at that point the number of jobs stays more or less constant. But when, during the course of a run, other processes with higher priority demand more CPU, fewer new jobs are spawned over time and the number of jobs accordingly shrinks.
Since this responsibility was repeated code in the tools, I factored out the scheduling aspect into a stand-alone tool, following the spirit of nice et al. The parallel tools are quite dumb now; they only have signal handlers through which they are told to increase or decrease the job pool, whereas the intelligence of load measuring and figuring out when to control the pool resides in the scheduler.
Taste of the tentative interface (I also want to provide sensible defaults so options can be omitted):
run-parallel-and-schedule-job-pool \
--cpu-load-threshold=90% \
--disk-load-threshold='300 KiB/s' \
--network-load-threshold='1.2 MiB/s' \
--increase-pool='/bin/kill -USR1 %PID' \
--decrease-pool='/bin/kill -USR2 %PID' \
-- \
parallel-something-master --MOAR-OPTIONS
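For what it's worth, the scheduler side of this interface is small enough to sketch. The following is a Python illustration (not the actual tool, which is written in Perl) that assumes psutil for the load measurement and only handles the CPU threshold; the master is told to grow or shrink its pool via the SIGUSR1/SIGUSR2 convention shown above.

import os
import signal
import psutil

CPU_LOAD_THRESHOLD = 90.0  # percent, cf. --cpu-load-threshold

def schedule(master_pid, interval=5.0):
    # Poll the system load and tell the master to adjust its job pool.
    while psutil.pid_exists(master_pid):
        cpu = psutil.cpu_percent(interval=interval)
        if cpu < CPU_LOAD_THRESHOLD:
            os.kill(master_pid, signal.SIGUSR1)  # spare capacity: increase pool
        else:
            os.kill(master_pid, signal.SIGUSR2)  # over threshold: decrease pool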
Before I put effort into the last 90%, do tell me, am I duplicating someone else's work? The concept is quite obvious, so it seems it should have been done already, but I couldn't find this as a single-responsibility stand-alone tool, only as a deeply integrated part of larger, many-purpose sysadmin suites.
Bonus question: I already know runN and parallel. They do parallel execution, but do not have the dynamic scheduling (niceload goes into that territory, but is quite primitive). If, against my expectations, the stand-alone tool does not exist yet, am I better off extending runN myself or filing a wish against parallel?
Some of our users are quite happy with Condor. It is a system for dynamically distributing jobs to other workstations and servers according to their free computing resources.