Detecting outstanding celery tasks - pytest

This is purely for a non-eager pytest mode of operation. I want to know when celery has "caught" up with all the outstanding work. Is there any way to find that information? My testing config has a celery_session_app and a single celery_session_worker in it's own thread.
Check the number of entries in the Rabbit queue. This has problems because of pre-fetch. I can set prefetch to 1 and maybe solve it that way but I worry about race conditions. (I'm testing chords and some celery tasks queue other celery tasks)
Add a task to the "end" of the list and then .wait() on it to finish. This has problems for tasks that queue other tasks because the queue is being extended in the other thread so I can be at the end of the list when queued, but that quickly moves forward as tasks are queued behind it. I can work around this using .apply_async(countdown=3) but this is pretty much the definition of a race condition and I might need countdown=4 or I might need nothing and that is some number of seconds wasted on a test regardless.
Use signals (somehow). But what I really need is a worker_is_bored which does not exist and suffers from the same kind of race conditions mentioned above. Tasks queueing tasks could make it flash "bored" and right back to "busy".
time.sleep(N) but what should N be. (i'm running pytest -n 10 so how busy the machine is during tests, is non-trivial). And this wastes time like countdown= above.

Related

What queuing tools or algorithms support fair queuing of jobs

I am hitting a well known problem, but I can't find a simple answer that tells me how to solve it.
I would appreciate you directing me by answering which feature I should look for in available queuing software or suitable algorithms if the solution requires programming in addition to the tools. and if you can direct me to Python supported tools, it would be helpful
My problem is that I get over the span of the day jobs which deploy 10, 100 or 1000 tests (I exaggerate , but it helps make a point). Many jobs deploy 10 tests, some deploy 100 tests and one or two deploy 1000 tests.
I want to deploy the tests in such a manner that the delay in execution is spread in a fair manner between all jobs. Let me explain myself.
If the very large job takes 2 hours on a idle server, it would be acceptable if it completes after 4 hours.
If a small job takes 3 minutes on an idle server, it would be acceptable if it completes after 15 minutes.
I want the delay of running the jobs to be spread in a fair way, so jobs that started earlier don't get too delayed. If it looks that the job is going to be delayed more than allowed it's priority will increase.
I think that prioritizing queues may be the solution, so dynamically changing the weights on a large queue will make it faster when needed.
Is there a queue software that knows how to do the above automatically. Lets say that I give each job some time limit and the queue software knows how to prioritize the tests from each queue so that no job is delayed too much?
Thanks.
Adding information following Jim's comments.
Not enough information to supply an answer. Is a job essentially just a list of tests? Can multiple tests for a single job be run concurrently? Do you always run all tests for a job? – Jim Mischel 14 hours ago
Each job deploys between 10 to 1000 tests.
The test can run concurrently to all other tests from the same or other users without conflicts.
All tests that were deploy by a job, are planned to run.
Additional info:
I've learned so far that Prioritized Queues are actually about applying weights to items in a single queue, where items with the hightest are pulled first. If two or more items have the same highest priority, the first item to arrive will be executed first.
When I pondered about Priority Queues it was more in the way of:
Multiple Queues, where each queue has a priority assigned to the entire queue.
The priority can be changed dynamically in runtime, based on some condition, e.g. setting a time limit on the execution of the entire queue.

Conditional restart of supervisord processes?

I've been using supervisord for a while -- outstanding tool. The one use case I haven't been able to figure out is, how to configure jobs to be restarted until a condition is met, then stop restarting.
Example: let's say you have a bunch of work to do, like scaling thousands of images, or servicing millions of requests on a queue. A useful pattern would be to run many workers in parallel to work on that backlog. You could set up a supervisord job that ensures 100 workers are running, and if any of them crash, supervisord will spin up replacements so the pool of workers won't shrink.
That's great until the work is done. Maybe when the backlog is gone, the number of workers should scale down to 1 or 0. Supervisord will keep spinning up the total to be 100 processes, even if each new process checks to see if there's work to be done, sees none, and shuts down very quickly.
Is there a way for a process instance or process family to communicate with supervisord to say, the autoretsart behavior is no longer needed? Better yet, is there a way to scale the number of worker processes up and down based on some condition (like number of files in a directory or ??).
I know it can be done by updating the supervisord.conf file and running supervisorctl reload, but I'd prefer something that's more declarative and self-managing if such a thing exists.
Is there a way for a process instance or process family to communicate with supervisord to say, the autoretsart behavior is no longer needed?
You can wind down an activity by making sure your processes exit with different exitcode(s) when there is no work and making those the expected exitcodes with autorestart=unexpected in the configuration.
Better yet, is there a way to scale the number of worker processes up and down based on some condition (like number of files in a directory or ??).
The trouble is that the automatic state transitions don't allow for getting processes running again from an expected EXITED state. AFAIK the only way to do this is with the XML-RPC API's startProcess, so you would need to write or find an appropriate event listener that watches for your start condition and then uses the API.
An alternate design is to wrap your worker process in an event handler watching PROCESS COMMUNICATION Events and have one normal subprocess communicating new tasks to a pool of event listeners. But that model doesn't currently eliminate a pool of waiting processes when there is no work, it just organizes the control task in a way that may make it easier to separate out task related logic and resource usage.

How can I create a Scheduled Task that will run every Second in MarkLogic?

MarkLogic Scheduled Tasks cannot be configured to run at an interval less than a minute.
Is there any way I can execute an XQuery module at an interval of 1 second?
NOTE:
Considering the situation where the Task Server is fully loaded and I need to make sure that the secondly scheduled task gets the Task Server thread whenever it needs.
Please let me know if there is anything in MarkLogic that can be used to achieve this.
Wanting rapid-fire scheduled tasks may be a hint that the design needs rethinking.
Even running a task once a minute can be risky, and needs careful thought to manage the possibilities of overlapping tasks and runaway tasks. If the application design calls for a scheduled task to run once a second, I would raise that as a potentially serious problem. Back up a few steps, and if necessary ask a new question about the higher-level problem that led to looking at scheduled tasks.
There was a sub-question about managing queue priority for tasks. Task priorities can handle some of that. There are two priorities: normal and higher. The Task Server empties the higher-priority queue first, then the normal queue. But each queue is still a simple queue, and there's no way to change priorities after a task has been spawned. So if you always queue tasks with priority=higher, then they'll all be in the higher priority queue and they'll all run in order. You can play some games with techniques like using server fields as signals to already-running tasks. But wanting to reorder tasks within a queue could be another hint that the design needs rethinking.
If, after careful thought about all the pitfalls and dangers, I decided I needed a rapid-fire task of some kind.... I would probably do it using external requests. Pick any scripting language and write a simple while loop with an HTTP request to the MarkLogic cluster. Even so, spend some time thinking about overlapping requests and locking. What happens if the request times out on the client side? Will it keep running on the server? Will that lead to overlapping requests and require deadlock resolution? Could it lead to runaway resource consumption?
Avoid any ideas that use xdmp:sleep. That will tie up a Task Server thread during the sleep period, and then you'll have two problems.

Work around celerybeat being a single point of failure

I'm looking for recommended solution to work around celerybeat being a single point of failure for celery/rabbitmq deployment. I didn't find anything that made sense so far, by searching the web.
In my case, once a day timed scheduler kicks off a series of jobs that could run for half a day or longer. Since there can only be one celerybeat instance, if something happens to it or the server that it's running on, critical jobs will not be run.
I'm hoping there is already a working solution for this, as I can't be the only one who needs reliable (clustered or the like) scheduler. I don't want to resort to some sort of database-backed scheduler, if I don't have to.
There is an open issue in celery github repo about this. Don't know if they are working on it though.
As a workaround you could add a lock for tasks so that only 1 instance of specific PeriodicTask will run at a time.
Something like:
if not cache.add('My-unique-lock-name', True, timeout=lock_timeout):
return
Figuring out lock timeout is well, tricky. We're using 0.9 * task run_every seconds if different celerybeats will try to run them at different times.
0.9 just to leave some margin (e.g. when celery is a little behind schedule once, then it is on schedule which would cause lock to still be active).
Then you can use celerybeat instance on all machines. Each task will be queued for every celerybeat instance but only one task of them will finish the run.
Tasks will still respect run_every this way - worst case scenario: tasks will run at 0.9*run_every speed.
One issue with this case: if tasks were queued but not processed at scheduled time (for example because queue processors was unavailable) - then lock may be placed at wrong time causing possibly 1 next task to simply not run. To go around this you would need some kind of detection mechanism whether task is more or less on time.
Still, this shouldn't be a common situation when using in production.
Another solution is to subclass celerybeat Scheduler and override its tick method. Then for every tick add a lock before processing tasks. This makes sure that only celerybeats with same periodic tasks won't queue same tasks multiple times. Only one celerybeat for each tick (one who wins the race condition) will queue tasks. In one celerybeat goes down, with next tick another one will win the race.
This of course can be used in combination with the first solution.
Of course for this to work cache backend needs to be replicated and/or shared for all of servers.
It's an old question but I hope it helps anyone.

Does it make sense to use a pool of Actors?

I'm just learning, and really liking, the Actor pattern. I'm using Scala right now, but I'm interested in the architectural style in general, as it's used in Scala, Erlang, Groovy, etc.
The case I'm thinking of is where I need to do things concurrently, such as, let's say "run a job".
With threading, I would create a thread pool and a blocking queue, and have each thread poll the blocking queue, and process jobs as they came in and out of the queue.
With actors, what's the best way to handle this? Does it make sense to create a pool of actors, and somehow send messages to them containing or the jobs? Maybe with a "coordinator" actor?
Note: An aspect of the case which I forgot to mention was: what if I want to constrain the number of jobs my app will process concurrently? Maybe with a config setting? I was thinking that a pool might make it easy to do this.
Thanks!
A pool is a mechanism you use when the cost of creating and tearing down a resource is high. In Erlang this is not the case so you should not maintain a pool.
You should spawn processes as you need them and destroy them when you have finished with them.
Sometimes, it makes sense to limit how many working processes you have operating concurrently on a large task list, as the task the process is spawned to complete involve resource allocations. At the very least processes use up memory, but they could also keep open files and/or sockets which tend to be limited to only thousands and fail miserably and unpredictable once you run out.
To have a pull-driven task pool, one can spawn N linked processes that ask for a task, and one hand them a function they can spawn_monitor. As soon as the monitored process has ended, they come back for the next task. Specific needs drive the details, but that is the outline of one approach.
The reason I would let each task spawn a new process is that processes do have some state and it is nice to start off a clean slate. It's a common fine-tuning to set the min-heap size of processes adjusted to minimize the number of GCs needed during its lifetime. It is also a very efficient garbage collection to free all memory for a process and start on a new one for the next task.
Does it feel weird to use twice the number of processes like that? It's a feeling you need to overcome in Erlang programming.
There is no best way for all cases. The decision depends on the number, duration, arrival, and required completion time of the jobs.
The most obvious difference between just spawning off actors, and using pools is that in the former case your jobs will be finished nearly at the same time, while in the latter case completion times will be spread in time. The average completion time will be the same though.
The advantage of using actors is the simplicity on coding, as it requires no extra handling. The trade-off is that your actors will be competing for your CPU cores. You will not be able to have more parallel jobs than CPU cores (or HT's, whatever), no matter what programming paradigm you use.
As an example, imagine that you need to execute 100'000 jobs, each taking one minute, and the results are due next month. You have four cores. Would you spawn off 100'000 actors having each compete over the resources for a month, or would you just queue your jobs up, and have execute four at a time?
As a counterexample, imagine a web server running on the same machine. If you have five requests, would you prefer to serve four users in T time, and one in 2T, or serve all five in 1.2T time ?