Least load scheduler - scheduled-tasks

I'm working on a system that uses several hundreds of workers in parallel (physical devices evaluating small tasks). Some workers are faster than others so I was wondering what the easiest way to load balance tasks on them without a priori knowledge of their speed.
I was thinking about keeping track of the number of tasks a worker is currently working on with a simple counter and then sorting the list to get the worker with the lowest active task count. This way slow workers would get some tasks but not slow down the whole system. The reason I'm asking is that the current round-robin method is causing hold up with some really slow workers (100 times slower than others) that keep accumulating tasks and blocking new tasks.
It should be a simple matter of sorting the list according to the current number of active tasks, but since I would be sorting the list several times a second (average work time per task is below 25ms) I fear that this might be a major bottleneck. So is there a simple version of getting the worker with the lowest task count without having to sort over and over again.
EDIT: The tasks are pushed to the workers via an open TCP connection. Since the dependencies between the tasks are rather complex (exclusive resource usage) let's say that all tasks are assigned to start with. As soon as a task returns from the worker all tasks that are no longer blocked are queued, and a new task is pushed to the worker. The work queue will never be empty.

How about this system:
Worker reaches the end of its task queue
Worker requests more tasks from load balancer
Load balancer assigns N tasks (where N is probably more than 1, perhaps 20 - 50 if these tasks are very small).
In this system, since you are assigning new tasks when the workers are actually done, you don't have to guess at how long the remaining tasks will take.

I think that you need to provide more information about the system:
How do you get a task to a worker? Does the worker request it or does it get pushed?
How do you know if a worker is out of work, or even how much work is it doing?
How are the physical devices modeled?
What you want to do is avoid tracking anything and find a more passive way to distribute the work.

Related

How should I pick ScheduleToStartTimeout and StartToCloseTimeout values for ActivityOptions

There are four different timeout options in the ActivityOptions, and two of those are mandatory without any default values: ScheduleToStartTimeout and StartToCloseTimeout.
What considerations should be made when selecting values for these timeouts?
As mentioned in the question, there are four different timeout options in ActivityOptions, and the differences between them may not be super clear to a new Cadence user. Let’s first briefly explain what those are:
ScheduleToStartTimeout: This configuration specifies the maximum
duration between the time the Activity is scheduled by a workflow and
it’s picked up by an activity worker to start executing it. In other
words, it configures the time a task spends in the queue.
StartToCloseTimeout: This one specifies the maximum time taken by
an activity worker from the time it fetches a task until it reports
the completion of it to the Cadence server.
ScheduleToCloseTimeout: This configuration specifies an end-to-end
timeout duration for an activity from the time it is scheduled by the
workflow until it is completed by an activity worker.
HeartbeatTimeout: If your activity is a heartbeating activity, this
configuration basically specifies the maximum duration the Cadence
server would wait for a heartbeat before assuming the activity worker
has failed.
How to select a proper timeout value
Picking the StartToCloseTimeout is fairly straightforward when you know what it does. Essentially, you should make this long enough so that the activity can complete under normal circumstances. Therefore, you should account for everything that can affect the time taken by an activity worker the latency of your down-stream (ie. services, networking etc.). On the other hand, you should aim to keep this value as small as it’s feasible to make your end-to-end system more responsive. If you can’t make this timeout less than a couple of minutes (ideally 1 minute or less), you should consider using a HeartbeatTimeout config and implement heartbeating in your activity.
ScheduleToCloseTimeout is also easy to understand, but it is more common to face issues caused by picking a less-than-ideal value here. Therefore, it’s important to ensure that a moment to pay some extra attention to this configuration.
Basically, you should consider everything that can create a backlog in the activity task queue. Some common events that contribute to a backlog are:
Reduced worker pool throughput due to deployments, maintenance or
network-related issues.
Down-stream latency spikes that would increase the time it takes to
complete each activity task, which then reduces the throughput of the
worker pool.
A significant spike in the number of workflow instances that schedule
the activity; especially if one of the upstream services is also an
asynchronous queue/stream processor which can create its own backlog
and suddenly start processing it at a very high-volume.
Ideally, no activity should timeout while waiting in the task queue, especially if the queue is backed up and the activity is configured to be retried. Because the retries would add more activity tasks to the queue and subsequently make it harder to recover from backlog or make it even worse. On the other hand, there are many use cases where business requirements really limit the total time the system can take to process an activity. Therefore, it’s usually not a bad idea to aim for a high ScheduleToCloseTimeout value as long as the business requirements allow. Depending on your use case, it might not make sense to keep your activity in the queue for more than a few minutes or it might be perfectly fine to keep it there for several days before timing out.

What queuing tools or algorithms support fair queuing of jobs

I am hitting a well known problem, but I can't find a simple answer that tells me how to solve it.
I would appreciate you directing me by answering which feature I should look for in available queuing software or suitable algorithms if the solution requires programming in addition to the tools. and if you can direct me to Python supported tools, it would be helpful
My problem is that I get over the span of the day jobs which deploy 10, 100 or 1000 tests (I exaggerate , but it helps make a point). Many jobs deploy 10 tests, some deploy 100 tests and one or two deploy 1000 tests.
I want to deploy the tests in such a manner that the delay in execution is spread in a fair manner between all jobs. Let me explain myself.
If the very large job takes 2 hours on a idle server, it would be acceptable if it completes after 4 hours.
If a small job takes 3 minutes on an idle server, it would be acceptable if it completes after 15 minutes.
I want the delay of running the jobs to be spread in a fair way, so jobs that started earlier don't get too delayed. If it looks that the job is going to be delayed more than allowed it's priority will increase.
I think that prioritizing queues may be the solution, so dynamically changing the weights on a large queue will make it faster when needed.
Is there a queue software that knows how to do the above automatically. Lets say that I give each job some time limit and the queue software knows how to prioritize the tests from each queue so that no job is delayed too much?
Thanks.
Adding information following Jim's comments.
Not enough information to supply an answer. Is a job essentially just a list of tests? Can multiple tests for a single job be run concurrently? Do you always run all tests for a job? – Jim Mischel 14 hours ago
Each job deploys between 10 to 1000 tests.
The test can run concurrently to all other tests from the same or other users without conflicts.
All tests that were deploy by a job, are planned to run.
Additional info:
I've learned so far that Prioritized Queues are actually about applying weights to items in a single queue, where items with the hightest are pulled first. If two or more items have the same highest priority, the first item to arrive will be executed first.
When I pondered about Priority Queues it was more in the way of:
Multiple Queues, where each queue has a priority assigned to the entire queue.
The priority can be changed dynamically in runtime, based on some condition, e.g. setting a time limit on the execution of the entire queue.

How can I create a Scheduled Task that will run every Second in MarkLogic?

MarkLogic Scheduled Tasks cannot be configured to run at an interval less than a minute.
Is there any way I can execute an XQuery module at an interval of 1 second?
NOTE:
Considering the situation where the Task Server is fully loaded and I need to make sure that the secondly scheduled task gets the Task Server thread whenever it needs.
Please let me know if there is anything in MarkLogic that can be used to achieve this.
Wanting rapid-fire scheduled tasks may be a hint that the design needs rethinking.
Even running a task once a minute can be risky, and needs careful thought to manage the possibilities of overlapping tasks and runaway tasks. If the application design calls for a scheduled task to run once a second, I would raise that as a potentially serious problem. Back up a few steps, and if necessary ask a new question about the higher-level problem that led to looking at scheduled tasks.
There was a sub-question about managing queue priority for tasks. Task priorities can handle some of that. There are two priorities: normal and higher. The Task Server empties the higher-priority queue first, then the normal queue. But each queue is still a simple queue, and there's no way to change priorities after a task has been spawned. So if you always queue tasks with priority=higher, then they'll all be in the higher priority queue and they'll all run in order. You can play some games with techniques like using server fields as signals to already-running tasks. But wanting to reorder tasks within a queue could be another hint that the design needs rethinking.
If, after careful thought about all the pitfalls and dangers, I decided I needed a rapid-fire task of some kind.... I would probably do it using external requests. Pick any scripting language and write a simple while loop with an HTTP request to the MarkLogic cluster. Even so, spend some time thinking about overlapping requests and locking. What happens if the request times out on the client side? Will it keep running on the server? Will that lead to overlapping requests and require deadlock resolution? Could it lead to runaway resource consumption?
Avoid any ideas that use xdmp:sleep. That will tie up a Task Server thread during the sleep period, and then you'll have two problems.

MPI Task Scheduling

I want to develop a task scheduler using MPI where there is a single master processor and there are worker/client processors. Each worker has all the data it needs to compute, but gets the index to work on from the master. After the computation the worker returns some data to the master. The problem is that some processes will be fast and some will be slow.
If I run a loop so that at each iteration the master sends and receives (blocking/non-blocking) data then it can't proceed to next step till it has received data from the current worker from the previous index assigned to it. The bottom line is if a worker takes too long to compute then it becomes the limiting factor and the master can't move on to assign an index to the next worker even if non-blocking techniques are used. Is it possible to skip assigning to a worker and move on to next.
I'm beginning to think that MPI might not be the paradigm to do this. Would python be a nice platform to do task scheduling?
This is absolutely possible using MPI_Irecv() and MPI_Test(). All the master process needs to do is post a non-blocking receive for each worker process, then in a loop test each one for incoming data. If a process is done, send it a new index, post a new non-blocking receive for it, and continue.
One MPI_IRecv for each process is one solution. This has the downside of needing to cancel unmatched MPI_IRecv when the work is complete.
MPI_ANY_SOURCE is an alternate path. This will allow the manager process to have a single MPI_IRecv outstanding at any given time, and the "next" process to MPI_Send will be matched with MPI_ANY_SOURCE. This has the downside of several ranks blocking in MPI_Send when there is no additional work to be done. Some kind of "nothing more to do" signal needs to be worked out, so the ranks can do a clean exit.

Does it make sense to use a pool of Actors?

I'm just learning, and really liking, the Actor pattern. I'm using Scala right now, but I'm interested in the architectural style in general, as it's used in Scala, Erlang, Groovy, etc.
The case I'm thinking of is where I need to do things concurrently, such as, let's say "run a job".
With threading, I would create a thread pool and a blocking queue, and have each thread poll the blocking queue, and process jobs as they came in and out of the queue.
With actors, what's the best way to handle this? Does it make sense to create a pool of actors, and somehow send messages to them containing or the jobs? Maybe with a "coordinator" actor?
Note: An aspect of the case which I forgot to mention was: what if I want to constrain the number of jobs my app will process concurrently? Maybe with a config setting? I was thinking that a pool might make it easy to do this.
Thanks!
A pool is a mechanism you use when the cost of creating and tearing down a resource is high. In Erlang this is not the case so you should not maintain a pool.
You should spawn processes as you need them and destroy them when you have finished with them.
Sometimes, it makes sense to limit how many working processes you have operating concurrently on a large task list, as the task the process is spawned to complete involve resource allocations. At the very least processes use up memory, but they could also keep open files and/or sockets which tend to be limited to only thousands and fail miserably and unpredictable once you run out.
To have a pull-driven task pool, one can spawn N linked processes that ask for a task, and one hand them a function they can spawn_monitor. As soon as the monitored process has ended, they come back for the next task. Specific needs drive the details, but that is the outline of one approach.
The reason I would let each task spawn a new process is that processes do have some state and it is nice to start off a clean slate. It's a common fine-tuning to set the min-heap size of processes adjusted to minimize the number of GCs needed during its lifetime. It is also a very efficient garbage collection to free all memory for a process and start on a new one for the next task.
Does it feel weird to use twice the number of processes like that? It's a feeling you need to overcome in Erlang programming.
There is no best way for all cases. The decision depends on the number, duration, arrival, and required completion time of the jobs.
The most obvious difference between just spawning off actors, and using pools is that in the former case your jobs will be finished nearly at the same time, while in the latter case completion times will be spread in time. The average completion time will be the same though.
The advantage of using actors is the simplicity on coding, as it requires no extra handling. The trade-off is that your actors will be competing for your CPU cores. You will not be able to have more parallel jobs than CPU cores (or HT's, whatever), no matter what programming paradigm you use.
As an example, imagine that you need to execute 100'000 jobs, each taking one minute, and the results are due next month. You have four cores. Would you spawn off 100'000 actors having each compete over the resources for a month, or would you just queue your jobs up, and have execute four at a time?
As a counterexample, imagine a web server running on the same machine. If you have five requests, would you prefer to serve four users in T time, and one in 2T, or serve all five in 1.2T time ?