using zookeeper to execute one by one - apache-zookeeper

In a distributed system, 5 processes, using zookeeper to coordinate.
I need these processes to run one by one in every round.
run order is dynamic, but is known for every round.
Any zookeeper recipe can do this? Thanks
for example:
round 1: 1 2 3 4 5
round 2: 3 2 4 1 5

Your problem is not fully defined:
In a situation when a given process fails to start do we have any kind of a timeout or does the system just halt?
Do the processes terminate between rounds, so that they are started by some external mechanism for every round?
Do we care about process termination status? If a process started but then failed in the middle of the execution, should the next process start?
In the simplest case, when we
accept halting
don't differentiate between a successful termination and a failure
assume that every process is somehow started for every round and knows its position in the current round
we could use the following approach (probably can be simplified):
For a given round either pre-setup or make the first process in the round create a path /round.
The first process
checks that getChildren('/round') returns an empty list
creates an ephemeral znode /round/exec_1
creates a permanent znode /round/start_1
Every other process number N
gets the list of /round's children
checks if /round/start_{N-1} exists
if yes, then checks /round/exec_{N-1} and waits for it to be deleted (using watcher)
if not, wait for it to appear (using watcher), then act as if the node was already there.
If the answers to the questions above are different, this protocol can be adjusted.
But the general idea here is to define how we indicate that a process is good to go: we register the fact that a previous process has started and then wait for it to terminate.

Related

Make pedestrians divert to another queue if QueueTime Exceeds a preselected Value

Edited Version:
I'm actually modelling an airport check-in terminal. It works fine so far, but additional I'm still trying to implement a function, that allows my pedestrians not to enter the service-queue if the queue time exceeds a preselected value (e.g. already 15 Passengers in the queue) and therefore walks to some kind of backup Service that opens during this busy times.
Here is my approach:
Variable QueueSize returns permanently the actual Number of Passengers in the Queue.
Every time a ped enters the pedservice block CheckInEco, the function waitingTime() starts:
QueueSize = CheckInEco.size();
if (QueueSize > 15) CheckInEco.cancel(ped)
So, as soon as there are more than 15 Agents in the queue, number 16 should bypass and move to an alternate ServiceBlock, which I would connect to the ccl Port of the CheckInEco Service. But when building the model, I get this message: ped cannot be resolved to a variable?
According to Anylogic Help, it should be possible to use this cancel - call, but I'm not really experienced with it.. Maybe, someone can help me out?
You can simply use a select output block to prevent pedestrians from going into the service block if there are more than 16 pedestrians already in.
Your original question had to do with waiting time, you should follow the exact same approach. But with waiting time it gets more complicated since you don't want to take the average waiting time from the start of the simulation.... so you need to decide if you want to take the last 10 minutes, 1 hour etc and do you want to include the current waiting time of agents in the queue. Since this is the the questions anymore I am not going to add it here, perhaps ask a new question if this is still the case.

Is there a way of assigning an int number to different instances of stateless services?

I'm building a solution where we'll have a (service-fabric) stateless service deployed to K instances. This service is tasked with some workload (like querying) and I want to split the workload between them as evenly as I can - and I want to make this a dynamic solution, which means if I decide to go from K instances to N instances tomorrow, I want the workload splitting to happen in a way that it will automatically distribute the load across N instances now. I don't have any partitions specified for this service.
As an example -
Let's say I'd like to query a database to retrieve a particular chunk of the records. I have 5 nodes. I want these 5 nodes to retrieve different 1/5th of the set of records. This can be achieved through some query logic like (row_id % N == K) where N is the total number of instances and K is the unique instance_number.
I was hoping to leverage FabricRuntime.GetNodeContext().NodeId - but this returns a guid which is not overly useful.
I'm looking for a way where I can deterministically say it's instance number M out of N (I need to be able to name the instances through 1..N) - so I can set my querying logic according to this. One of the requirements is if that instance goes down / crashes etc... when SF automatically restarts it, it should still identify as the same instance id - so that 2 or more nodes doesn't query the same set of results.
What is the best of solving this problem? Is there a solution which involves pure configuration through ApplicationManifest.xml or ServiceManifest.xml?
There is no out of the box solution for your problem, but it can be easily done in many different ways.
The simplest way is using the Queue-Based Load Leveling pattern in conjunction with Competing Consumers pattern.
It consists of creating a queue, add the work to the queue, and each instance get one message to process this work, if one instance goes down and the message is not processed, it goes back to the queue and another instance pick it up.
This way you don't have to worry about the number of instances running, failures and so on.
Regarding the work being put in the queue, it will depend if you want to to do batch processing or process item by item.
Item by item, you put one message in the queue for each item being processed, this is a simple way to handle the work and each instance process one message at time, or multiple messages in parallel.
In batch, you can put a message that represents a list of items to be processed and each instance process that batch until completed, this is a bit trickier because you might have to handle the progress of the work being done, in case of failure, the next time you can continue from where it stopped.
The queue approach is a reactive design, in this case the work need to be put in the queue to trigger the processing, If you want a proactive approach and need to keep track of which work goes to who, you probably might be better of using some other approach, like a Leasing mechanism, where each instance acquire a lease that belongs to the instance until it releases the lease, this would more suitable when you work with partitioned data or other mechanism where you can easily split the load.
Regarding the issue with the ID, an option would be the InstanceId of the replica you are on, you can reach by StatelessService.Context.InstanceId, it is not a sequential ID, but it is a random number. It is better than using the node id, because you might have multiple partitions on same node and the id would conflict with each other.
If you decide to use named partitions, you could use order in the partition name instead, so each partition would have a sequential name.
Worth mention that service fabric has a limitation that doesn't allow services to have multiple replicas on same node, because of this limitation you might have to design your services with this in mind, otherwise you won't be able to scale out once the limit is reached. Also, the same thread has some discussion about approaches to process multiple distributed items that might give you some ideas.

Conditional restart of supervisord processes?

I've been using supervisord for a while -- outstanding tool. The one use case I haven't been able to figure out is, how to configure jobs to be restarted until a condition is met, then stop restarting.
Example: let's say you have a bunch of work to do, like scaling thousands of images, or servicing millions of requests on a queue. A useful pattern would be to run many workers in parallel to work on that backlog. You could set up a supervisord job that ensures 100 workers are running, and if any of them crash, supervisord will spin up replacements so the pool of workers won't shrink.
That's great until the work is done. Maybe when the backlog is gone, the number of workers should scale down to 1 or 0. Supervisord will keep spinning up the total to be 100 processes, even if each new process checks to see if there's work to be done, sees none, and shuts down very quickly.
Is there a way for a process instance or process family to communicate with supervisord to say, the autoretsart behavior is no longer needed? Better yet, is there a way to scale the number of worker processes up and down based on some condition (like number of files in a directory or ??).
I know it can be done by updating the supervisord.conf file and running supervisorctl reload, but I'd prefer something that's more declarative and self-managing if such a thing exists.
Is there a way for a process instance or process family to communicate with supervisord to say, the autoretsart behavior is no longer needed?
You can wind down an activity by making sure your processes exit with different exitcode(s) when there is no work and making those the expected exitcodes with autorestart=unexpected in the configuration.
Better yet, is there a way to scale the number of worker processes up and down based on some condition (like number of files in a directory or ??).
The trouble is that the automatic state transitions don't allow for getting processes running again from an expected EXITED state. AFAIK the only way to do this is with the XML-RPC API's startProcess, so you would need to write or find an appropriate event listener that watches for your start condition and then uses the API.
An alternate design is to wrap your worker process in an event handler watching PROCESS COMMUNICATION Events and have one normal subprocess communicating new tasks to a pool of event listeners. But that model doesn't currently eliminate a pool of waiting processes when there is no work, it just organizes the control task in a way that may make it easier to separate out task related logic and resource usage.

Shared memory and waiting sessions in Matlab

Doing a regression on a large dataset, I have a huge read-only matrix that I'd like to share among several threads. I've looked at various way of doing this and found sharedmatrix toolkit to be exactly what I need. Reading through the tutorial, I came up with the following setup:
Session 0 - just loads up the matrix and makes it available
Session 1..n - worker sessions
The problem is that Session 0 has to finish only after all the other n sessions have finished. Do you know how to make a session wait? The best solution would be to make it wait until I kill it as I'm running the scripts on a remote linux system and am not connected to it all the time.
UPDATE:
In the end I've changed my approach to the problem, after reading this part of the tutorial:
The “free” directive marks the shared memory segment for deletion.
Note: it is not actually deleted until every attached session
explicitly detaches or is terminated. As soon as the last session
detaches, the system will reclaim the allocated segment.
This means that I've created one "master" session that loads up the matrix, makes it available and then starts its own computation, and several "slave" sessions that use the shared matrix. Even if the master session finishes earlier, it causes no problem to the slave sessions as the shared matrix remains in the memory until the last process that uses it is terminated.
If you really want to have it wait indefinitely, use Eitan's solution based on pause. More generally, something like labBarrier is what you should use to perform this kind of synchronization.

MPI Task Scheduling

I want to develop a task scheduler using MPI where there is a single master processor and there are worker/client processors. Each worker has all the data it needs to compute, but gets the index to work on from the master. After the computation the worker returns some data to the master. The problem is that some processes will be fast and some will be slow.
If I run a loop so that at each iteration the master sends and receives (blocking/non-blocking) data then it can't proceed to next step till it has received data from the current worker from the previous index assigned to it. The bottom line is if a worker takes too long to compute then it becomes the limiting factor and the master can't move on to assign an index to the next worker even if non-blocking techniques are used. Is it possible to skip assigning to a worker and move on to next.
I'm beginning to think that MPI might not be the paradigm to do this. Would python be a nice platform to do task scheduling?
This is absolutely possible using MPI_Irecv() and MPI_Test(). All the master process needs to do is post a non-blocking receive for each worker process, then in a loop test each one for incoming data. If a process is done, send it a new index, post a new non-blocking receive for it, and continue.
One MPI_IRecv for each process is one solution. This has the downside of needing to cancel unmatched MPI_IRecv when the work is complete.
MPI_ANY_SOURCE is an alternate path. This will allow the manager process to have a single MPI_IRecv outstanding at any given time, and the "next" process to MPI_Send will be matched with MPI_ANY_SOURCE. This has the downside of several ranks blocking in MPI_Send when there is no additional work to be done. Some kind of "nothing more to do" signal needs to be worked out, so the ranks can do a clean exit.