I'm running a distributed job on a cluster. I need to execute a script that sends me an email when the last task finishes (or rather, when all the tasks are complete). I have my script ready, but I'm not sure how to go about detecting task completion. Is there a task ID analogous to labindex?
The reason I want to build this email feature into the job is so that I can just quit MATLAB after submission and collect my data when it's done. That way I won't waste resources polling it repeatedly for its state.
jobMgr = findResource(parameters for your cluster's job manager...);
job = createJob(jobMgr);
set(job, 'JobData', yourdata);
set(job, 'MaximumNumberOfWorkers', yourmaxworkers);
set(job, 'PathDependencies', yourpathdeps);
set(job, 'FileDependencies', yourfiledeps);
set(job, 'Timeout', yourtimeout);
for m = 1:numjobs
    task(m) = createTask(job, @parallelfoo, 1, {m});
    % Calls taskFinish when the task completes
    set(task(m), 'FinishedFcn', {@taskFinish, m});
end
Elsewhere, you'll have defined a function taskFinish that is invoked as a callback when each task completes.
function taskFinish(taskObj, eventData, tasknum)
    disp(['Task ' num2str(tasknum) ' completed']);
end
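If you want the email behaviour from the question, this callback is a natural place for it: count finished tasks and send the mail once the count reaches the total. A minimal sketch, assuming 10 tasks and placeholder SMTP settings (sendmail and setpref are standard MATLAB; the server, address, and task total are assumptions to adapt):

function taskFinish(taskObj, eventData, tasknum)
    % The persistent counter survives across callback invocations
    persistent numCompleted
    if isempty(numCompleted)
        numCompleted = 0;
    end
    numCompleted = numCompleted + 1;
    disp(['Task ' num2str(tasknum) ' completed']);
    if numCompleted == 10  % replace 10 with your numjobs
        % Placeholder SMTP server and address; set these for your site
        setpref('Internet', 'SMTP_Server', 'smtp.example.com');
        setpref('Internet', 'E_mail', 'me@example.com');
        sendmail('me@example.com', 'Job finished', 'All tasks are complete.');
    end
end

Bear in mind that FinishedFcn callbacks run in the client session, so this variant assumes the submitting MATLAB stays open; if you really want to quit after submission, the mail would have to be sent from the last task's own code instead.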
Note: this code was written for the original release of the Distributed Computing Toolbox (which was subsequently renamed the Parallel Computing Toolbox), so it's possible that there are more elegant ways of accomplishing what you're trying to do. It gets the job done, though, with one caveat: my understanding is that this callback functionality only works if you're running the MATLAB job manager on your cluster (not one of the third-party schedulers such as TORQUE).
I'm not sure if this answers much of your question, but you can get at the Task ID when running on the workers like so:
t = getCurrentTask();
tid = t.ID;
However, note that most schedulers execute tasks in an arbitrary order...
I would like the workflow to run for 20 minutes. If the running process does not complete within 20 minutes, the workflow should be ended immediately. However, I can only find the Timer task, and that is used for starting a process after the indicated time, which is not what I'm looking for. Does anyone know how to specify the duration for the workflow?
[Updated to cover the complete scenario and address issues raised by Koushik Sinharoy in the comments]
You can achieve it using a Timer task:
1. Link one to the Start task and set it to run for 20 minutes, in parallel with your session.
2. Have a Decision task with Treat the input links as set to OR. This will trigger the decision whenever any of the preceding tasks ends, so either your session completes or the timer runs out (whichever happens first).
3. Set the Decision condition to $s_your_session.Status = SUCCEEDED.
4. Link the Decision task to a Control task.
5. Set the Control task to Fail parent.
6. Add a condition $Decision.Condition = False to the link between the Decision task and the Control task.
This should be the result:
Start--->s_your_session--\
   \                      >--Decision [OR]---(False)--->Control Task [Fail parent]
    \--->timer-----------/
Thanks Koushik Sinharoy for the remarks below!
I need the ability to monitor, and to cancel, an ALREADY RUNNING job on a queue.
There are a lot of answers about deleting QUEUED jobs, but none about an already running one.
This is the situation: I have a "job" which consists of HUNDREDS OF THOUSANDS of rows on a database that need to be queried ONE BY ONE against a web service.
Every row needs to be picked up, queried against the web service, its response stored, and its status updated.
I already had that working as a Command (launching from / outputting to the console), but now I need to implement queues in order to allow piling up more jobs from more users.
So far I've looked at Horizon (which doesn't run on Windows due to missing process-control libs). However, from the demos I've seen around, it lacks (I believe) a couple of things I need:
Dynamically configurable timeout (the whole job may take more than 12 hours, depending on the number of rows to process on the selected job)
Ability to CANCEL an ALREADY RUNNING job.
I also considered generating EACH REQUEST as a new job instead of treating the whole collection of rows as one "job" (this would get around the timeout issue), but that would give me a Horizon "pending jobs" list of hundreds of thousands of records per job, and that would kill the browser (I know Redis can handle it without breaking a sweat). Further, I guess it's not possible to cancel "all jobs belonging to tag X".
I've also been thinking about hitting an API route that fires the job and decouples it from the app, but I'm seeing that this requires forking processes.
For the ability to cancel, I would store a job_id in a database table, and when the user hits an API endpoint to cancel a job, I'd mark it as "halted". On every loop iteration the job would check its status, and if it finds "halted", kill itself.
If I've missed any aspect, just holler and I'll add or clarify it.
So I'm asking for advice here, since I'm new to Laravel: how could I achieve this?
So I finally came up with this (somewhat clunky) solution:
In the controller:
public function cancelJob()
{
    $jobs = DB::table('jobs')->get();
    # I could use a specific ID, a user/owner filter, etc.
    foreach ($jobs as $job) {
        DB::table('jobs')->delete($job->id);
    }
    # This is a file that... well, it's self-explaining
    touch(base_path(config('files.halt_process_signal')));
    return "Job cancelled - It will stop soon";
}
In the job class (inside the Model::chunk() callback):
# CHECK FOR HALT SIGNAL AND [OPTIONALLY] STOP THE PROCESS
if ($this->service->shouldHaltProcess()) {
    # build stats, do some cleanup, log, etc...
    $this->halted = true;
    $this->service->stopProcess();
    # This FALSE is what makes the chunk() method stop looping
    return false;
}
In service class:
/**
* Checks the existence of the 'Halt Process Signal' file
*
* #return bool
*/
public function shouldHaltProcess() :bool
{
return file_exists($this->config['files.halt_process_signal']);
}
/**
* Stop the batch process
*
* #return void
*/
public function stopProcess() :void
{
logger()->info("=== HALT PROCESS SIGNAL FOUND - STOPPING THE PROCESS ===");
$this->deleteHaltProcessSignalFile();
return ;
}
It doesn't look very elegant, but it works.
I've searched all over the web, and most answers point to Horizon or other tools that don't fit my case.
If anyone has a better way to achieve this, you're welcome to share it.
Laravel queues have 3 important config options:
1. retry_after
2. timeout
3. tries
See more: https://laravel.com/docs/5.8/queues
Regarding "Dynamically configurable timeout (the whole job may take more than 12 hours, depending on the number of rows to process on the selected job)": I think you can set timeout + retry_after to around 24 hours, keeping retry_after larger than timeout so a still-running job isn't handed out twice.
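As a sketch of where those settings live (the class name and the values are assumptions to adapt to your workload; retry_after itself is set per connection in config/queue.php and should stay larger than the timeout):

<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;

# Hypothetical job class; $timeout and $tries are standard Laravel queue options
class ProcessRowsJob implements ShouldQueue
{
    use Queueable, InteractsWithQueue;

    public $timeout = 86400; // seconds: allow up to 24 h for the whole batch
    public $tries = 1;       // a failed 12-hour batch should not retry automatically

    public function handle()
    {
        // chunked row processing goes here...
    }
}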
Regarding "Ability to CANCEL an ALREADY RUNNING job": delete the job from the jobs table, then kill the worker process by its process ID on your server.
Hope it helps you :)
When an out-of-memory error is raised in a parfor, is there any way to kill only one Matlab slave to free some memory instead of having the entire script terminate?
Here is what happens by default when an out-of-memory error occurs in a parfor: the script terminates.
I wish there was a way to just kill one slave (i.e. remove a worker from the parpool), or to stop using it, so as to release as much memory as possible from it.
If you get an out-of-memory error in the master process, there is no way to fix it. For out of memory on a slave, this should do it:
The simple idea of the code: restart the parfor again and again with the missing data until you get all results. If one iteration fails, a flag (file) is written which lets all other iterations throw an error as soon as the first error has occurred. This way we get "out of the loop" without wasting time producing further out-of-memory errors.
%Your intended iterator
iterator = 1:10;
%flags which indicate which iterations succeeded
succeeded = false(size(iterator));
%result array
result = nan(size(iterator));
FLAG = 'ANY_WORKER_CRASHED';
while ~all(succeeded)
    fprintf('Another try\n')
    %determine which iterations still need to be done
    todo = iterator(~succeeded);
    %initialize array for the remaining results
    partresult = nan(size(todo));
    %initialize flags which indicate which iterations succeeded (we cannot
    %rethrow errors, as that would throw away the partial results)
    partsucceeded = false(size(todo));
    %flag file indicates that some worker crashed. Have to use a file-based
    %solution, I don't know a better one.
    if exist(FLAG, 'file')
        delete(FLAG);
    end
    try
        parfor falseindex = 1:sum(~succeeded)
            realindex = todo(falseindex);
            try
                % The flag is used to let all other workers jump out of the
                % loop as soon as one calculation has crashed.
                if exist(FLAG, 'file')
                    error('some other worker crashed');
                end
                % insert your code here
                %dummy code which randomly throws an exception
                if rand < .5
                    error('hit out of memory')
                end
                partresult(falseindex) = realindex*2;
                % End of user code
                partsucceeded(falseindex) = true;
                fprintf('trying to run %d and succeeded\n', realindex)
            catch ME
                % catch errors within workers to preserve the work done so far
                partresult(falseindex) = nan;
                partsucceeded(falseindex) = false;
                fprintf('trying to run %d but it failed\n', realindex)
                %create the flag file; fopen+fclose works as a "touch"
                fclose(fopen(FLAG, 'w'));
            end
        end
    catch
        %reduce the pool size by 1
        newsize = matlabpool('size') - 1;
        matlabpool close
        matlabpool(newsize)
    end
    %put the results of the current pass into the full result
    result(~succeeded) = partresult;
    succeeded(~succeeded) = partsucceeded;
end
After quite a bit of research, and a lot of trial and error, I think I may have a decent, compact answer. What you're going to do is:
Declare some max memory value. You can set it dynamically using the MATLAB function memory, but I like to set it directly.
Call memory inside your parfor loop, which returns the memory information for that particular worker.
If the memory used by the worker exceeds the threshold, cancel the task that worker was working on. Now, here it gets a bit tricky. Depending on the way you're using parfor, you'll either need to delete or cancel either the task or the worker. I've verified that it works with the code below when there is one task per worker, on a remote cluster.
Insert the following code at the beginning of your parfor contents. Tweak as necessary.
memLimit = 280000000; %// This doesn't have to be in parfor. Everything else does.
memData = memory;
if memData.MemUsedMATLAB > memLimit
    task = getCurrentTask();
    cancel(task);
end
Enjoy! (Fun question, by the way.)
One other option to consider: since R2013b, you can open a parallel pool with 'SpmdEnabled' set to false, which allows MATLAB worker processes to die without the whole pool being shut down; see the doc here: http://www.mathworks.co.uk/help/distcomp/parpool.html. Of course, you still need to arrange somehow to shut down the workers.
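For reference, a minimal sketch of opening such a pool (the profile name and pool size are assumptions):

% Workers in this pool may crash individually without taking the
% whole pool down (requires R2013b or later)
p = parpool('local', 4, 'SpmdEnabled', false);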
I have a Django project in combination with Celery, and I need to be able to schedule tasks dynamically, at some point in the future, with recurrence or not. I also need the ability to delete/edit already scheduled tasks.
So to achieve this, I started out using django-celery with the DatabaseScheduler, storing PeriodicTasks (with expiration) in the database, as described more or less here.
This way, if I close my app and start it again, my schedules are still there.
My problem still remains, though, since I cannot use eta to schedule a task at some point in the future. Is it possible somehow to dynamically schedule a task with eta?
A second question is whether I can schedule a once-off task, e.g. schedule it to run at 2015-05-15 15:50:00 (that is why I'm trying to use eta).
Finally, I will be scheduling some thousands of notifications. Is celery beat capable of handling this number of scheduled tasks, some of them once-off while others are periodic? Or do I have to go with a more advanced solution such as APScheduler?
Thank you
I faced the same problem yesterday. My ugly temporary solution is:
# tasks.py
from djcelery.models import PeriodicTask, IntervalSchedule
from datetime import timedelta, datetime
from django.utils.timezone import now
...

@app.task
def schedule_periodic_task(task='app.tasks.task', task_args=[], task_kwargs={},
                           interval=(1, 'minute'), expires=now()+timedelta(days=365*100)):
    PeriodicTask.objects.filter(name=task+str(task_args)+str(task_kwargs)).delete()
    task = PeriodicTask.objects.create(
        name=task+str(task_args)+str(task_kwargs), task=task,
        args=str(task_args),
        kwargs=str(task_kwargs),
        interval=IntervalSchedule.objects.get_or_create(
            every=interval[0],
            period=interval[1])[0],
        expires=expires,
    )
    task.save()
So, if you want to schedule a periodic task with eta, you should:
# anywhere.py
schedule_periodic_task.apply_async(
    kwargs={'task': 'grabber.tasks.grab_events',
            'task_args': [instance.xbet_id], 'task_kwargs': {},
            'interval': (10, 'seconds'),
            'expires': instance.start + timedelta(hours=3)},
    eta=instance.start,
)
schedule the task with eta, which in turn creates the periodic task. Ugly bits:
1. dealing with the raw task name
2. the strange period format (n, 'interval')
Please let me know if you design some prettier solution.
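For the once-off part of the question, a plain apply_async with eta already covers it, with no PeriodicTask involved. A minimal sketch (the task and argument are assumptions taken from the example above; note that a worker holds an eta task in memory until it is due, so very distant etas are better served by a scheduler entry):

# anywhere.py
from datetime import datetime
from grabber.tasks import grab_events  # the task used in the example above

grab_events.apply_async(
    args=[42],                          # assumed task argument
    eta=datetime(2015, 5, 15, 15, 50),  # runs once at this moment
)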
I am an RTOS newbie and I am creating a simple real-time system for automotive use.
I am wondering if it is possible to create a task inside another task.
I tried to do this with the following code, but it doesn't work.
void vTask1( void *pvParameters )
{
    unsigned portBASE_TYPE taskPriority;
    taskPriority = uxTaskPriorityGet( NULL );
    char x;
    while (1) {
        x = 5;
        if (x == 5)
            xTaskCreate( vTask2, "task2", 1000, "task2 is running", taskPriority + 5, NULL );
    }
}
When I debug that code, it hangs at xTaskCreate without executing the new task.
I searched the manual and the internet for something about this, but didn't find anything.
Could anyone tell me whether this is possible in an RTOS, or am I doing it the wrong way?
Tasks can be created before the scheduler has been started (from main), or after the scheduler has been started (from another task). The xTaskCreate() API documentation is here: http://www.freertos.org/a00125.html. You will also find a set of demo tasks that demonstrate creating and deleting tasks from another task in the main FreeRTOS .zip file download. Look in the FreeRTOS/Demo/Common/Minimal/death.c file ("death" for suicidal tasks, as they delete themselves after creation).
If xTaskCreate() returns errCOULD_NOT_ALLOCATE_REQUIRED_MEMORY then you have probably run out of heap space. See http://www.freertos.org/a00111.html. I think most of the hundreds of pre-configured examples that come in the zip file download have comments to this effect.
Check the return value of the xTaskCreate API.
One more thing: if the second task you are creating, vTask2, has a lower priority than vTask1 (the one creating it), and vTask1 is spinning in a while(1) loop, the scheduler will never schedule vTask2.
You can delay or suspend vTask1 after creating vTask2; then vTask2 may execute (see the sketch below).
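A minimal sketch of both points (the stack depth and priorities are placeholders, vTask2 is the task from the question, and pdMS_TO_TICKS needs a reasonably recent FreeRTOS):

#include "FreeRTOS.h"
#include "task.h"

void vTask2( void *pvParameters );   /* the task from the question */

void vTask1( void *pvParameters )
{
    BaseType_t xResult;

    /* Create vTask2 once, outside the loop, and check the result. */
    xResult = xTaskCreate( vTask2,
                           "task2",
                           1000,                         /* stack depth, in words */
                           (void *) "task2 is running",  /* pvParameters */
                           tskIDLE_PRIORITY + 1,         /* assumed: lower than vTask1 */
                           NULL );
    if( xResult != pdPASS )
    {
        /* Most likely out of heap space: handle the error here. */
    }

    for( ;; )
    {
        /* Block so that lower-priority tasks such as vTask2 get CPU time. */
        vTaskDelay( pdMS_TO_TICKS( 1000 ) );
    }
}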