Problems using batch with matlabpool

I want to use some parallel features in Matlab, so I execute the following commands:
matlabpool open local 12;
batch(funcname,1,{arg},'PathDependencies',p,'Matlabpool',1);
Then all processes stay silent indefinitely and the job never completes. Without opening matlabpool first, it finishes normally.
Is there any conflict between the use of matlabpool and batch?

The matlabpool command runs a parallel job on the local scheduler to give you workers on which to run the body of your parfor loops and spmd blocks. This means that while matlabpool is open, the number of workers available to the local scheduler is reduced. A batch job can only run when there are workers free, so after matlabpool open local 12 has claimed all 12 local workers, your batch job waits in the queue indefinitely.
You can find out how many running jobs you have on your local scheduler either using the "job monitor" from the "Parallel" desktop menu item (your matlabpool session would show up there as a job in state running with 12 tasks), or by executing the following piece of code:
s = findResource( 'scheduler', 'Type', 'local' );   % handle to the local scheduler
[pending, queued, running, finished] = findJob(s);  % jobs grouped by state
running                                             % display the running jobs

If you want to use batch and parfor at the same time, open one fewer worker with matlabpool than you otherwise would, so 11 in your case. If you run batch first and then matlabpool, this happens automatically, but not vice versa.
To see the queue:
c = parcluster;   % default cluster profile
c.Jobs            % list the jobs on it
Interestingly, if you open a second Matlab instance, you can get another 12 workers, though strangely not with a third. That makes sense, I guess: if you actually used them all, the machine would thrash.
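Putting the two answers together, a minimal sketch using the question's own funcname, arg, and p (note that the question's 'Matlabpool',1 option would reserve a further worker for the job's own pool, so it is omitted here):
matlabpool open local 11;   % leave one of the 12 local workers free
j = batch(funcname, 1, {arg}, 'PathDependencies', p);  % runs on the free worker
% parfor/spmd work can use the 11 pool workers in the meantime
wait(j);                    % block until the batch job finishes
r = fetchOutputs(j);        % collect its single output argument
matlabpool close;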

Manually Stop Jupyter Kernel and Prevent from Restarting

Background
I have created a Jupyter kernel A, from which I launch another kernel B. I am doing this in order to audit the execution of kernel B: when a user selects kernel A from the interface, kernel B is launched in the background and executes the notebook code, while strace audits the execution. After the audit phase, the code, data, provenance, etc. of the program execution are recorded and stored for later analysis.
Problem
After the notebook program ends, I intend to stop tracing the execution of kernel B. This does not happen unless I stop the execution of kernel B, which was launched internally by kernel A. The only way I have been able to do this is with the kill command, like so:
os.kill(os.getpid(), 9)
This does the job, but with a side effect: Jupyter restarts the kernel automatically, which means kernels A and B are launched and start auditing the execution again. This causes race conditions and overwrites of some files, which I want to avoid.
Possible Solutions
To my mind, there are two things I can do to resolve this issue:
1. Exit the kernel B program gracefully so the auditing of the notebook code gets completed and stored. This does not happen with the kill command, so it would need some other solution.
2. Avoid the automatic restart of the kernel, with or without the kill command.
I have looked into different ways to achieve these but have not been successful yet. Any advice on either of the two solutions, or another way of solving the problem, would be appreciated.
Have you tried terminating kernel B instead of killing it, i.e. sending signal 15 (SIGTERM) instead of 9 (SIGKILL)?
import os, signal
os.kill(os.getpid(), signal.SIGTERM)
# or, equivalently, by number:
os.kill(os.getpid(), 15)
Other signals you could send are signal.SIGSTOP and signal.SIGHUP (1); the numeric values vary by platform, so the named constants are preferable.
Another option is to insert the following code snippet at the top of the notebook code:
import signal
import sys

def onSigExit(signum, frame):  # signal handlers receive (signum, frame)
    sys.exit(0)                # exit gracefully
    # os._exit(0)              # harder exit that skips cleanup

signal.signal(signal.SIGINT, onSigExit)
signal.signal(signal.SIGTERM, onSigExit)
Now you should be able to send
os.kill(os.getpid(), 15)
and the kernel should exit gracefully and not restart.

Running multiple parpool jobs on a cluster

I am trying to run a number of MATLAB jobs on a cluster.
MATLAB saves the state and diary of each parpool job under ~/.matlab/..., so when I run multiple jobs on a cluster (each job using its own parpool), MATLAB gives me errors like "found 5 pre-existing parallel jobs...", even though I close every open parpool after using it.
Is there a way to change the preferences folder of MATLAB for each instance of MATLAB so that this conflict does not arise?
You need to override the JobStorageLocation property with a unique path for each job before starting the parallel pool, e.g.
pc = parcluster('local'); % or whatever cluster you're running your jobs on
pc.JobStorageLocation = 'C:\my\unique\job\storage\location';
parpool(pc);
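On a cluster you can derive that unique path automatically. A minimal sketch, assuming jobs are submitted through SLURM (the SLURM_JOB_ID environment variable is the assumption here; any string unique to the job would do):
jobID = getenv('SLURM_JOB_ID');                    % unique per cluster job
storageDir = fullfile(tempdir, ['mjobs_' jobID]);  % per-job storage folder
if ~exist(storageDir, 'dir')
    mkdir(storageDir);                             % the folder must already exist
end
pc = parcluster('local');
pc.JobStorageLocation = storageDir;
parpool(pc);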

The client lost connection to lab X

I need help tackling the Matlab error below. After a couple of successful runs using parfor, I got the following error message.
I opened a pool of 2 workers and sent function1 to worker1 and function2 to worker2. Both functions do some calculations on matrices and generate a CSV file at the end. It was fine until after a few runs:
The session that parfor is using has shut down
The client lost connection to lab 2. This might be due to network
problems, or the interactive matlabpool job might have errored.
We're using a VM with an Intel Xeon X7560 @ 2.27GHz (4 processors), 16GB of RAM, and a 64-bit OS.
This was part of a batch run. To resolve the issue, instead of re-using the pool for every batch iteration, make sure to close it and open a fresh Matlab pool for each iteration. This seems to be more stable now, although a lot slower than the previous implementation.
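A sketch of that workaround, where numIterations, function1, and function2 stand in for the batch loop and the tasks described above:
for iter = 1:numIterations
    matlabpool('open', 'local', 2);   % fresh pool for this iteration
    parfor w = 1:2
        if w == 1
            function1();              % hypothetical task for worker 1
        else
            function2();              % hypothetical task for worker 2
        end
    end
    matlabpool('close');              % tear the pool down before the next pass
end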

Matlab process termination in slurm

I have two questions that to me seem related:
First, is it necessary to explicitly terminate Matlab in my sbatch command? I have looked through several online slurm tutorials, and in some cases the authors include an exit command:
http://www.umbc.edu/hpcf/resources-tara-2013/how-to-run-matlab.html
And in some they don't:
http://www.buffalo.edu/ccr/support/software-resources/compilers-programming-languages/matlab/PCT.html
Second, when creating a parallel pool in a job, I almost always get the following warning:
Warning: Found 4 pre-existing communicating job(s) created by pool that are
running, and 2 communicating job(s) that are pending or queued. You can use
'delete(myCluster.Jobs)' to remove all jobs created with profile local. To
create 'myCluster' use 'myCluster = parcluster('local')'
Why is this happening, and is there any way to avoid it happening to myself and to others because of me?
It depends on how you launch Matlab. Note that your two examples use distinct methods for running a matlab script; the first one uses the -r option
matlab -nodisplay -r "matrixmultiply, exit"
while the second one uses stdin redirection from a file
matlab < runjob.m
In the first case, the Matlab process is left running after the script finishes; that is why the exit command is needed there. In the second case, the Matlab process terminates as stdin closes when the end of the file is reached.
If you do not end the Matlab process yourself, Slurm will kill it when the maximum allocation time is reached, as defined by the --time option in your submission script or by the default value for the cluster (or partition).
To avoid the warning you mention, make sure to systematically run matlabpool close at the end of your job. If you have several instances of Matlab running on the same node and a shared home directory, you will probably get the warning anyhow, as I believe the information about open matlab pools is stored in a hidden folder in your home directory. Rebooting will probably not help, but finding those files and removing them will (be careful though, and ask the system administrator).
To avoid your warning, you have to delete the
.matlab/local_cluster_jobs/
directory.
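Combining the two suggestions, a minimal job-script sketch (the pool size of 4 is illustrative; delete(myCluster.Jobs) is the cleanup the warning itself recommends):
myCluster = parcluster('local');
delete(myCluster.Jobs);          % clear stale jobs left by killed sessions
matlabpool('open', 'local', 4);
% ... parallel work here ...
matlabpool('close');             % close the pool so no job records linger
exit                             % needed when launching with "matlab -r"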

matlab does not save variables on parallel batch job

I was running a batch job on a cluster, and at the end I tried to save results using save(), but I got the following error:
ErrorMessage: The parallel job was cancelled because the task with ID 1
terminated abnormally for the following reason:
Cannot create 'results.mat' because '/home/myusername/experiments' does not exist.
Why is that happening? What is the correct way to save variables in a parallel job?
You can use SAVE in the normal way during execution of a parallel job, but you also need to be aware of where you are running. If you are running using the MathWorks jobmanager on the cluster, then depending on the security level set on the jobmanager, you might not have access to the same set of directories as you normally would. More about that stuff here: http://www.mathworks.co.uk/help/mdce/setting-job-manager-security.html
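A defensive pattern is to create the output directory from within the job before saving. A minimal sketch, where the path is the one from the error message and 'results' is the workspace variable to store (both illustrative):
outDir = '/home/myusername/experiments';            % must be reachable from the workers
if ~exist(outDir, 'dir')
    mkdir(outDir);                                  % ensure the directory exists
end
save(fullfile(outDir, 'results.mat'), 'results');   % save the variable named results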