Running multiple parpool jobs on a cluster - matlab

I am trying to run a number of MATLAB jobs on a cluster.
Since MATLAB saves the states and diaries of each parpool job in ~/.matlab/..., when I run multiple jobs on the cluster (each job using its own parpool), MATLAB gives me errors along the lines of "found 5 pre-existing parallel jobs...", even though I close every open parpool after using it.
Is there a way to change the preferences folder of MATLAB for each instance of MATLAB so that this conflict does not arise?

You need to overwrite the JobStorageLocation property with a unique path for each job before starting the parallel pool, e.g.
pc = parcluster('local'); % or whatever cluster you're running your jobs on
pc.JobStorageLocation = 'C:\my\unique\job\storage\location';
parpool(pc);
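One way to make the location unique is to derive it from the scheduler's job ID. A minimal sketch, where the SLURM_JOB_ID variable and the tempname fallback are assumptions about your setup rather than part of the original answer:
% Give each MATLAB instance its own job storage folder so concurrent jobs
% don't trip over each other's metadata. SLURM_JOB_ID is assumed here;
% substitute whatever job-ID variable your scheduler sets.
jobID = getenv('SLURM_JOB_ID');
if isempty(jobID)
    storageDir = tempname;                              % fall back to a unique temp folder
else
    storageDir = fullfile(tempdir, ['matlab_jobs_' jobID]);
end
if ~exist(storageDir, 'dir')
    mkdir(storageDir);
end

pc = parcluster('local');
pc.JobStorageLocation = storageDir;
parpool(pc);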

Related

How can I implement MATLAB parallel computing on more than one node

On a supercomputing server, I assign a task to 2 nodes, with 48 workers maximum.
But the MATLAB command parpool(36) does not work. The error message is as follows:
Error using parpool (line 113)
You requested a minimum of 36 workers, but the cluster "local" has the NumWorkers property set to allow a maximum of 24 workers. To run a communicating job on more workers than this (up to a maximum of 512 for the Local cluster), increase the value of the NumWorkers property for the cluster. The default value of NumWorkers for a Local cluster is the number of cores on the local machine.
According to this tip, I added two commands to my code as shown below:
myCluster = parcluster;
myCluster.NumWorkers = 36;
After that, parpool(36) runs without errors, but the computing time remains unchanged. So I guess that these 2 commands can't help me. How can I use 36 cores on 2 nodes?
Your code as written is using the "local" cluster type in MATLAB. This is designed to run on the cores of a single machine. If you want to run on a "real" cluster, you need to submit a job (or launch a parpool) using a cluster profile that submits to a MATLAB Parallel Server cluster.
For instance, if you want to use an existing cluster that is running SLURM, you would follow the instructions here to install stuff and then here to set up a cluster profile. Once you've got a cluster profile of this sort, you can either submit a batch job to run there like so:
clus = parcluster('my slurm profile');
j = batch(@myFunction, 3, {in1, in2}, 'Pool', 36); % 3 is nargout for myFunction
Or you can run an interactive pool on your client accessing the cluster like so:
parpool('my slurm profile', 36);
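If you go the batch route, you would typically wait for the job and collect its outputs afterwards. A short sketch, reusing the hypothetical 'my slurm profile' and myFunction from above:
clus = parcluster('my slurm profile');
j = batch(@myFunction, 3, {in1, in2}, 'Pool', 36);
wait(j);                   % block until the job has finished on the cluster
out = fetchOutputs(j);     % cell array holding the 3 requested outputs
delete(j);                 % remove the job and its files from the cluster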

Matlab process termination in slurm

I have two questions that to me seem related:
First, is it necessary to explicitly terminate Matlab in my sbatch command? I have looked through several online slurm tutorials, and in some cases the authors include an exit command:
http://www.umbc.edu/hpcf/resources-tara-2013/how-to-run-matlab.html
And in some they don't:
http://www.buffalo.edu/ccr/support/software-resources/compilers-programming-languages/matlab/PCT.html
Second, when creating a parallel pool in a job, I almost always get the following warning:
Warning: Found 4 pre-existing communicating job(s) created by pool that are
running, and 2 communicating job(s) that are pending or queued. You can use
'delete(myCluster.Jobs)' to remove all jobs created with profile local. To
create 'myCluster' use 'myCluster = parcluster('local')'
Why is this happening, and is there any way to avoid it happening to myself and to others because of me?
It depends on how you launch Matlab. Note that your two examples use distinct methods for running a matlab script; the first one uses the -r option
matlab -nodisplay -r "matrixmultiply, exit"
while the second one uses stdin redirection from a file
matlab < runjob.m
In the first solution, the Matlab process will be left running after the script is finished, that is why the exit command is needed there. In the second solution, the Matlab process is terminated as stdin closes when the end of the file is reached.
If you do not end the matlab process, Slurm will kill it when the maximum allocation time is reached, as defined by the --time option in your submission script or by the default cluster (or partition) value.
To avoid the warning you mention, make sure to systematically use matlabpool close at the end of your job. If you have several instances of Matlab running on the same node, and you have a shared home directory, you will probably get the warning anyhow, as I believe the information about open matlab pools is stored in a hidden folder in your home. Rebooting will probably not help, but finding those files and removing them will (be careful though and ask the system administrator).
To avoid the warning, delete the following directory:
.matlab/local_cluster_jobs/
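Alternatively, you can do the same cleanup from inside MATLAB, following the hint in the warning itself:
% Delete the stale jobs that previous pools left behind under the 'local' profile.
myCluster = parcluster('local');
delete(myCluster.Jobs);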

matlab parallel processing on several nodes

I have studied pages and discussions on MATLAB parallel processing, but I still don't know how to distribute my program over several nodes (not cores). In the cluster I am using, there are 10 nodes available, and inside each node there are 8 cores available. When using "parfor" inside each node (locally, between 8 cores), the parallelization works fine. But when using several nodes, I think (not sure how to verify this) that it doesn't work well. Here is a piece of the program which I run on the cluster:
function testPool2()
    disp('This is a comment')
    disp(['matlab number of cores : ' num2str(feature('numCores'))])
    matlabpool('open', 5);
    disp('This is another comment!!')
    tic;
    for i = 1:10000
        b = rand(1,1000);
    end
    toc
    tic;
    parfor i = 1:10000
        b = rand(1,1000);
    end
    toc
end
And the output is:
This is a comment
matlab number of cores : 8
Starting matlabpool using the 'local' profile ... connected to 5 labs.
This is another comment!!
Elapsed time is 0.165569 seconds.
Elapsed time is 0.649951 seconds.
{Warning: Objects of distcomp.abstractstorage class exist - not clearing this class or any of its super-classes}
{Warning: Objects of distcomp.filestorage class exist - not clearing this class or any of its super-classes}
{Warning: Objects of distcomp.serializer class exist - not clearing this class or any of its super-classes}
{Warning: Objects of distcomp.fileserializer class exist - not clearing this class or any of its super-classes}
The program is first compiled using "mcc -o out testPool2.m" and then transferred to a scratch drive of a server. Then I submit the job using Microsoft HPC Pack 2008 R2. Also note that I don't have access to the graphical interface of the MATLAB installation on each of the nodes. I can only submit jobs using the MSR HPC Job Manager (see this: http://blogs.technet.com/b/hpc_and_azure_observations_and_hints/archive/2011/12/12/running-matlab-in-parallel-on-a-windows-cluster-using-compiled-matlab-code-and-the-matlab-compiler-runtime-mcr.aspx )
Based on the above output we can see that the number of available cores is 8, so I infer that matlabpool only works across the local cores of one machine, not across nodes (separate computers connected to each other).
So, any ideas how I can extend my parfor loop across nodes?
PS: I have no idea what the warnings at the end of the output mean!
In order to run MATLAB on multiple nodes, MATLAB Distributed Computing Server is needed in addition to the Parallel Computing Toolbox. The Distributed Computing Server must be installed and correctly configured on all of the nodes in the cluster. Normally MATLAB Distributed Computing Server comes with shell scripts for launching parallel MATLAB jobs on multiple nodes, depending on the scheduler and cluster setup.
Without access to the Distributed Computing Server, MATLAB can only run on a single node. It would be worth verifying with the cluster administrator that the Distributed Computing Server is set up and running correctly; in some cases the administrators of these servers even have example scripts for launching and running jobs with applications common to their user base, such as MATLAB.
Here is a link to documentation on the Distributed Computing Server:
http://www.mathworks.com/help/mdce/index.html?searchHighlight=distributed%20computing%20server
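Once such a profile exists, opening a pool that spans nodes looks much like opening a local one. A minimal sketch, where 'MyClusterProfile' is a placeholder for whatever profile your administrator configured, not something from the original answer:
% Open a 16-worker pool on a Distributed Computing Server cluster profile.
% 'MyClusterProfile' is a hypothetical profile name; use the one set up on your cluster.
matlabpool('open', 'MyClusterProfile', 16);   % workers can now come from several nodes
parfor i = 1:10000
    b = rand(1,1000);
end
matlabpool close;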

matlab does not save variables on parallel batch job

I was running a batch job on a cluster, and at the end I tried to save the results using save(), but I got the following error:
ErrorMessage: The parallel job was cancelled because the task with ID 1
terminated abnormally for the following reason:
Cannot create 'results.mat' because '/home/myusername/experiments' does not exist.
Why is that happening? What is the correct way to save variables in a parallel job?
You can use SAVE in the normal way during execution of a parallel job, but you also need to be aware of where you are running. If you are running using the MathWorks jobmanager on the cluster, then depending on the security level set on the jobmanager, you might not have access to the same set of directories as you normally would. More about that stuff here: http://www.mathworks.co.uk/help/mdce/setting-job-manager-security.html
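In practice the error just means that the target folder does not exist (or is not visible) from where the task runs, so one defensive sketch is to build an absolute path and create the folder before saving. The directory and variable names below are illustrative placeholders, not taken from the original question:
% Ensure the output directory exists on the machine running the task,
% then save with an absolute path. 'experiments' and 'results' are hypothetical names.
outDir = fullfile(getenv('HOME'), 'experiments');
if ~exist(outDir, 'dir')
    mkdir(outDir);
end
save(fullfile(outDir, 'results.mat'), 'results');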

Problems using batch with matlabpool

I want to use some parallel features in Matlab.
And I execute the following commands.
matlabpool open local 12;
batch(funcname,1,{arg},'PathDependencies',p,'Matlabpool',1);
Then all processes stay silent for the rest of the time...
But without opening matlabpool, it finishes normally.
Is there any conflicts between the use of matlabpool and batch?
The matlabpool command runs a parallel job on the local scheduler to give you workers on which to run the body of your parfor loops and spmd blocks. This means that while matlabpool is open, the number of workers available to the local scheduler is reduced. Then, when you try to run a batch job, it can only run when there are workers free.
You can find out how many running jobs you have on your local scheduler either using the "job monitor" from the "Parallel" desktop menu item (your matlabpool session would show up there as a job in state running with 12 tasks), or by executing the following piece of code:
s = findResource( 'scheduler', 'Type', 'local' );
[pending, queued, running, finished] = findJob(s);
running
If you want to use batch and parfor at the same time, open one fewer worker with matlabpool than you otherwise would, so 11 in your case (see the sketch below). If you call batch first and then matlabpool, this happens automatically, but not vice versa.
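A small sketch of that ordering, reusing funcname, arg, and the path-dependency list p from the question:
% Leave workers free for the batch job instead of handing them all to matlabpool.
matlabpool('open', 'local', 11);                          % one fewer than the 12 in the question
j = batch(funcname, 1, {arg}, 'PathDependencies', p);     % runs on a remaining free worker
% Note: if the batch job also requests its own pool ('Matlabpool', 1), it needs
% one extra free worker, so open the matlabpool with one fewer still.
wait(j);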
To see the queue:
c=parcluster
c.Jobs
Interestingly, if you open up a second MATLAB instance, you can get another 12 workers, but strangely not with a third. That makes sense, I guess, since if you actually used them all it would thrash.