How does one create a cluster from different servers in MATLAB? - matlab

I have servers A, B & C with 8 cores each. Right now, I'm running jobs in parallel on a single server with 8 workers. Is there any way I can harness the power of all three for a single job? The servers are all inter-accessible via ssh (all three sit behind a gateway, so no password is required either).

I'm going to assume you're currently using Parallel Computing Toolbox. To use multiple servers together, you need the following things:
MATLAB Distributed Computing Server licenses for the MATLAB workers running on the machines.
Some sort of scheduler to schedule the jobs across the machines. MDCS comes with a basic scheduler called the "JobManager". There are also various freely available schedulers for Linux systems, such as Torque.
The installation instructions for MDCS are quite detailed and will lead you through all the stages you need to complete to get parallel jobs running.
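For concreteness, here is a minimal sketch of what the end state looks like once MDCS workers and the job manager are running on A, B, and C. The profile name 'myMJS' and the function someExpensiveComputation are hypothetical placeholders, and this uses the parpool API (older releases used matlabpool):
c = parcluster('myMJS');      % load the saved cluster profile (hypothetical name)
parpool(c, 24);               % one pool spanning all 24 cores on A, B, and C
out = zeros(1, 1000);
parfor i = 1:1000
    out(i) = someExpensiveComputation(i);   % hypothetical user function
end
delete(gcp);                  % release the workers when finished
The profile itself is created and validated once in the Cluster Profile Manager; after that, switching between local and multi-server runs is just a matter of which profile you pass to parcluster.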

Related

kubernetes - How do I get one job to work with multiple nodes?

Currently, when I create and run a deployment, it only runs on one node.
I want multiple nodes to work on one task at the same time using Kubernetes.
I want all nodes to work like one computer.
Kubernetes is about managing containers and scheduling them to run across a cluster, not about “jobs” per se. Have a look at MapReduce and Apache Spark.
First you need to understand more about Kubernetes, because your current understanding of it may be misleading. Kubernetes is a container orchestration tool that automates many of the manual processes involved in deploying, managing, and scaling containerized applications.
In other words, you can cluster together groups of hosts running Linux containers, and K8s helps you manage those clusters. To process some kind of job or data, you will need software that runs on Kubernetes.
The next step you might want to look into is the concept of distributed computing, and in particular the distributed computing model called MapReduce.
MapReduce was introduced by Google to meet the demands that its applications' large user base placed on its systems. It is used to write scalable applications that process large amounts of data in parallel on large clusters of commodity hardware servers. Hadoop is software that has adopted MapReduce and can run its programs in various languages (Python, Ruby, C++).
Take a look at this Medium article about a distributed computing system based on MapReduce and Kubernetes.
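To make the map/reduce model concrete, here is a toy word count, written in MATLAB only because this page is MATLAB-centric; it is a sketch of the model, not Hadoop or Kubernetes code. In a real system each chunk would live on a different node and the framework, not a local loop, would shuffle the per-word counts:
chunks = {'the cat sat', 'the dog sat'};   % pretend each chunk lives on its own node
mapped = {};
for k = 1:numel(chunks)                    % map phase: emit one entry per word
    words = strsplit(chunks{k});
    for w = 1:numel(words)
        mapped{end+1} = words{w};          %#ok<AGROW>
    end
end
counts = containers.Map();                 % reduce phase: sum counts per distinct word
for k = 1:numel(mapped)
    key = mapped{k};
    if isKey(counts, key)
        counts(key) = counts(key) + 1;
    else
        counts(key) = 1;
    end
end
disp(keys(counts)); disp(values(counts));  % e.g. 'the' -> 2, 'cat' -> 1, ...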

Messaging between torque jobs in a cluster

So, I need to submit computation-intensive jobs (deep neural network training) to a Torque cluster that I lease computation time on, and I need to exchange a few megabytes of float arrays every few minutes between the active nodes, since every node needs to be working on the most recent version of the neural network in order to train it well.
I was wondering if there are any good communication options, at least a way to tell each active job its sister jobs' IPs so it can connect to them over TCP. The nodes don't have access to the internet, and we can't have daemons running on the job-submission server.
The only options I see are:
some message-passing facility in Torque (I am fairly new to Torque)
the very error-prone option of using files to communicate, which I hate
a way to query the IPs of the active nodes from the server.
There are a variety of ways to exchange information between nodes on a cluster, depending on the architecture of the cluster. Torque is a resource manager, so if the job is being submitted to the cluster using a batch script there are a few environment variables that should be able to give you the hostnames or IP addresses of the nodes being used on a job.
The exact syntax for finding the IP addresses and/or hostnames will depend on the scheduler/workload manager being used with Torque on your cluster. This link has documentation for the PBS Works workload manager.
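For example, under Torque the standard PBS_NODEFILE environment variable points at a file listing the hostnames allocated to the job, one entry per allocated core. A minimal sketch of reading it, in MATLAB here only because this page is MATLAB-centric (the same two steps work in any language):
nodefile = getenv('PBS_NODEFILE');         % set by Torque inside a running job
if isempty(nodefile)
    error('PBS_NODEFILE is not set; this is not running inside a Torque job');
end
hosts = strsplit(strtrim(fileread(nodefile)));   % one hostname per line
hosts = unique(hosts);                     % multi-core allocations repeat hostnames
disp(hosts)                                % these names can seed your TCP connections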
Parallel communication between nodes can be achieved in a variety of ways and will depend partly on the hardware available in the cluster. Using MPI is one of the most common ways to parallelize code for use on a cluster, and many implementations support high-performance fabrics/interconnects such as InfiniBand. Some useful introductions to the different types of parallelism can be found here.
As an alternative to MPI, Remote Direct Memory Access (RDMA) can be used to pass and access information between nodes. If the cluster has InfiniBand network adapters, looking into the IB Verbs API from the vendor would be an additional option for passing data between nodes.
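To give a taste of what MPI-style message passing looks like, here is a hedged sketch using MATLAB PCT's spmd construct, which is built on MPI; the broadcast pattern is the same one you would write against raw MPI in other languages. It assumes a worker pool is already open:
spmd                                       % this block runs on every worker
    if labindex == 1
        weights = rand(1, 1e6);            % pretend: the latest network weights
        weights = labBroadcast(1, weights);    % worker 1 sends to everyone
    else
        weights = labBroadcast(1);         % the other workers receive
    end
    fprintf('worker %d holds %d weights\n', labindex, numel(weights));
end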

Run multiple instances of a process on a number of servers

I would like to run multiple instances of a randomized algorithm. For performance reasons, I'd like to distribute the tasks across several machines.
Typically, I run my program as follows:
./main < input.txt > output.txt
and it takes about 30 minutes to return a solution.
I would like to run as many instances of this as possible, and ideally not change the code of the program. My questions are:
1 - What online services offer computing resources that would suit my needs?
2 - Practically, how should I remotely launch all the processes, get notified of their termination, and then aggregate the results (basically, pick the best solution)? Is there a simple framework I could use, or should I look into ssh-based scripting?
1 - What online services offer computing resources that would suit my needs?
Amazon EC2.
2 - Practically, how should I remotely launch all the processes, get notified of their termination, and then aggregate the results (basically, pick the best solution)? Is there a simple framework I could use, or should I look into ssh-based scripting?
Amazon EC2 has an API for launching virtual machines. Once they're launched, you can indeed use ssh to control jobs, and I would recommend this solution. Other software for distributed job management exists, but it isn't likely to be any simpler to configure than ssh.
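If you go the ssh route, the whole launch/collect/aggregate cycle fits in a short script. A hedged sketch in MATLAB (a shell script works equally well): the hostnames are hypothetical, passwordless ssh is assumed, ./main and input.txt are assumed to already be deployed on each host, and the convention that the first token of the output is a numeric score (smaller is better) is invented purely for illustration:
hosts = {'nodeA', 'nodeB', 'nodeC'};       % hypothetical machine names
outputs = repmat({''}, 1, numel(hosts));
parfor k = 1:numel(hosts)                  % each iteration blocks on one remote run
    cmd = sprintf('ssh %s "./main < input.txt"', hosts{k});
    [status, out] = system(cmd);
    if status == 0
        outputs{k} = out;                  % stdout of the remote program
    end
end
scores = cellfun(@(s) str2double(strtok(s)), outputs);   % invented score convention
[~, idx] = min(scores);                    % aggregate: keep the best solution
fprintf('best run came from %s\n', hosts{idx});
Without the Parallel Computing Toolbox, a plain for loop that appends ' &' to each ssh command launches the runs in the background instead, at the cost of tracking completion yourself.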

matlab parallel processing on several nodes

I have studied pages and discussions on MATLAB parallel processing, but I still don't know how to distribute my program over several nodes (not cores). In the cluster I am using, there are 10 nodes available, and inside each node there are 8 cores. When using "parfor" inside a single node (locally across its 8 cores), the parallelization works fine. But when using several nodes, I think (not sure how to verify this) that it doesn't work well. Here is a piece of the program I run on the cluster:
function testPool2()
    disp('This is a comment')
    disp(['matlab number of cores : ' num2str(feature('numCores'))])
    matlabpool('open', 5);     % pre-R2013b API for opening a pool of 5 workers
    disp('This is another comment!!')
    tic;
    for i = 1:10000            % serial baseline
        b = rand(1, 1000);
    end
    toc
    tic;
    parfor i = 1:10000         % the same loop, distributed over the pool
        b = rand(1, 1000);
    end
    toc
end
And the output is:
This is a comment
matlab number of cores : 8
Starting matlabpool using the 'local' profile ... connected to 5 labs.
This is another comment!!
Elapsed time is 0.165569 seconds.
Elapsed time is 0.649951 seconds.
{Warning: Objects of distcomp.abstractstorage class exist - not clearing this class or any of its super-classes}
{Warning: Objects of distcomp.filestorage class exist - not clearing this class or any of its super-classes}
{Warning: Objects of distcomp.serializer class exist - not clearing this class or any of its super-classes}
{Warning: Objects of distcomp.fileserializer class exist - not clearing this class or any of its super-classes}
The program is first compiled using "mcc -o out testPool2.m" and then transferred to a scratch drive on the server. Then I submit the job using Microsoft HPC Pack 2008 R2. Also note that I don't have access to the graphical interface of the MATLAB installed on each of the nodes; I can only submit jobs using the MS HPC Job Manager (see this: http://blogs.technet.com/b/hpc_and_azure_observations_and_hints/archive/2011/12/12/running-matlab-in-parallel-on-a-windows-cluster-using-compiled-matlab-code-and-the-matlab-compiler-runtime-mcr.aspx )
Based on the above output, we can see that the number of available cores is 8, so I infer that "matlabpool" only works across the local cores of one machine, not across nodes (separate computers connected to each other).
So, any ideas how I can generalize my for loop ("parfor") across nodes?
P.S. I have no idea what the warnings at the end of the output mean!
In order to run MATLAB on multiple nodes, the Distributed Computing Server is needed in addition to the Parallel Computing Toolbox. The Distributed Computing Server must be installed and correctly configured on all of the nodes in the cluster. Normally MATLAB Distributed Computing Server comes with shell scripts for launching parallel MATLAB jobs on multiple nodes, depending on the scheduler and cluster setup.
Without access to the Distributed Computing Server, MATLAB can only be run on a single node. It would be worth verifying with the cluster administrator that the Distributed Computing Server is set up and running correctly; in some cases the administrators even have example scripts for launching and running the kinds of jobs common to their user base, e.g. MATLAB jobs.
Here is a link to documentation on the Distributed Computing Server:
http://www.mathworks.com/help/mdce/index.html?searchHighlight=distributed%20computing%20server
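Once the Distributed Computing Server and a scheduler profile are in place, a parfor can span nodes through a pool-enabled batch job. A minimal sketch: the profile name 'myHPCProfile' and the function myParallelWork (which would contain your parfor loop and return one output) are hypothetical placeholders, and this uses the parcluster/batch API:
c = parcluster('myHPCProfile');            % saved profile for the cluster scheduler
% request 79 pool workers; with the worker running the function itself,
% that occupies all 80 cores across the 10 nodes
job = batch(c, @myParallelWork, 1, {}, 'Pool', 79);
wait(job);                                 % block until the job finishes
result = fetchOutputs(job);                % collect myParallelWork's return value
delete(job);                               % remove the finished job from the scheduler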

Connecting laptop(s)/desktop(s) to form a MATLAB computing cluster?

I have experience running parallel jobs on a remote cluster and parallel (parfor) jobs on a single local machine, but I have never tried making a cluster of my own. I have access to a couple of laptops/desktops/servers (root access on all except one server), and was wondering if I could connect them all (or some) to form a local cluster (about 30 cores total).
Once you move beyond working with one machine, you move license types from the Parallel Computing Toolbox to a Distributed Computing Server license. The licenses are available in clusters of 8 workers and up. List price for an 8-worker cluster is $6K; 32 workers is $21K. You can get more information on the MathWorks product page. Also note that submitting jobs to the workers requires the Parallel Computing Toolbox.
Once you have the worker licenses, the only supported way to distribute jobs to the workers is through a scheduler. The server licenses come with a basic MathWorks scheduler that has some limitations, but is ideal for single users or small groups. Beyond that you would need to go with one of the higher-end schedulers, such as LSF. A full list of supported schedulers is on the product page. Moving from a PCT setup on a single machine to a distributed setup can be fairly involved.
Are you prepared to pay the license cost for this? You can use local clusters (up to 8 workers) with one copy of the Parallel Computing Toolbox license. But to use distributed clusters, you need a Distributed Computing Server license covering each worker (processor core) in the cluster. I'm not familiar with how to set this up. I know that I have access to a few of these clusters, and I also use local clusters extensively. We opted not to create our own distributed cluster for this reason. We also have data showing that distributed clusters were slow for our particular tasks (a lot of file I/O was happening in our case).
I know this doesn't answer your question, just a few things to think about.