Connecting laptop(s)/desktop(s) to form a MATLAB computing cluster? - matlab

I have experience running parallel jobs on a remote cluster, and parallel (parfor) jobs on a single local machine, but never tried making a cluster of my own. I have access couple of laptops/desktops/servers (root access on all except one server), and was wondering if I could connect them all (or some) to form a local cluster (will have about 30 cores total).

Once you move beyond working with one machine, you move license types from a parallel computing toolbox to a Distributed Computing Server license. The licenses are available in clusters from 8 workers and up. List price on a 8 worker cluster is $6K, 32 workers are $21K. You can get more information on the Mathworks product page. Also note that submitting jobs to the workers requires the Parallel Computing Toolbox.
Once you have the worker licenses the only supported way to distribute jobs to the workers is through a scheduler. The server licenses come with a basic Mathworks scheduler that does have some limitations, but is ideal for single users or small groups. Beyond that you would need to go with one of the higher end schedulers such as LSF. A full list of supported schedulers is on the product page. Moving from a PCT setup on a single machine to a distributed setup can be fairly involved.

Are you prepared to pay the license cost for this? You can use local clusters (up to 8) using 1 copy of the parallel computing toolbox license. But to use distributed clusters, you need a distributed computing toolbox for each "node" (processor core) on the cluster. I'm not familiar with how to set this up. I know that I have access to a few of these clusters, and I also use local clusters extensively. We opted to not create our own distributed cluster for this reason. We also have data that shows that distributed clusters were slow for our particular tasks (a lot of file io was happening in our case).
I know this doesn't answer your question, just a few things to think about.

Related

kubernetes - How do I get one job to work with multiple nodes?

Currently, when I create and run deployment, I only work on one node.
I want to work on one task at the same time using Kubernetes.
I want all nodes to work like one computer.
Kubernetes is about managing containers and scheduling them to run across a cluster, not about “jobs” per se. Have a look at MapReduce and Apache Spark.
First you need to understand more about Kubernetes and why your understanding might be a bit misleading for you concept. Kubernetes is an container orchestration tool that automates many of the manual processes involved in deploying, managing, and scaling containerized applications.
In other words, you can cluster together groups of hosts running Linux containers, and K8s helps you manage those clusters. To process some kind of job, data you will need a software that runs on kubernetes.
The next step that you might want to look into is distributed computing concept and distributed computing model called MapReduce.
MapReduce was introduce by Google to meet the demand of large set of users for its applications. Its used to write write scalable applications that can do parallel processing to process a large amount of data on a large cluster of commodity hardware servers. Hadoop is software that has adopted MapReduce and is capable of running it`s programs in various languages (Python, Ruby, C++).
Take a look on this medium article about distributed computing system based on MapReduce and Kubernetes.

how do we choose --nthreads and --nprocs per worker in dask distributed running via helm on kubernetes?

I'm running some I/O intensive Python code on Dask and want to increase the number of threads per worker. I've deployed a Kubernetes cluster that runs Dask distributed via helm. I see from the worker deployment template that the number of threads for a worker is set to the number of CPUs, but I'd like to set the number of threads higher unless that's an anti-pattern. How do I do that?
It looks like from this similar question that I can ssh to the dask scheduler and spin up workers with dask-worker? But ideally I'd be able to configure the worker resources via helm so that I don't have to interact with the scheduler other than submitting jobs to it via the Client.
Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. For more information please follow the link 1 (Best practices described on Dask`s official documentation) and 2
Threading in Python is a careful art and is really dependent on your code. To do the easy one, -nprocs should almost certainly be 1, if you want more processes, launch more replicas instead. For the thread count, first remember the GIL means only one thread can be running Python code at a time. So you only get concurrency gains under two main sitations: 1) some threads are blocked on I/O like waiting to hear back from a database or web API or 2) some threads are running non-GIL-bound C code inside NumPy or friends. For the second situation, you still can't get more concurrency than the number of CPUs since that's just how many slots there are to run at once, but the first can benefit from more threads than CPUs in some situations.
There's a limitation of Dask's helm chart that doesn't allow for the setting of --nthreads in the chart. I confirmed this with the Dask team and filed an issue: https://github.com/helm/charts/issues/18708.
In the meantime, use Dask Kubernetes for a higher degree of customization.

Messaging between torque jobs in a cluster

So, I need to submit computation intensive jobs (deep neural network training) to a torque cluster that lease computation time on, and I need to exchange a few megs of large float arrays every few minutes between the active nodes, as the nodes need to be working on the most recent version of the neural network in order to train it well.
I was wondering if there were any good communication options, at least to tell each active job its sisters jobs' ips so it can connect to them by tcp. The nodes don't have access to the internet, and we can't have daemons working on the job submitting server.
The only options that I see would be:
some message passing option on Torque (I'm am fairly noob at torque)
the very error prone option of using files to communicate, which I hate.
a way to query the ips of the active nodes from the server.
There are a variety of ways to exchange information between nodes on a cluster, depending on the architecture of the cluster. Torque is a resource manager, so if the job is being submitted to the cluster using a batch script there are a few environment variables that should be able to give you the hostnames or IP addresses of the nodes being used on a job.
The exact syntax for finding the IP addresses and/or hostnames will depend on the scheduler/workload manager being used with Torque on your cluster. This link has documentation for the PBS Works workload manager.
Parallel communication between nodes can be achieved in a variety of ways and will be partly dependent on the hardware available in the cluster. Using MPI is one of the most common ways to parallelize code for use on a cluster and many implementations support multiple high-performance fabrics/interconnect systems like Infiniband. Some useful introductions to the different types of parallelism can be found here.
As an alternative to MPI Remote Direct Memory Access(RDMA) can be used to pass and access information between nodes. If the cluster has Infiniband network adapters looking into the IB-Verbs API from the vendor would be an additional option for passing data between nodes.

Persistent storage for Apache Mesos

Recently I've discovered such a thing as a Apache Mesos.
It all looks amazingly in all that demos and examples. I could easily imagine how one would run for stateless jobs - that fits to the whole idea naturally.
Bot how to deal with long running jobs that are stateful?
Say, I have a cluster that consists of N machines (and that is scheduled via Marathon). And I want to run a postgresql server there.
That's it - at first I don't even want it to be highly available, but just simply a single job (actually Dockerized) that hosts a postgresql server.
1- How would one organize it? Constraint a server to a particular cluster node? Use some distributed FS?
2- DRBD, MooseFS, GlusterFS, NFS, CephFS, which one of those play well with Mesos and services like postgres? (I'm thinking here on the possibility that Mesos/marathon could relocate the service if goes down)
3- Please tell if my approach is wrong in terms of philosophy (DFS for data servers and some kind of switchover for servers like postgres on the top of Mesos)
Question largely copied from Persistent storage for Apache Mesos, asked by zerkms on Programmers Stack Exchange.
Excellent question. Here are a few upcoming features in Mesos to improve support for stateful services, and corresponding current workarounds.
Persistent volumes (0.23): When launching a task, you can create a volume that exists outside of the task's sandbox and will persist on the node even after the task dies/completes. When the task exits, its resources -- including the persistent volume -- can be offered back to the framework, so that the framework can launch the same task again, launch a recovery task, or launch a new task that consumes the previous task's output as its input.
Current workaround: Persist your state in some known location outside the sandbox, and have your tasks try to recover it manually. Maybe persist it in a distributed filesystem/database, so that it can be accessed from any node.
Disk Isolation (0.22): Enforce disk quota limits on sandboxes as well as persistent volumes. This ensures that your storage-heavy framework won't be able to clog up the disk and prevent other tasks from running.
Current workaround: Monitor disk usage out of band, and run periodic cleanup jobs.
Dynamic Reservations (0.23): Upon launching a task, you can reserve the resources your task uses (including persistent volumes) to guarantee that they are offered back to you upon task exit, instead of going to whichever framework is furthest below its fair share.
Current workaround: Use the slave's --resources flag to statically reserve resources for your framework upon slave startup.
As for your specific use case and questions:
1a) How would one organize it? You could do this with Marathon, perhaps creating a separate Marathon instance for your stateful services, so that you can create static reservations for the 'stateful' role, such that only the stateful Marathon will be guaranteed those resources.
1b) Constraint a server to a particular cluster node? You can do this easily in Marathon, constraining an application to a specific hostname, or any node with a specific attribute value (e.g. NFS_Access=true). See Marathon Constraints. If you only wanted to run your tasks on a specific set of nodes, you would only need to create the static reservations on those nodes. And if you need discoverability of those nodes, you should check out Mesos-DNS and/or Marathon's HAProxy integration.
1c) Use some distributed FS? The data replication provided by many distributed filesystems would guarantee that your data can survive the failure of any single node. Persisting to a DFS would also provide more flexibility in where you can schedule your tasks, although at the cost of the difference in latency between network and local disk. Mesos has built-in support for fetching binaries from HDFS uris, and many customers use HDFS for passing executor binaries, config files, and input data to the slaves where their tasks will run.
2) DRBD, MooseFS, GlusterFS, NFS, CephFS? I've heard of customers using CephFS, HDFS, and MapRFS with Mesos. NFS would seem an easy fit too. It really doesn't matter to Mesos what you use as long as your task knows how to access it from whatever node where it's placed.
Hope that helps!

How does one create a cluster from different servers in MATLAB?

I have servers A, B & C with 8 cores each. Right now, I'm running jobs in parallel on a single server with 8 workers. Is there any way I can harness the power of all three for a single job? The servers are all inter-accessible via ssh (all three exist behind a gateway, so no password required either)
I'm going to assume you're currently using Parallel Computing Toolbox. To use multiple servers together, you need the following things:
MATLAB Distributed Computing Server licences for the MATLAB workers running on the machines.
Some sort of scheduler to schedule the jobs across the machines. MDCS comes with a basic scheduler, call the "Jobmanager". There are also various freely available schedulers for Linux systems such as Torque.
The installation instructions for MDCS are quite detailed and will lead you through all the stages you need to complete to get parallel jobs running.