Preferred approach for running one script over multiple directories in SLURM

My most typical use case is running a single script (usually R or Matlab) over multiple directories. I have access to a SLURM-based high-performance computing environment. From my research so far, it is unclear to me which of the following approaches makes the most efficient use of the available CPUs/cores. I also want to make sure I'm not unnecessarily tying up system resources, so I'd like to double-check which of the two approaches is most suitable.
Approach 1:
Parallelize code within the script (MPI).
Wrap this in a loop that applies the script to all directories.
Submit this as a single MPI job via a SLURM batch script.
Approach 2:
Parallelize code within the script (MPI).
Create a job array with one MPI job per directory, each running the script on that directory.
I'm still new to this so if I've mixed up something here or you need more details to answer the question please let me know.

If you do not explicitly use MPI inside your original R or Matlab script, I suggest you avoid using MPI at all and use job arrays.
Assuming you have a script myscript.R and a set of subdirectories data01, data02, ..., data10, and the script takes the name of the directory as an input parameter, you can do the following.
Create a submission script in the parent directory of the data directories:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=1-0
#SBATCH --array=0-9
DIRS=(data*/) # Create a Bash array with all data directories (Bash arrays are 0-indexed, hence --array=0-9)
module load R
Rscript myscript.R "${DIRS[$SLURM_ARRAY_TASK_ID]}" # Feed the script the data directory
                                                   # corresponding to the task ID in the array
This script will create a job array in which each job runs myscript.R with one of the data directories as its argument.
Of course you will need to adapt the memory and time values, and investigate whether or not using more than one CPU per job is beneficial in your case. Also adapt the --array parameter to the actual number of directories you have.
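If the number of data directories changes from run to run, one option (a sketch; the submission script name submit.sh and the data* pattern are assumptions) is to size the array on the sbatch command line, since command-line options take precedence over #SBATCH directives:
N=$(ls -d data*/ | wc -l)              # count the data directories
sbatch --array=0-$((N - 1)) submit.sh  # one array task per directory (0-indexed)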

The answer is quite obvious to me, assuming that getting good parallelism is usually difficult.
In the first approach, you ask SLURM for a set of resources, but if you ask for many CPUs, you will probably waste quite a lot of them (if you ask for 32 CPUs and your speedup is only 4x, you are wasting 28 CPUs). So you will end up using a small portion of the cluster, processing one folder after the other.
In the second approach, you ask SLURM to run one job per folder. Many jobs will run simultaneously and each can ask for fewer resources. Say you ask for 4 CPUs per job, and the speedup is 3x, which means you waste 1 CPU per job. Running 8 jobs simultaneously takes the same 32 CPUs as the first approach, but only 8 CPUs are wasted and 8 folders are processed simultaneously.
In the end, the decision has to be made after measuring the speedup with different numbers of CPUs, but my feeling is that the second approach will generally be preferred, unless you get a very good speedup, in which case both approaches are equivalent.
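As a hedged sketch of the arithmetic in the second approach, assuming a multi-threaded (non-MPI) myscript.R that accepts the core count as a hypothetical second argument, each array task would request 4 CPUs:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4   # 4 cores per directory, as in the example above
#SBATCH --array=0-7         # 8 directories in this sketch, so up to 8 x 4 = 32 CPUs in flight
DIRS=(data*/)
module load R
# Hypothetical: myscript.R sizes its own worker pool from the second argument
Rscript myscript.R "${DIRS[$SLURM_ARRAY_TASK_ID]}" "$SLURM_CPUS_PER_TASK"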

Related

Running NetLogo headless on HPC, how to increase CPU usage?

I was running NetLogo headless on an HPC cluster using BehaviorSpace. Another (non-NetLogo) user on the HPC complained to me that I am utilizing the CPU cores to a very small extent and should increase my usage. I don't know exactly how to do so; please help. I am guessing renice won't be of any help.
Code:
#!/bin/bash
#$ -N NewPara3-d
#$ -q all.q
#$ -pe mpi 30
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/Test_results3-d.nlogo \
--experiment 3-d \
--table /home/abhishekb/csvresults/Test_results3-d.csv
In the comments you link your related question, where you're trying to use Linux process priority to make jobs run faster / use more CPU.
There you ask
CQLOAD (what does it mean too?)
The docs for this are hard to find, but you link to the spec of your cluster, which tells us that its scheduling engine is Sun's Grid Engine. Man pages are available online (you can access them locally too, in particular by typing man qstat).
If you search through for qstat -g c, you will see the outputs described. In particular, the second column (CQLOAD) is described as:
OUTPUT FORMATS
...
an average of the normalized load average of all queue hosts. In order to reflect each host's different significance, the number of configured slots is used as a weighting factor when determining cluster queue load. Please note that only hosts with a np_load_value are considered for this value. When queue selection is applied, only data about selected queues is considered in this formula. If the load value is not available at any of the hosts, '-NA-' is printed instead of the value from the complex attribute definition.
This means that CQLOAD gives an indication of how utilized the processors are in the queue. Your output shows 0.84: the average load on processors in all.q is 84%. This doesn't seem too low.
You state colleagues are complaining that your processes are not using enough CPU. I'm not sure what that's based on, but I wonder if it's just because you're using a lot of nodes (even if just for a short time).
You might want to experiment with requesting fewer slots (unless your runs then become very slow); that is achieved by altering the line #$ -pe mpi 30, maybe taking the number 30 down. You can work out how many slots you need (roughly) by timing how long one model run takes on your computer and then using:
N = (time to run 1 job * number of runs in the experiment) / (time you want the whole experiment to take)
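As a worked example with made-up numbers: if one model run takes 10 minutes, the experiment contains 90 runs, and you want the whole experiment to finish in about 60 minutes:
echo $(( (10 * 90) / 60 ))   # prints 15, so "#$ -pe mpi 15" would be enough instead of 30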
I'm not an expert, but the scheduler on the cluster seems to be supported by OpenMole.
OpenMole is a nice solution for embedding your NetLogo model transparently on many environments. It can be one solution ...

Why do my results become un-reproducible when run from qsub?

I'm running Matlab on a cluster. When I run my .m script from an interactive Matlab session on the cluster, my results are reproducible. But when I run the same script from a qsub command, as part of an array job away from my watchful eye, I get believable but unreproducible results. The .m files are doing exactly the same thing, including saving the results as .mat files.
Does anyone know why the scripts give reproducible results when run one way, and unreproducible results when run the other way?
Is this only a problem with reproducibility or is this indicative of inaccurate results?
%%%%%
Thanks to spuder for a helpful answer. Just in case anyone stumbles upon this and is interested here is some further information.
If you use more than one thread in Matlab jobs, this may result in stealing resources from other jobs, which plays havoc with the results. So you have two options:
1. Select exclusive access to a node. The cluster I am using does not currently allow parallel array jobs, so doing this was very wasteful for me: I took a whole node but used it serially.
2. Ask Matlab to run with a single computational thread (singleCompThread). This may make your script take longer to complete, but it gets jobs through the queue faster (see the sketch below).
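A minimal sketch of option 2 (the script name analysis.m is an assumption), as the line you would put in the job script:
# Batch-mode Matlab restricted to a single computational thread
matlab -nodisplay -nosplash -singleCompThread -r "run('analysis.m'); exit"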
There are a lot of variables at play. Ruling out transient issues such as network performance and load, here are a couple of possible explanations:
You are getting assigned a different batch of nodes when you run an interactive job from when you use qsub.
I've seen some sites that assign a policy of 'exclusive' to the nodes that run interactive jobs and 'shared' to nodes that run queued 'qsub' jobs. If that is the case, then you will almost always see better performance on the exclusive nodes.
Another answer might be that the interactive jobs are assigned to nodes that have less network congestion.
Additionally, if you are requesting multiple nodes, and you happen to land on nodes that traverse multiple hops, then you could be seeing significant network slowdowns. The solution would be for the cluster administrator to setup nodesets.
Are you using multiple nodes for the job? How are you requesting the resources?

Perl Distributed parallel computing

I would like to know if there are any Perl modules available to enable distributed parallel computation, similar to Apache Hadoop.
For example: a Perl script to be executed on many machines in parallel when submitted to a client node.
I'm the author of the Many-core Engine for Perl.
During the next several weekends, I will take MCE for a spin with Gearman::XS. MCE is good at maximizing available cores on a given node. Gearman is good at job distribution and includes a lot of features such as load balancing. Combining the two together was my thought for scaling MCE horizontally across many nodes. :) I did not share this news with anybody until just now.
Why are the two modules a good fit (my humble opinion):
For distribution, one needs some sort of chunking engine. MCE is a chunking engine -- so breaking up input is natural to MCE. Essentially MCE can be used at both sides, the job submission host as well as on the worker node for maximizing available cores.
For worker nodes, MCE follows a bank-queuing model when processing input data. This helps guarantee that all CPUs remain busy from the start of the job until the very end. As workers begin to idle down, the remaining "working" workers are processing their very last chunk.
One's imagination is the limit here: there are so many possibilities with these two modules working together. When writing MCE, I first focused on the node side. Job distribution is obviously next, so I did a search and came across Gearman::XS. The two modules can chunk happily together :) bigger chunks on the job-distribution side, smaller chunks once on the node. All the networking stuff is handled by Gearman.
Basically, there's no need for me to write the job distribution aspect when Gearman::XS is already quite nice. This has been my plan. I will write about Gearman::XS + MCE soon.
BTW: Folks can do similar things with GRID-Machine + MCE I imagine. MCE's beauty is on maximizing all available cores on any given node.
Another magical thing about MCE is that one may not want 200 nodes * 16 workers all reading/writing from/to the NFS server, for example; that would impact the NFS server greatly. BTW: RHEL 6.4 will include pNFS (parallel NFS). With MCE, workers can call the "do" method to serialize reads/writes against NFS. So instead of 200 * 16 = 3200 workers attacking NFS, it becomes at most 200 requests against the NFS server at any given time (one per physical node).
When writing MCE, grace was applied for many scenarios. I need to add more wikis to MCE's home page at code.google.com. In addition, MCE eats really big log files for breakfast :) Check out egrep.pl and wc.pl under the examples dir. It even beats the Wide Finder project with sequential IO (powerful slurp IO among many workers).
Check out the images included with the MCE distribution. Oh, do not forget to check out the main Gearman site as well.
What's left after this? Hmm, the web piece. One idea which comes to mind is to use Mojo. There are many options; this is just one:
Gearman::XS + MCE + Mojolicious
Again, one can use GRID-Machine instead of Gearman::XS if wanting to communicate through SSH.
Anyway, that was my plan to use an already available job distribution module. For MCE, my focus was on maximizing performance on a single node -- to include chunking, serializing, bank-queuing model, user tasks (allows for many roles), number sequencing among workers, and sequential slurp IO.
-- mario
You might look into something as simple as a message queue like ZeroMQ. I'm sure a CPAN search could help with some other suggestions.
Recently there has been some talk of the Many-core Engine (MCE) module, which you might want to investigate. I don't know for sure that it lets you parallelize off the host computer, but it seems like it wouldn't be a big step given its stated purpose.
Argon may provide what you are looking for (disclaimer - I'm the author). It allows you to set up an arbitrary network of workers, each of which runs a process pool (using Coro::ProcessPool).
Creating a task is pretty simple:
use Argon::Client;
my $client = Argon::Client->new(host => "somehost", port => 8000);
my $result = $client->queue(sub {
    use My::Work::Module qw(do_work);
    my $task_id = shift;
    do_work($task_id);
});
The GRID::Machine module on CPAN is designed for working with distributed computing.
https://metacpan.org/pod/distribution/GRID-Machine/lib/GRID/Machine/perlparintro.pod

Run time memory of perl script

I have a Perl script which is killed by an automated job whenever a high-priority process comes along, because my script runs ~300 parallel jobs for downloading data and consumes a lot of memory. I want to figure out how much memory it takes at run time so that I can request more memory before scheduling the script, or, if some tool can show me which portion of my code takes up the most memory, I can optimize the code for it.
Regarding OP's comment on the question, if you want to minimize memory use, definitely collect and append the data one row/line at a time. If you collect all of it into a variable at once, that means you need to have all of it in memory at once.
Regarding the question itself, you may want to look into whether it's possible to have the Perl code just run once (rather than running 300 separate instances) and then fork to create your individual worker processes. When you fork, the child processes will share memory with the parent much more efficiently than is possible for unrelated processes, so you will, e.g., only need to have one copy of the Perl binary in memory rather than 300 copies.
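Regarding the measurement itself, one option (assuming GNU time is installed as /usr/bin/time, which is common on Linux, and a hypothetical script name myscript.pl) is to wrap a single run and read the peak resident set size it reports:
/usr/bin/time -v perl myscript.pl 2>&1 | grep "Maximum resident set size"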

Perl Threads faster than Sequentially processing?

Just wanted to ask whether it's true that parallel processing is faster than sequential processing.
I've always thought that parallel processing is faster, so I did an experiment.
I benchmarked my scripts and found out that after doing a bunch of
sub add {
    for ($x = 0; $x <= 200000; $x++) {
        $data[$x] = $x / ($x + 2);
    }
}
threading seems to be slower by about 0.5 CPU seconds on average. Is this normal, or is it really true that sequential processing is faster?
Whether parallel vs. sequential processing is better is highly task-dependent and you've already done the right thing: You benchmarked both and determined for your task (the one you benchmarked, not necessarily the one you actually want to do) which one is faster.
As a general rule, on a single processor, sequential processing tends to be better for tasks which are CPU-bound, because if you have two tasks each needing five seconds of CPU time to complete, then you'll need ten seconds of CPU time regardless of whether you do them sequentially or in parallel. Setting up multiple threads/processes will, therefore, provide no benefit, but it will create additional task-switching overhead while also preventing you from having any results until all results are available.
CPU-bound tasks on a multi-processor system tend to do better when run in parallel, provided that they can run independently of each other. If not, or if you're using a language/threading model/IPC model/etc. which forces all tasks to run on the same processor, then see "on a single processor" above.
Parallel processing is generally better for tasks which are I/O-bound, regardless of the number of processors available, because CPUs are fast and I/O is slow, so working in parallel allows one task to process its data while the other is waiting for I/O operations to complete. (This is why make -j2 tends to be significantly faster than a plain make, even on single-processor machines.)
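For instance, on a project with a Makefile you can time the two variants directly and see the effect yourself:
time make        # sequential build
make clean
time make -j2    # two jobs in parallel; often faster even on a single-processor machine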
But, again, these are all generalities and all have cases where they'll be incorrect. Only benchmarking will reveal the truth with certainty.
Perl threads are an extreme suck. You are better off in every case forking several processes.
When you create a new thread in perl, it does the following:
Make a copy - yes, a real copy - of every single perl data structure in scope, including those belonging to modules you didn't write
Start up what is almost a new, independent instance of perl in a new OS thread
If you then want to share anything (as it has now copied everything), you have to use the share function from the threads::shared module. This is incredibly sucky, as it replaces your variable with some tie() nonsense, which adds much-too-fine-grained locking around it to prevent concurrent access. Accessing a shared variable then causes a massive amount of implicit locking and is incredibly slow.
So in short, perl threads:
Take a long time to start
Waste loads of memory
Cannot share data efficiently anyway.
You are much better off with fork(), which does not copy every variable (the kernel does copy-on-write) unless you're on Windows.
There's no reason to assume that in a single CPU core system, parallel processing will be faster.
Consider this example (a PNG timing diagram in the original post):
The red and blue lines at the top represent two tasks running sequentially on a single core.
The alternating red and blue lines at the bottom represent two tasks running in parallel on a single core.