Running NetLogo headless on HPC, how to increase CPU usage? - netlogo

I was running NetLogo headless on HPC using BehaviorSpace. Another (non-NetLogo) user on the HPC complained to me that I am utilizing the CPU cores to only a very small extent and should increase my usage. I don't know exactly how to do so; please help. I am guessing renice won't be of any help.
Code:
#!/bin/bash
#$ -N NewPara3-d    # job name
#$ -q all.q         # target queue
#$ -pe mpi 30       # parallel environment: request 30 slots
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/Test_results3-d.nlogo \
--experiment 3-d \
--table /home/abhishekb/csvresults/Test_results3-d.csv
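A hedged aside: recent NetLogo headless releases also document a --threads option for BehaviorSpace, controlling how many model runs execute in parallel (by default, one per available processor). Whether your 5.1.0 install honours it is an assumption worth checking with netlogo-headless.sh --help; if it does, matching it to the requested slots would look roughly like this:
#!/bin/bash
#$ -N NewPara3-d
#$ -q all.q
#$ -pe mpi 30
# --threads 30 is meant to match the 30 requested slots; verify the flag
# exists in your NetLogo build before relying on it.
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/Test_results3-d.nlogo \
--experiment 3-d \
--threads 30 \
--table /home/abhishekb/csvresults/Test_results3-d.csv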

In comments you link your related question, where you're trying to use Linux process priority to make jobs run faster / use more CPU.
There you ask:
CQLOAD (what does it mean too?)
The docs for this are hard to find, but you link to the spec of your cluster, which tells us that its scheduling engine is Sun's Grid Engine. Man pages are here (you can access them locally too, in particular by typing man qstat).
If you search through for qstat -g c, you will see the outputs described. In particular, the second column (CQLOAD) is described as:
OUTPUT FORMATS
...
an average of the normalized load average of all queue hosts. In order to
reflect each host's different significance the number of configured slots is
used as a weighting factor when determining cluster queue load. Please note
that only hosts with a np_load_value are considered for this value. When
queue selection is applied only data about selected queues is considered in
this formula. If the load value is not available at any of the hosts '-NA-'
is printed instead of the value from the complex attribute definition.
This means that CQLOAD gives an indication of how utilized the processors are in the queue. Your output shows 0.84: the average load on processors in all.q is 84%. This doesn't seem too low.
You state colleagues are complaining that your processes are not using enough CPU. I'm not sure what that's based on, but I wonder if it's just because you're using a lot of nodes (even if just for a short time).
You might want to experiment with using fewer slots (unless that makes your runs too slow) - that is achieved by altering the line #$ -pe mpi 30 - maybe take the number 30 down. You can work out roughly how many slots you need by timing how long one model run takes on your computer and then using
N = (time to run 1 job * number of runs in experiment) / (time you want the whole experiment to take)
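As a quick sketch of that formula in shell arithmetic (the 10-minute run time, 300 runs, and 2-hour target below are made-up numbers, not measurements):
#!/bin/bash
# Rough estimate of how many slots to request; all figures are illustrative.
SECS_PER_RUN=600              # measured time of one model run on your own machine
RUNS=300                      # number of runs in the BehaviorSpace experiment
TARGET_SECS=$(( 2 * 3600 ))   # how long you are willing to wait overall
# Round up so the experiment finishes within the target time.
N=$(( (SECS_PER_RUN * RUNS + TARGET_SECS - 1) / TARGET_SECS ))
echo "Request about $N slots, e.g. '#\$ -pe mpi $N'"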

I'm not an expert, but the scheduler on the cluster seems to be supported by OpenMOLE.
OpenMOLE is a nice solution for embedding your NetLogo model transparently on many environments. It could be one solution...

Related

Ballpark value for `--jobs` in `pg_restore` command

I'm using pg_restore to restore a database to its original state for load testing. I see it has a --jobs=number-of-jobs option.
How can I get a ballpark estimate of what this value should be on my machine? I know this is dependent on a bunch of factors, e.g. machine, dataset, but it would be great to get a conceptual starting point.
I'm on a MacBook Pro so maybe I can use the number of physical CPU cores:
sysctl -n hw.physicalcpu
# 10
If there is no concurrent activity at all, don't exceed the number of parallel I/O requests your disk can handle. This applies to the COPY part of the dump, which is likely I/O bound. But a restore also creates indexes, which uses CPU time (in addition to I/O). So you should also not exceed the number of CPU cores available.
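As a hedged starting point that ties those two limits together (the database name and dump path below are placeholders):
# Use the physical core count as a first upper bound for --jobs, then tune
# downwards if the disk becomes the bottleneck during the COPY phase.
JOBS=$(sysctl -n hw.physicalcpu)
pg_restore --jobs="$JOBS" --dbname=loadtest_db /path/to/dump.custom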

Preferred approach for running one script over multiple directories in SLURM

My most typical use case is running a single script over multiple directories (usually R or Matlab). I have access to a high-performance computing environment (SLURM-based). From my research so far, it is unclear to me which of the following approaches would be preferred to make most efficient use of the CPUs/cores available. I also want to make sure I'm not unnecessarily taking up system resources so I'd like to double check which of the following two approaches is most suitable.
Approach 1:
Parallelize code within the script (MPI).
Wrap this in a loop that applies the script to all directories.
Submit this as a single MPI job as a SLURM script.
Approach 2:
Parallelize code within the script (MPI).
Create an MPI job array, one job per directory, each running the script on the directory.
I'm still new to this so if I've mixed up something here or you need more details to answer the question please let me know.
If you do not explicitly use MPI inside your original R or Matlab script, I suggest you avoid using MPI at all and use job arrays.
Assuming you have a script myscript.R and a set of subdirectories data01, data02, ..., data10, and the script takes the name of the directory as an input parameter, you can do the following.
Create a submission script in the directory parent of the data directories:
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem-per-cpu=2G
#SBATCH --time 1-0
#SBATCH --array=0-9       # one task per data directory (Bash arrays are zero-indexed)
DIRS=(data*/)             # Create a Bash array with all data directories
module load R
Rscript myscript.R "${DIRS[$SLURM_ARRAY_TASK_ID]}"   # Feed the script with the data directory
                                                     # corresponding to the task ID in the array
This script will create a job array where each job runs myscript.R with one of the data directories as its argument.
Of course you will need to adapt the values of the memory and time, and investigate whether or not using more than one CPU per job is beneficial in your case. And adapt the --array parameter to the actual number of directories in your case.
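If the number of directories changes between runs, one option (a sketch; submit.sh stands for the submission script above) is to size the array at submission time instead of hard-coding it, since command-line options override the #SBATCH directives:
# Count the data directories and create a matching, zero-indexed job array.
NDIRS=$(ls -d data*/ | wc -l)
sbatch --array=0-$(( NDIRS - 1 )) submit.sh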
The answer is quite obvious to me, assuming that getting good parallelism is usually difficult.
In the first approach, you ask SLURM for a set of resources, but if you ask for many CPUs you will probably waste quite a lot of them (if you ask for 32 CPUs and your speedup is only 4x, you are wasting 28 CPUs). So you will end up using a small portion of the cluster, processing one folder after the other.
In the second approach, you ask SLURM to run one job per folder. There will be many jobs running simultaneously and they can each ask for fewer resources. Say you ask for 4 CPUs per job (and the speedup is 3x, which means you waste 1 CPU per job). Running 8 jobs simultaneously takes the same 32 CPUs as the first approach, but only 8 CPUs are wasted and 8 folders are processed simultaneously.
In the end, the decision has to be taken after seeing what the speedup is with different numbers of CPUs, but my feeling is that the second approach will generally be preferred, unless you get a very good speedup, in which case both approaches are equivalent.
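Concretely, with the illustrative numbers above (4 CPUs per job, 8 folders), approach 2 only changes a couple of lines in a submission script like the one in the other answer, for example:
#SBATCH --array=0-7           # 8 folders, one array task each
#SBATCH --cpus-per-task=4     # 4 CPUs per task; about 32 CPUs in use when all 8 run at once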

Why do my results become un-reproducible when run from qsub?

I'm running Matlab on a cluster. When I run my .m script from an interactive Matlab session on the cluster, my results are reproducible. But when I run the same script from a qsub command, as part of an array job away from my watchful eye, I get believable but unreproducible results. The .m files are doing exactly the same thing, including saving the results as .mat files.
Does anyone know why the scripts give reproducible results run one way, and unreproducible results run the other way?
Is this only a problem with reproducibility, or is it indicative of inaccurate results?
%%%%%
Thanks to spuder for a helpful answer. Just in case anyone stumbles upon this and is interested, here is some further information.
If you use more than one thread in Matlab jobs, this may result in stealing resources from other jobs, which plays havoc with the results. So you have 2 options:
1. Request exclusive access to a node. The cluster I am using does not currently allow parallel array jobs, so doing this was very wasteful for me - I took a whole node but used it serially.
2. Ask Matlab to run with a single computational thread (-singleCompThread); a sketch of this follows below. This may make your script take longer to complete, but it gets jobs through the queue faster.
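A minimal sketch of option 2 as an SGE array job (the function name run_my_analysis, the job name, and the array range are made up for illustration):
#!/bin/bash
#$ -N matlab-array
#$ -t 1-10                     # hypothetical array of 10 tasks
# -singleCompThread pins each Matlab instance to a single computational thread,
# so array tasks do not steal CPU from other jobs sharing the node.
matlab -nodisplay -nosplash -singleCompThread \
-r "run_my_analysis($SGE_TASK_ID); exit"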
There are a lot of variables at play. Ruling out transient issues such as network performance and load, here are a couple of possible explanations:
You are getting assigned a different batch of nodes when you run an interactive job from when you use qsub.
I've seen some sites that assign a policy of 'exclusive' to the nodes that run interactive jobs and 'shared' to nodes that run queued 'qsub' jobs. If that is the case, then you will almost always see better performance on the exclusive nodes.
Another answer might be that the interactive jobs are assigned to nodes that have less network congestion.
Additionally, if you are requesting multiple nodes, and you happen to land on nodes that traverse multiple hops, then you could be seeing significant network slowdowns. The solution would be for the cluster administrator to set up nodesets.
Are you using multiple nodes for the job? How are you requesting the resources?

What is the Overhead of matlabpool?

Could anyone tell me what the overhead of running a matlabpool is?
I started a matlabpool :
matlabpool open 132procs 100
Starting matlabpool using the '132procs' configuration ... connected to 100 labs.
And followed CPU usage on the nodes with:
pdsh -A ps aux |grep dmlworker
When I launch the matlabpool, it starts with ~35% CPU usage on average, and when the pool
is not being used it slowly (in 5-7 minutes) goes down to ~2% on average.
Is this normal? What is the typical overhead? Does that change if the matlabpool job is launched as a "batch" job?
This is normal. ps aux reports the average CPU utilization since the process was started, not over a rolling window. This means that, although the workers initialize relatively quickly and then become idle, it will take longer for this to reflect in CPU%. This is different to the Linux top command, for example, which will reflect the utilization since the last screen update in %CPU.
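To see a rolling-window figure instead, something like the following (an illustrative check, not part of the original answer) takes two top samples five seconds apart; only the second batch of lines reflects CPU use over that interval:
# -b batch mode, -c show full command lines so the workers can be grepped,
# -n 2 -d 5 take two samples five seconds apart.
pdsh -A 'top -b -c -n 2 -d 5' | grep dmlworker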
As for typical overhead, this depends on a number of factors: clearly the number of workers, the rate and data size of jobs submitted (as well as in maintaining the worker processes, there is some overhead in marshalling input and output, which is not part of "useful computation"), whether the Matlab pool is local or attached to a job manager, and the Matlab version and O/S.
From experience, as a rough guide on a modern *nix server, I would think an idle worker should not be consuming more than 20% of a single core (e.g. <~1% total CPU utilization on a 16-core box) after initialization, unless there is a configuration issue. I would not expect this to be influenced by what kind of jobs you are submitting (whether using "createJob" or "batch" or "parfor", for example): the workers and the communication mechanisms underneath are essentially the same.

Does a modified command invocation tool – which dynamically regulates a job pool according to load – already exist?

Fellow Unix philosophers,
I programmed some tools in Perl that have a part which can run in parallel. I outfitted them with a -j (jobs) option like make and prove have because that's sensible. However, soon I became unhappy with that for two reasons.
I specify --jobs=2 because I have two CPU cores, but I should not need to tell the computer information that it can figure out by itself.
Runs of the tool rarely occupy more than 20% CPU (I/O load is also low), wasting time by not utilising the CPU to a better extent.
I hacked some more to add load measuring: additional jobs are spawned while there is still »capacity«, until a load threshold is reached, at which point the number of jobs stays more or less constant. But when, during the course of a run, other processes with higher priority demand more CPU, fewer new jobs are spawned over time and the number of jobs shrinks accordingly.
Since this responsibility was repeated code in the tools, I factored out the scheduling aspect into a stand-alone tool in the spirit of nice et al. The parallel tools are quite dumb now; they only have signal handlers through which they are told to increase or decrease the job pool, whereas the intelligence of measuring load and deciding when to adjust the pool resides in the scheduler.
Taste of the tentative interface (I also want to provide sensible defaults so options can be omitted):
run-parallel-and-schedule-job-pool \
--cpu-load-threshold=90% \
--disk-load-threshold='300 KiB/s' \
--network-load-threshold='1.2 MiB/s' \
--increase-pool='/bin/kill -USR1 %PID' \
--decrease-pool='/bin/kill -USR2 %PID' \
-- \
parallel-something-master --MOAR-OPTIONS
Before I put effort into the last 90%, do tell me, am I duplicating someone else's work? The concept is quite obvious, so it seems it should have been done already, but I couldn't find this as a single responsibility stand-alone tool, only as deeply integrated part of larger many-purpose sysadmin suites.
Bonus question: I already know runN and parallel. They do parallel execution, but do not have the dynamic scheduling (niceload goes into that territory, but is quite primitive). If, against my expectations, the stand-alone tool does not exist yet, am I better off extending runN myself or filing a wish against parallel?
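For comparison, and as a rough sketch rather than an equivalent of the proposed tool: GNU parallel's --load option gates the start of new jobs on the load average, which covers the CPU-threshold part of the idea, though not the disk/network thresholds or the signal-driven resizing of an already running pool. Here do-one-unit and the inputs are hypothetical:
# Run do-one-unit once per input, at most one job per core, but hold back new
# jobs while the load average is at or above 90% of the number of cores.
parallel --load 90% -j +0 do-one-unit {} ::: inputs/*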
Some of our users are quite happy with Condor. It is a system for dynamically distributing jobs to other workstations and servers according to their free computing resources.