Slurm - run more than one array on the same node? - matlab

I am running around 200 MATLAB codes on a Slurm cluster. The codes are not explicitly parallelized, but they use intensive vectorized operations, so each one uses around 5-6 cores of processing power.
The sbatch code I am using is below:
#!/bin/bash
#SBATCH --job-name=sdmodel
#SBATCH --output=logs/out/%a
#SBATCH --error=logs/err/%a
#SBATCH --nodes=1
#SBATCH --partition=common
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --array=1-225
module load Matlab/R2021a
matlab -nodisplay -r "run('main_cluster2.m'); exit"
Now the script above will assign one full cluster node to each MATLAB task (225 such tasks). However, some cluster nodes have 20 or more cores, which means I could efficiently use one node to run 3 or 4 tasks simultaneously. Is there any way to modify the script above to do so?

Provided the cluster is configured to allow node sharing, you can remove the line #SBATCH --exclusive, which requests that a full node be allocated to each job in the array, and replace it with
#SBATCH --cpus-per-task=5
to request 5 CPUs on the same node for each job in the array.
On a 20-core node, Slurm will be able to place 4 such jobs.
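For concreteness, the whole submission script could then look something like the sketch below. Note that --mem=0 requests all of a node's memory, which would also prevent sharing, so it is replaced here with a per-CPU request (the 4G figure is a placeholder, not something from the question):
#!/bin/bash
#SBATCH --job-name=sdmodel
#SBATCH --output=logs/out/%a
#SBATCH --error=logs/err/%a
#SBATCH --nodes=1
#SBATCH --partition=common
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=4G
#SBATCH --array=1-225
module load Matlab/R2021a
matlab -nodisplay -r "run('main_cluster2.m'); exit"
It may also be worth capping MATLAB's implicit multithreading (for example with maxNumCompThreads(5) inside the script) so each job stays within its 5-CPU allocation.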

If node sharing is not allowed, you should be able to use multiple srun commands in the script to subdivide the node. If you wanted to use 4 cores per task on a 20-core node, your script would change to something like:
#!/bin/bash
#SBATCH --job-name=sdmodel
#SBATCH --output=logs/out/%a
#SBATCH --error=logs/err/%a
#SBATCH --nodes=1
#SBATCH --partition=common
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --array=1-225
module load Matlab/R2021a
for i in $(seq 1 5)
do
srun --ntasks=1 --cpus-per-task=4 --exact matlab -nodisplay -r "run('main_cluster2.m'); exit" &
done
wait
The "&" at the end of each srun command puts the command into the background so you can skip onto launching multiple copies. The wait at the end makes sure the script waits for all backgrounded processes to finish before exiting.
Note, this may lead to wasted resources if each of the individual matlab commands take very different amounts of time as some runs will finish before others, leaving cores idle.
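One detail the loop leaves open is that all five srun copies execute the same main_cluster2.m. If each copy is meant to handle a different case, a possible variant is sketched below; it assumes the array is shrunk to 45 tasks of 5 runs each and that main_cluster2.m is adapted to read a hypothetical run_id variable instead of the array index:
# 45 array tasks x 5 runs each = 225 cases (assumed split)
#SBATCH --array=1-45
# ... other #SBATCH lines as in the script above ...
module load Matlab/R2021a
for i in $(seq 1 5)
do
    # map (array task, copy) to a unique case number 1..225 (hypothetical scheme)
    run_id=$(( (SLURM_ARRAY_TASK_ID - 1) * 5 + i ))
    srun --ntasks=1 --cpus-per-task=4 --exact \
        matlab -nodisplay -r "run_id=${run_id}; run('main_cluster2.m'); exit" &
done
wait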

Related

qsub -t job "has died"

I am trying to submit an array job to our cluster using qsub. The script is like:
#!/bin/bash
#PBS -l nodes=1:ppn=1 # Number of nodes and processor
#..... (Other options)
#PBS -t 0-50 # List job
cd $PBS_O_WORKDIR
./programname << EOF
some parameters
EOF
This script runs without a problem when I remove the -t option, but every time I add -t, I get the following output:
---------------------------------------------
Check nodes and clean them of stray processes
---------------------------------------------
Checking node XXXXXXXXXX
-> User XXXX running job XXXXX.XXX:state=X:ncpus=X
-> Job XXX.XXX has died
Done clearing all the allocated nodes
------------------------------------------------------
Concluding PBS prologue script - XX-XX-XXXX XX:XX:XX
------------------------------------------------------
-------------- Job will be requeued --------------
At that point the job died and was requeued. There is no error message, and I did not find any similar issue online. Has anyone experienced this before? Thank you!
(I wrote another "manual" array qsub script which works, but I would prefer to get the -t option working, as it is much cleaner.)

How to avoid running code on head node of the cluster

I am using a cluster to run my code. I submit it with a runm file; the runm script is below:
#!/bin/sh
#SBATCH --job-name="....."
#SBATCH -n 4
#SBATCH --output=bachoutput
#SBATCH --nodes=1-1
#SBATCH -p all
#SBATCH --time=1-01:00:00
matlab < znoDisplay.m > o1
Today, while my code was running, I received an email from the cluster administrator asking me not to run code on the head node and to use the other nodes instead. I searched a lot but could not find out how to move my job from the head node to the other nodes. Is there anything I can add to runm to change this?
How can I avoid running my code on the head node?
If the MATLAB process was running on the head node, it means you did not submit your script to the scheduler; you most probably simply ran it directly.
Make sure to submit it with
sbatch runm
Then you can see it waiting in the queue (or running) with
squeue -u $USER
and check that it is not running on the frontend with
top
Also note @atru's comment about the MATLAB options -nodisplay and -nosplash, which are needed for MATLAB to work properly in batch mode.
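Putting the pieces together, the MATLAB line inside runm could become something like the line below (a sketch; the file names are the ones from the question), and the whole file is then submitted with sbatch runm rather than executed directly:
# run the script non-interactively and capture its output in o1
matlab -nodisplay -nosplash < znoDisplay.m > o1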

How to iterate over a sequence in systemd?

We're migrating from Ubuntu 14 to Ubuntu 16.
I have the following upstart task:
description "Start multiple Resque workers."
start on runlevel [2345]
task
env NUM_WORKERS=4
script
for i in `seq 1 $NUM_WORKERS`
do
start resque-work-app ID=$i
done
end script
As you can see, I start 4 workers. There is then an upstart script that starts each of these workers:
description "Resque work app"
respawn
respawn limit 5 30
instance $ID
pre-start script
test -e /home/opera/bounties || { stop; exit 0; }
end script
exec sudo -u opera sh -c "<DO THE WORK>"
How do I do something similar in systemd? I'm particularly interested in how to iterate over a sequence of 4 and start a worker for each, so that I end up with a cluster of 4 workers.
systemd doesn't have an iteration syntax, but it still has features to help solve this problem. The related concepts that systemd provides are:
Target Units, which allow you to treat a related group of services as a single service.
Template Units, which allow you to easily launch new copies of an app based on a variable like an ID.
With systemd, you could run a one-time bash loop as part of setting up the service that would enable the desired number of workers:
for i in $(seq 1 4); do systemctl enable "resque-work-app@$i.service"; done
That presumes you have a resque-work-app@.service template file that includes something like:
[Install]
WantedBy=resque-work-app.target
And that you have a resque-work-app.target that contains something like:
[Unit]
Description=Resque Work App
[Install]
WantedBy=multi-user.target
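For completeness, a minimal resque-work-app@.service template unit might look like the sketch below; the ExecStart line reuses the <DO THE WORK> placeholder from the upstart job, and the user, path, and restart settings are taken from that job rather than from anything systemd itself requires:
[Unit]
Description=Resque work app %i
PartOf=resque-work-app.target
# mirrors the upstart pre-start check: only start if the directory exists
ConditionPathExists=/home/opera/bounties

[Service]
User=opera
ExecStart=/bin/sh -c "<DO THE WORK>"
# rough equivalent of upstart's respawn / respawn limit 5 30
Restart=on-failure
RestartSec=5

[Install]
WantedBy=resque-work-app.target
With the instances enabled as above, systemctl start resque-work-app.target brings up all enabled workers, and %i carries the worker ID inside each unit.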
See Also
How to create a virtual systemd service to stop/start several instances together?
man systemd.target
man systemd.unit
About Instances and Template Units

GNU parallel --jobs option using multiple nodes on cluster with multiple cpus per node

I am using gnu parallel to launch code on a high performance (HPC) computing cluster that has 2 CPUs per node. The cluster uses TORQUE portable batch system (PBS). My question is to clarify how the --jobs option for GNU parallel works in this scenario.
When I run a PBS script calling GNU parallel without the --jobs option, like this:
#PBS -lnodes=2:ppn=2
...
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
it looks like it only uses one CPU per node, and it also produces the following error stream:
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.
This looks like one error for each node. I don't understand the first part (bash: parallel: command not found), but the second part tells me each node is only using one CPU.
When I add the option -j2 to the parallel call, the errors go away, and I think it is using two CPUs per node. I am still a newbie to HPC, so my way of checking this is to output date-time stamps from my code (the dummy MATLAB code takes tens of seconds to complete). My questions are:
Am I using the --jobs option correctly? Is it correct to specify -j2 because I have 2 CPUs per node? Or should I be using -jN where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)?
It appears that GNU parallel attempts to determine the number of CPUs per node on its own. Is there a way that I can make this work properly?
Is there any meaning to the bash: parallel: command not found message?
Yes: -j is the number of jobs per node.
Yes: Install 'parallel' in your $PATH on the remote hosts.
Yes: It is a consequence of parallel missing from the remote $PATH.
GNU Parallel logs into the remote machine and tries to determine the number of cores there (by running parallel --number-of-cores); this fails because parallel is not in the remote $PATH, so it defaults to 1 CPU core per host. By giving -j2 you tell GNU Parallel not to try to determine the number of cores.
Did you know that you can also give the number of cores in the --sshlogin as 4/myserver? This is useful if you have a mix of machines with different numbers of cores.
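For example (a sketch; the hostnames are the ones from the warning messages above, and the slot counts are just for illustration), the per-host slot count can be written directly into the sshlogin list instead of passing -j:
# 4 job slots on galles087, 2 on galles108 (illustrative counts)
parallel --sshlogin 4/galles087,2/galles108 \
    matlab -nodisplay -r "primes1({})" ::: 10 20 30 40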
This is not an answer to the 3 primary questions, but I'd like to point out some other problems with the parallel statement in the first code block.
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
The shell expands $PBS_O_WORKDIR before parallel is executed. This means two things: (1) --env sees the expanded value (a directory path) rather than an environment variable name, so it effectively does nothing, and (2) the path is expanded as part of the command string, which removes the need to pass $PBS_O_WORKDIR to the remote side and is why there was no error.
The latest version of parallel, 20151022, has a --workdir option (although the tutorial lists it as alpha testing), which is probably the easiest solution. The parallel command line would look something like:
parallel --workdir $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodisplay -r "primes1({})" ::: 10 20 30 40
Final note: PBS_NODEFILE may list a host multiple times if more than one processor per node is requested from qsub. This may have implications for the number of jobs run, etc.
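If the intent is for -j to mean jobs per node, one possible workaround (a sketch, not part of the original answer) is to collapse the duplicate entries before handing the file to GNU parallel:
# keep each hostname once so -j2 really means two jobs per node (sketch)
sort -u $PBS_NODEFILE > unique_nodes.txt
parallel -j2 --workdir $PBS_O_WORKDIR --sshloginfile unique_nodes.txt \
    matlab -nodisplay -r "primes1({})" ::: 10 20 30 40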

Submission of Scala code to a cluster

Is it possible to run Akka code on Oracle Grid Engine using multiple nodes?
So if I use the actor model, which is a "message-passing model", is it possible to use Scala and the Akka framework to run my code on a distributed-memory system like a cluster or a grid?
If so, is there something similar to mpirun in MPI to run my program on different nodes? Can you give a submission example using Oracle Grid Engine?
How do I know, inside Scala, which node I am on and to how many nodes the job has been submitted?
Is it possible to communicate with other nodes through the actor model?
mpirun (or mpiexec on some systems) can run any kind of executable, even ones that don't use MPI. I currently use it to launch Java and Scala code on clusters. It can be tricky to pass arguments to the executable when calling mpirun, so you can use an intermediate script.
We use Torque/Maui scripts which are not compatible with GridEngine, but here is a script my colleague is currently using:
#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l nodes=10:ppn=1
#PBS -l pmem=45gb
#PBS -q spc
# Find the list of nodes in the cluster
id=$PBS_JOBID
nodes_fn="${id}.nodes"
# Config file
config_fn="human_stability_article.conf"
# Java command to call
java_cmd="java -Xmx10g -cp akka/:EvoProteo-assembly-0.0.2.jar ch.unige.distrib.BuildTree ${nodes_fn} ${config_fn} ${id}"
# Create a small script to pass properly the parameters
aktor_fn="./${id}_aktor.sh"
echo -e "${java_cmd}" >> $aktor_fn
# Copy the machine file to the proper location
rm -f $nodes_fn
cp $PBS_NODEFILE $nodes_fn
# Launch the script on 10 nodes
mpirun -np 10 sh $aktor_fn > "${id}_human_stability_out.txt"