I am using a cluster to run my code, via a runm file. The runm script is shown below:
#!/bin/sh
#SBATCH --job-name="....."
#SBATCH -n 4
#SBATCH --output=bachoutput
#SBATCH --nodes=1-1
#SBATCH -p all
#SBATCH --time=1-01:00:00
matlab < znoDisplay.m > o1
Today, while my code was running, I received an email from the cluster administrator saying: please don't run your code on the head node, use the other nodes instead. I searched a lot, but I couldn't find out how to move my job from the head node to the other nodes. Could anyone help me? Is there anything that can be added to runm to change this?
Could anyone help me avoid running my code on the head node?
If the Matlab process was running on the head node, it means you did not submit your script; most probably you simply ran it directly.
Make sure to submit it with
sbatch runm
Then you can see it waiting in the queue (or running) with
squeue -u $USER
and check that it is not running on the frontend with
top
Also note @atru's comment about the Matlab options -nodisplay and -nosplash, needed for Matlab to work properly in batch mode.
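Putting it together, a minimal sketch of a corrected runm (same directives as in the question, with the batch-mode options added; the job name is a placeholder):
#!/bin/sh
#SBATCH --job-name="myjob"
#SBATCH -n 4
#SBATCH --output=bachoutput
#SBATCH --nodes=1-1
#SBATCH -p all
#SBATCH --time=1-01:00:00
matlab -nodisplay -nosplash < znoDisplay.m > o1
Submitted with sbatch runm, the Matlab process then runs on a compute node chosen by Slurm instead of on the head node.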
Related
I am running around 200 Matlab codes on a Slurm cluster. The codes are not parallelized but use intensive vectorized notation, so each one uses around 5-6 cores of processing power.
The sbatch code I am using is below:
#!/bin/bash
#SBATCH --job-name=sdmodel
#SBATCH --output=logs/out/%a
#SBATCH --error=logs/err/%a
#SBATCH --nodes=1
#SBATCH --partition=common
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --array=1-225
module load Matlab/R2021a
matlab -nodisplay -r "run('main_cluster2.m'); exit"
Now, the code above will assign one cluster node to each Matlab task (225 such tasks). However, some cluster nodes have 20 or more cores, which means I could efficiently use one node to run 3 or 4 tasks simultaneously. Is there any way to modify the above code to do so?
Provided the cluster is configured to allow node sharing, you can remove the line #SBATCH --exclusive, which requests that a full node be allocated to each job in the array, and replace it with
#SBATCH --cpus-per-task=5
to request 5 CPUs on the same node for each job in the array.
On a 20-core node, Slurm will be able to place 4 such jobs.
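For reference, a minimal sketch of the modified script under that assumption. Note that #SBATCH --mem=0 requests all of a node's memory and would also prevent sharing, so it is replaced here with an illustrative per-job amount; adjust it to your jobs' actual needs:
#!/bin/bash
#SBATCH --job-name=sdmodel
#SBATCH --output=logs/out/%a
#SBATCH --error=logs/err/%a
#SBATCH --nodes=1
#SBATCH --partition=common
#SBATCH --cpus-per-task=5
#SBATCH --mem=20G
#SBATCH --array=1-225
module load Matlab/R2021a
matlab -nodisplay -r "run('main_cluster2.m'); exit"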
If node sharing is not allowed, then you should be able to use multiple srun commands in the script to subdivide the node. If you wanted to use 4 cores per task (on a 20-core node), your script would change to something like:
#!/bin/bash
#SBATCH --job-name=sdmodel
#SBATCH --output=logs/out/%a
#SBATCH --error=logs/err/%a
#SBATCH --nodes=1
#SBATCH --partition=common
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --array=1-225
module load Matlab/R2021a
for i in $(seq 1 5)
do
srun --ntasks=1 --cpus-per-task=4 --exact matlab -nodisplay -r "run('main_cluster2.m'); exit" &
done
wait
The "&" at the end of each srun command puts the command into the background so you can skip onto launching multiple copies. The wait at the end makes sure the script waits for all backgrounded processes to finish before exiting.
Note, this may lead to wasted resources if each of the individual matlab commands take very different amounts of time as some runs will finish before others, leaving cores idle.
I am trying to submit an array job to our cluster using qsub. The script looks like this:
#!/bin/bash
#PBS -l nodes=1:ppn=1 # Number of nodes and processor
#..... (Other options)
#PBS -t 0-50 # List job
cd $PBS_O_WORKDIR
./programname << EOF
some parameters
EOF
This script runs without a problem when the -t option is removed, but every time I add -t, I get the following output:
---------------------------------------------
Check nodes and clean them of stray processes
---------------------------------------------
Checking node XXXXXXXXXX
-> User XXXX running job XXXXX.XXX:state=X:ncpus=X
-> Job XXX.XXX has died
Done clearing all the allocated nodes
------------------------------------------------------
Concluding PBS prologue script - XX-XX-XXXX XX:XX:XX
------------------------------------------------------
-------------- Job will be requeued --------------
At that point the job died and was requeued, with no error message. I did not find any similar issue online. Has anyone experienced this before? Thank you!
(I wrote another "manual" array qsub script which works, but I do wish to get -t working, as it sits in the command options and is much cleaner.)
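For context, the "manual" workaround mentioned above is presumably a submit loop along these lines (the script name and variable are placeholders, not the asker's actual code):
for i in $(seq 0 50)
do
qsub -v TASK_ID=$i single_run.pbs
done
where single_run.pbs reads $TASK_ID in place of the $PBS_ARRAYID variable that -t would provide.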
We're migrating from Ubuntu 14 to Ubuntu 16.
I have the following upstart task:
description "Start multiple Resque workers."
start on runlevel [2345]
task
env NUM_WORKERS=4
script
for i in `seq 1 $NUM_WORKERS`
do
start resque-work-app ID=$i
done
end script
As you can see, I'm starting 4 workers. There is then an upstart script that starts each one of these workers:
description "Resque work app"
respawn
respawn limit 5 30
instance $ID
pre-start script
test -e /home/opera/bounties || { stop; exit 0; }
end script
exec sudo -u opera sh -c "<DO THE WORK>"
How do I do something similar in systemd? I'm particularly interested in how to iterate over a sequence of 4, and start a worker for each - this way, I'd have a cluster of 4 workers.
systemd doesn't have an iteration syntax, but it still has features to help solve this problem. The related concepts that systemd provides are:
Target Units, which allow you to treat a related group of services as a single service.
Template Units, which allow you to easily launch new copies of an app based on a variable like an ID.
With systemd, you could run a one-time bash loop as part of setting up the service to enable the desired number of workers:
for i in $(seq 1 4); do systemctl enable resque-work-app@$i; done
That presumes you have a resque-work-app@.service template file that includes something like:
[Install]
WantedBy=resque-work-app.target
And that you have a resque-work-app.target that contains something like:
[Unit]
Description=Resque Work App
[Install]
WantedBy=multi-user.target
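For completeness, a minimal sketch of what the full resque-work-app@.service template might look like (the ExecStart command is the placeholder from the upstart script, and the Restart and PartOf settings are illustrative assumptions, not a known-good configuration):
[Unit]
Description=Resque work app %i
PartOf=resque-work-app.target

[Service]
User=opera
ExecStart=/bin/sh -c "<DO THE WORK>"
Restart=on-failure

[Install]
WantedBy=resque-work-app.target
Here %i expands to the instance ID, playing the role of the upstart ID variable, and User=opera replaces the sudo -u opera wrapper. With the instances enabled, systemctl start resque-work-app.target starts all four workers together.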
See Also
How to create a virtual systemd service to stop/start several instances together?
man systemd.target
man systemd.unit
About Instances and Template Units
Is it possible to run Akka code on Oracle Grid Engine with the use of multiple nodes?
So if I use the actor model, which is a message-passing model, is it possible to use Scala and the Akka framework to run my code on a distributed-memory system like a cluster or a grid?
If so, is there something similar to mpirun in MPI to run my program on different nodes? Can you give a submission example using Oracle Grid Engine?
How do I know, inside Scala, which node I am on and to how many nodes the job has been submitted?
Is it possible to communicate with other nodes through the actor model?
mpirun (or mpiexec on some systems) can run any kind of executable, even ones that don't use MPI; I currently use it to launch Java and Scala codes on clusters. It may be tricky to pass arguments to the executable when calling mpirun, so you can use an intermediate script.
We use Torque/Maui scripts, which are not compatible with Grid Engine, but here is a script my colleague is currently using:
#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l nodes=10:ppn=1
#PBS -l pmem=45gb
#PBS -q spc
# File that will hold the list of nodes allocated to the job
id=$PBS_JOBID
nodes_fn="${id}.nodes"
# Config file
config_fn="human_stability_article.conf"
# Java command to call
java_cmd="java -Xmx10g -cp akka/:EvoProteo-assembly-0.0.2.jar ch.unige.distrib.BuildTree ${nodes_fn} ${config_fn} ${id}"
# Create a small script to pass properly the parameters
aktor_fn="./${id}_aktor.sh"
echo -e "${java_cmd}" >> $aktor_fn
# Copy the machine file to the proper location
rm -f $nodes_fn
cp $PBS_NODEFILE $nodes_fn
# Launch the script on the 10 nodes
mpirun -np 10 sh $aktor_fn > "${id}_human_stability_out.txt"
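Since the question asked about Grid Engine specifically, here is a rough, untested sketch of how the same submission might translate to SGE. The parallel environment name (mpi) is a site-specific assumption, and SGE exposes the allocated hosts via $PE_HOSTFILE rather than $PBS_NODEFILE:
#!/bin/bash
#$ -S /bin/bash
#$ -l h_rt=24:00:00
#$ -pe mpi 10
#$ -cwd
id=$JOB_ID
nodes_fn="${id}.nodes"
# $PE_HOSTFILE has one line per host (hostname, slot count, ...); keep only the hostnames
awk '{print $1}' $PE_HOSTFILE > $nodes_fn
# ... build the ${id}_aktor.sh wrapper exactly as in the Torque script above ...
mpirun -np 10 sh "./${id}_aktor.sh" > "${id}_human_stability_out.txt"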
My server is part of a grid; we have 3 nodes, and any one of them could execute my script when I kick off the Autosys job.
My problem is that I am trying to stop a job from starting if it is already running. My code works when both instances (the first and the second) execute on the same node:
ps -ead -o %U%p%a | egrep '(ksh|perl)' | grep -v egrep | grep "perl .*myprocess.pl"
Is there a way ps could list all instances of the process from all nodes in the grid?
Please help!
You can create a start.flag file in a common location (on storage visible to all nodes) and apply the following conditions:
If the flag exists, the script removes it and runs; after the run completes, the script touches the flag again.
If the flag does not exist, the script simply exits, reporting that an instance is already running.
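A minimal shell sketch of that approach (the flag path and the perl command are placeholders; note the check-and-remove is not atomic, so a very narrow race window remains):
#!/bin/sh
FLAG=/shared/location/start.flag   # must live on storage visible to all nodes

if [ -e "$FLAG" ]; then
    rm -f "$FLAG"              # take the flag: this instance owns the run
    perl myprocess.pl          # the actual work
    touch "$FLAG"              # put the flag back so the next run may start
else
    echo "myprocess.pl already running on some node; exiting." >&2
    exit 1
fi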
Best of luck :)