I have a setup consisting of 3 workers and a management node, which I use for submitting tasks. I would like to execute a setup script concurrently on all workers:
bsub -q queue -n 3 -m 'h0 h1 h2' -J "%J_%I" mpirun setup.sh
As far as I understand, I could use the 'ptile' resource constraint to force execution on all workers:
bsub -q queue -n 3 -m 'h0 h1 h2' -J "%J_%I" -R 'span[ptile=1]' mpirun setup.sh
However, I occasionally face an issue where my script gets executed several times on the same worker.
Is this expected behavior, or is there a bug in my setup? Is there a better way of enforcing multi-worker execution?
Your understanding of span[ptile=1] is correct. LSF will only use 1 core per host for your job. If there aren't enough hosts based on the -n value, the job will pend until something frees up.
However, I occasionally face an issue where my script gets executed several times on the same worker.
I suspect it's something in your script. For example, LSF appends to the stdout file by default; use -oo to overwrite it.
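For example (the output file name here is only an illustration), something like:
bsub -q queue -n 3 -m 'h0 h1 h2' -J "%J_%I" -R 'span[ptile=1]' -oo setup.%J.out mpirun setup.sh
truncates the output file on resubmission instead of appending to it, so you can tell whether the extra lines come from multiple executions or just from old output.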
I am running a script multiple times over a multi-node cluster, and this script processes data sequentially across the cluster. Here is the code:
import os
os.system("srun -p rs2 --mem-per-cpu 200G -t 7-23:00:00 python3 /home/usr/Sim/sim.py aok; srun -p rs2 python3 boguspython.py")
The issue is that if the first statement, i.e.
srun -p rs2 --mem-per-cpu 200G -t 7-23:00:00 python3 /home/usr/Sim/sim.py aok
needs to wait for an allocation of resources, then the second statement executes immediately. The second command relies on the first command having fully executed. Is there a way to make the second statement wait until the first statement gets its allocation and fully finishes?
I would suggest using subprocess.Popen instead of os.system. That lets you call .wait() on each command, blocking until it completes, and then check its return status.
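A minimal sketch of that approach, reusing the two srun commands from the question (partition, memory, and paths are simply copied from it, not verified):

import subprocess

# start the first srun and block until the allocation is granted and the job finishes
first = subprocess.Popen(["srun", "-p", "rs2", "--mem-per-cpu", "200G",
                          "-t", "7-23:00:00",
                          "python3", "/home/usr/Sim/sim.py", "aok"])
first.wait()

# only launch the second srun once the first one has exited successfully
if first.returncode == 0:
    subprocess.Popen(["srun", "-p", "rs2", "python3", "boguspython.py"]).wait()

Because .wait() does not return until srun itself exits (which includes the time it spends waiting for an allocation), the second command cannot start early.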
I got stuck for a while using a custom scheduler for Celery beat:
celery -A myapp worker -S redbeat.RedBeatScheduler -E -B -l info
My thought was that this would launch both the Celery worker and Celery beat using redbeat.RedBeatScheduler as the scheduler. It even prints beat: Starting..., however it apparently does not use the specified scheduler. No cron tasks are executed like this.
However, when I split this command into a separate worker and beat, that is:
celery -A myapp worker -E -l info
celery -A myapp beat -S redbeat.RedBeatScheduler
Everything works as expected.
Is there any way to merge those two commands?
I do not think the Celery worker has the -S parameter like beat does. Here is what --help says:
  --scheduler SCHEDULER
                        Scheduler class to use. Default is
                        celery.beat.PersistentScheduler
So I suggest you use the --scheduler option and run celery -A myapp worker --scheduler redbeat.RedBeatScheduler -E -B -l info instead.
I planned to start emacs from start.sh as follows:
$ head start.sh
#! /bin/bash
{
    # starting emacs servers
    emacs --daemon=orging
    emacs --daemon=coding
    # waiting...
    # invoke emacsclients
    emacsclient -c -s "orging" &
    emacsclient -c -s "coding" &
    ......
} &> /dev/null
The two clients run under the orging and coding servers, respectively.
The problem with this setup is that the running clients are not labelled with their server names,
so a manual step of testing might be needed to determine which is which.
As an alternative, the servers could be started in a fixed order in start.sh, one at the top and the other at the end.
How can I determine, in a straightforward way, which server a client frame is attached to?
You can inspect the variable server-name: interactively with C-h v server-name RET, or by evaluating it from within a client frame with M-: server-name RET.
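If you want each client frame labelled automatically, one possible sketch (my own suggestion, assuming both daemons share the same init file) is to show the server name in the frame title:
(setq frame-title-format '("%b @ " server-name))
A frame opened with emacsclient -c -s "orging" will then show orging in its title bar.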
Hi great people of Stack Overflow,
We're hosting a Docker container on EB with Node.js-based code running in it.
When redeploying our Docker container, we'd like the old one to do a graceful shutdown.
I've found help and guides on how our code could receive the SIGTERM signal produced by the 'docker stop' command.
However, further investigation on the EB machine running Docker, at:
/opt/elasticbeanstalk/hooks/appdeploy/enact/01flip.sh
shows that when "flipping" from the current to the new staged container, the old one is killed with 'docker kill'.
Is there any way to change this behaviour to docker stop?
Or, in general, is there a recommended approach to handling a graceful shutdown of the old container?
Thanks!
Self-answering, as I've found a solution that works for us:
tl;dr: use .ebextensions scripts to run your own script before 01flip; your script will make sure a graceful shutdown of whatever is inside the Docker container takes place.
First,
your app (or whatever you're running in Docker) has to be able to catch a signal, SIGINT for example, and shut down gracefully upon it.
This is totally unrelated to Docker; you can test it running anywhere (locally, for example).
There is a lot of info on the net about getting this kind of behaviour for different kinds of apps (be it Ruby, Node.js, etc.).
Second,
your EB/Docker-based project can have a .ebextensions folder that holds all kinds of scripts to execute while deploying.
We put 2 custom scripts into it, gracefulshutdown_01.config and gracefulshutdown_02.config, which look something like this:
# gracefulshutdown_01.config
commands:
  backup-original-flip-hook:
    command: cp -f /opt/elasticbeanstalk/hooks/appdeploy/enact/01flip.sh /opt/elasticbeanstalk/hooks/appdeploy/01flip.sh.bak
    test: '[ ! -f /opt/elasticbeanstalk/hooks/appdeploy/01flip.sh.bak ]'
  cleanup-custom-hooks:
    command: rm -f 05gracefulshutdown.sh
    cwd: /opt/elasticbeanstalk/hooks/appdeploy/enact
    ignoreErrors: true
and:
# gracefulshutdown_02.config
commands:
  reorder-original-flip-hook:
    command: mv /opt/elasticbeanstalk/hooks/appdeploy/enact/01flip.sh /opt/elasticbeanstalk/hooks/appdeploy/enact/10flip.sh
    test: '[ -f /opt/elasticbeanstalk/hooks/appdeploy/enact/01flip.sh ]'

files:
  "/opt/elasticbeanstalk/hooks/appdeploy/enact/05gracefulshutdown.sh":
    mode: "000755"
    owner: root
    group: root
    content: |
      #!/bin/sh
      # find currently running docker
      EB_CONFIG_DOCKER_CURRENT_APP_FILE=$(/opt/elasticbeanstalk/bin/get-config container -k app_deploy_file)
      EB_CONFIG_DOCKER_CURRENT_APP=""
      if [ -f $EB_CONFIG_DOCKER_CURRENT_APP_FILE ]; then
        EB_CONFIG_DOCKER_CURRENT_APP=`cat $EB_CONFIG_DOCKER_CURRENT_APP_FILE | cut -c 1-12`
        echo "Graceful shutdown on app container: $EB_CONFIG_DOCKER_CURRENT_APP"
      else
        echo "NO CURRENT APP TO GRACEFUL SHUTDOWN FOUND"
        exit 0
      fi

      # give graceful kill command to all running .js files (not stats!!)
      docker exec $EB_CONFIG_DOCKER_CURRENT_APP sh -c "ps x -o pid,command | grep -E 'workers' | grep -v -E 'forever|grep' " | awk '{print $1}' | xargs docker exec $EB_CONFIG_DOCKER_CURRENT_APP kill -s SIGINT
      echo "sent kill signals"

      # wait (max 5 mins) until processes are done and terminate themselves
      TRIES=100
      until [ $TRIES -eq 0 ]; do
        PIDS=`docker exec $EB_CONFIG_DOCKER_CURRENT_APP sh -c "ps x -o pid,command | grep -E 'workers' | grep -v -E 'forever|grep' " | awk '{print $1}' | cat`
        echo TRIES $TRIES PIDS $PIDS
        if [ -z "$PIDS" ]; then
          echo "finished graceful shutdown of docker $EB_CONFIG_DOCKER_CURRENT_APP"
          exit 0
        else
          let TRIES-=1
          sleep 3
        fi
      done

      echo "failed to graceful shutdown, please investigate manually"
      exit 1
gracefulshutdown_01.config is a small util that backs up the original 01flip and deletes (if it exists) our custom script.
gracefulshutdown_02.config is where the magic happens.
It creates a 05gracefulshutdown enact script and makes sure the flip will happen afterwards by renaming 01flip to 10flip.
05gracefulshutdown, the custom script, basically does this:
find the currently running Docker container
find all processes that need to be sent a SIGINT (for us it's the processes with 'workers' in their name)
send a SIGINT to the above processes
loop:
check if the processes from before have exited
continue looping for a set number of tries
if the tries run out, exit with status 1 and don't continue to 10flip; manual intervention is needed.
This assumes you only have 1 Docker container running on the machine, and that you are able to manually hop on to check what's wrong in case it fails (for us this has never happened yet).
I imagine it can also be improved in many ways, so have fun.
I started a Dancer/Starman server using:
sudo plackup -s Starman -p 5001 -E deployment --workers=10 -a mywebapp/bin/app.pl
but I'm unsure how I can stop the server. Can someone provide me with a quick way of stopping it and all the workers it has spawned?
Use the
--pid /path/to/the/pid.file
option, and you can kill the process based on its PID.
So, using the above options, you can use
kill $(cat /path/to/the/pid.file)
The pid.file simply stores the master process's PID; there's no need to analyze the ps output.
pkill -f starman
Kill processes based on name.
On Windows, you can press CTRL + C in the console (the same keystroke as copy, but here it means Cancel). Tested working.