Celery doesn't restart subprocesses

I have an issue with my celery deployment: when I restart it, the old subprocesses don't stop and continue to process some of the jobs. I use supervisord to run celery. Here is my config:
$ cat /etc/supervisor/conf.d/celery.conf
[program:celery]
; Full path to use virtualenv, honcho to load .env
command=/home/ubuntu/venv/bin/honcho run celery -A stargeo worker -l info --no-color
directory=/home/ubuntu/app
environment=PATH="/home/ubuntu/venv/bin:%(ENV_PATH)s"
user=ubuntu
numprocs=1
stdout_logfile=/home/ubuntu/logs/celery.log
stderr_logfile=/home/ubuntu/logs/celery.err
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
killasgroup=true
; if rabbitmq is supervised, set its priority higher
; so it starts first
priority=998
Here is how celery processes look:
$ ps axwu | grep celery
ubuntu 983 0.0 0.1 47692 10064 ? S 11:47 0:00 /home/ubuntu/venv/bin/python /home/ubuntu/venv/bin/honcho run celery -A stargeo worker -l info --no-color
ubuntu 984 0.0 0.0 4440 652 ? S 11:47 0:00 /bin/sh -c celery -A stargeo worker -l info --no-color
ubuntu 985 0.0 0.5 168720 41356 ? S 11:47 0:01 /home/ubuntu/venv/bin/python /home/ubuntu/venv/bin/celery -A stargeo worker -l info --no-color
ubuntu 990 0.0 0.4 167936 36648 ? S 11:47 0:00 /home/ubuntu/venv/bin/python /home/ubuntu/venv/bin/celery -A stargeo worker -l info --no-color
ubuntu 991 0.0 0.4 167936 36648 ? S 11:47 0:00 /home/ubuntu/venv/bin/python /home/ubuntu/venv/bin/celery -A stargeo worker -l info --no-color
When I run sudo supervisorctl restart celery it only stops the first process (the python ... honcho one) and all the others keep running. If I try to kill them they survive (kill -9 works).

This appeared to be a bug in honcho. I ended up with a workaround: starting this script from supervisor instead:
#!/bin/bash
# Activate the virtualenv, load the non-comment lines of .env into the
# environment, then exec so celery replaces this shell (and honcho's
# intermediate /bin/sh) as supervisord's direct child.
source /home/ubuntu/venv/bin/activate
exec env $(cat .env | grep -v ^# | xargs) \
    celery -A stargeo worker -l info --no-color
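With that script saved on the box and made executable, the supervisor program section just points at it instead of honcho. A minimal sketch, assuming the script lives at /home/ubuntu/app/run_celery.sh (the path is my assumption; adjust to wherever you save it):
[program:celery]
directory=/home/ubuntu/app
; hypothetical wrapper path - remember to chmod +x it
command=/home/ubuntu/app/run_celery.sh
user=ubuntu
autostart=true
autorestart=true
startsecs=10
stopwaitsecs=600
killasgroup=true
Because the wrapper execs celery, supervisord's stop signal now reaches the worker master directly instead of dying at the honcho/sh layer.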

Related

Supervisord sometimes starts celery, sometimes not

I'm deploying my Flask API on Kubernetes. The command executed when the container starts is the following:
supervisord -c /etc/supervisor/conf.d/celery.conf
gunicorn wsgi:app --bind=0.0.0.0:5000 --workers 1 --threads 12 --log-level=warning --access-logfile /var/log/gunicorn-access.log --error-logfile /var/log/gunicorn-error.log
As you can see above, I start celery first with supervisor and after that I run the gunicorn server. Content of celery.conf:
[supervisord]
logfile = /tmp/supervisord.log
logfile_maxbytes = 50MB
logfile_backups=10
loglevel = info
pidfile = /tmp/supervisord.pid
nodaemon = false
minfds = 1024
minprocs = 200
umask = 022
identifier = supervisor
directory = /tmp
nocleanup = true
[program:celery]
directory = /mydir/app
command = celery -A celery_worker.celery worker --loglevel=debug
When logged into my pods I can see that sometimes celery starts correctly (example from pod 1):
> more /tmp/supervisord.log
2021-06-08 18:19:46,460 CRIT Supervisor running as root (no user in config file)
2021-06-08 18:19:46,462 INFO daemonizing the supervisord process
2021-06-08 18:19:46,462 INFO set current directory: '/tmp'
2021-06-08 18:19:46,463 INFO supervisord started with pid 9
2021-06-08 18:19:47,469 INFO spawned: 'celery' with pid 15
2021-06-08 18:19:48,470 INFO success: celery entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Sometimes it does not (in pod 2):
> more /tmp/supervisord.log
2021-06-08 18:19:42,979 CRIT Supervisor running as root (no user in config file)
2021-06-08 18:19:42,988 INFO daemonizing the supervisord process
2021-06-08 18:19:42,988 INFO set current directory: '/tmp'
2021-06-08 18:19:42,989 INFO supervisord started with pid 9
2021-06-08 18:19:43,992 INFO spawned: 'celery' with pid 11
2021-06-08 18:19:44,994 INFO success: celery entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
>>>> 2021-06-08 18:19:58,642 INFO exited: celery (exit status 2; expected) <<<<<HERE
In my pod 1, a ps command shows the following:
> ps aux | grep celery
root 9 0.0 0.0 55308 16376 ? Ss 18:45 0:00 /usr/bin/python /usr/bin/supervisord -c /etc/supervisor/conf.d/celery.conf
root 23 2.2 0.8 2343684 352940 ? S 18:45 0:05 /usr/bin/python3 /usr/local/bin/celery -A celery_worker.celery worker --loglevel=debug
root 37 0.0 0.5 2341860 208716 ? S 18:46 0:00 /usr/bin/python3 /usr/local/bin/celery -A celery_worker.celery worker --loglevel=debug
root 38 0.0 0.5 2341864 208716 ? S 18:46 0:00 /usr/bin/python3 /usr/local/bin/celery -A celery_worker.celery worker --loglevel=debug
root 39 0.0 0.5 2341868 208716 ? S 18:46 0:00 /usr/bin/python3 /usr/local/bin/celery -A celery_worker.celery worker --loglevel=debug
root 40 0.0 0.5 2341872 208724 ? S 18:46 0:00 /usr/bin/python3 /usr/local/bin/celery -A celery_worker.celery worker --loglevel=debug
root 41 0.0 0.5 2341876 208728 ? S 18:46 0:00 /usr/bin/python3 /usr/local/bin/celery -A celery_worker.celery worker --loglevel=debug
root 42 0.0 0.5 2341880 208728 ? S 18:46 0:00 /usr/bin/python3 /usr/local/bin/celery -A celery_worker.celery worker --loglevel=debug
root 43 0.0 0.5 2341884 208736 ? S 18:46 0:00 /usr/bin/python3 /usr/local/bin/celery -A celery_worker.celery worker --loglevel=debug
root 44 0.0 0.5 2342836 211384 ? S 18:46 0:00 /usr/bin/python3 /usr/local/bin/celery -A celery_worker.celery worker --loglevel=debug
In pod 2, I can see that the supervisord process is still there, but I don't have all the individual /usr/local/bin/celery processes that I have in pod 1:
> ps aux | grep celery
root 9 0.0 0.0 55308 16296 ? Ss 18:19 0:00 /usr/bin/python /usr/bin/supervisord -c /etc/supervisor/conf.d/celery.conf
This behavior is not always the same. Sometimes when the pods are restarted both succeed in launching celery, sometimes neither does. In the latter scenario, if I make a request to my API that is supposed to launch a celery task, I can see on my broker console (RabbitMQ) that a task is created, but there is no message activity and nothing is written to my database table (the end result of my celery task).
If I start celery manually in my pods:
celery -A celery_worker.celery worker --loglevel=debug
everything works.
What could explain such behavior?
Following the comments above, the best solution is to have two containers: the first with gunicorn as its entrypoint and the second running the celery worker. If the second uses the same image as the first it works very well, and I can scale each container independently on Kubernetes. The only downside is that keeping the code in sync is more difficult: each time I make a code change for the first I must apply the same change manually to the second. Maybe there is a better way to address this specific code-sourcing issue.
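For illustration, a minimal sketch of the worker as its own Deployment (one way to get independent scaling), assuming both Deployments reference the same image; myregistry/myapp:latest and the labels are placeholders. The API Deployment would be identical apart from the gunicorn command and the exposed port, and since both pull the same image tag, a single image build covers both and there is no duplicate code to keep in sync:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
spec:
  replicas: 1                        # scale the worker independently of the API
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      containers:
        - name: celery-worker
          image: myregistry/myapp:latest    # same image as the API container
          command: ["celery", "-A", "celery_worker.celery", "worker", "--loglevel=info"]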

How to run a process in daemon mode with systemd service?

I've googled and read quite a few blogs and posts on this, and I've also been experimenting manually on my EC2 instance. However, I'm still not able to configure the systemd service unit so that it runs the process in the background as I expect. The process I'm running is the Nessus service. Here's my service unit definition:
$ cat /etc/systemd/system/nessusagent.service
[Unit]
Description=Nessus
[Service]
ExecStart=/opt/myorg/bin/init_nessus
Type=simple
[Install]
WantedBy=multi-user.target
and here is my script /opt/myorg/bin/init_nessus:
$ cat /opt/myorg/bin/init_nessus
#!/usr/bin/env bash
set -e
NESSUS_MANAGER_HOST=...
NESSUS_MANAGER_PORT=...
NESSUS_CLIENT_GROUP=...
NESSUS_LINKING_KEY=...
#-------------------------------------------------------------------------------
# link nessus agent with manager host
#-------------------------------------------------------------------------------
/opt/nessus_agent/sbin/nessuscli agent link --key=${NESSUS_LINKING_KEY} --host=${NESSUS_MANAGER_HOST} --port=${NESSUS_MANAGER_PORT} --groups=${NESSUS_CLIENT_GROUP}
if [ $? -ne 0 ]; then
echo "Cannot link the agent to the Nessus manager, quitting."
exit 1
fi
/opt/nessus_agent/sbin/nessus-service -q -D
When I run the service, I always get the following:
$ systemctl status nessusagent.service
● nessusagent.service - Nessus
Loaded: loaded (/etc/systemd/system/nessusagent.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2020-08-24 06:40:40 UTC; 9min ago
Process: 27787 ExecStart=/opt/myorg/bin/init_nessus (code=exited, status=0/SUCCESS)
Main PID: 27787 (code=exited, status=0/SUCCESS)
...
Aug 24 06:40:40 ip-10-27-0-104 init_nessus[27787]: + /opt/nessus_agent/sbin/nessuscli agent link --key=... --host=... --port=8834 --groups=...
Aug 24 06:40:40 ip-10-27-0-104 init_nessus[27787]: [info] [agent] HostTag::getUnix: setting TAG value to '8596420322084e3ab97d3c39e5c92e00'
Aug 24 06:40:40 ip-10-27-0-104 init_nessus[27787]: [info] [agent] Successfully linked to <myorg.com>:8834
Aug 24 06:40:40 ip-10-27-0-104 init_nessus[27787]: + '[' 0 -ne 0 ']'
Aug 24 06:40:40 ip-10-27-0-104 init_nessus[28506]: + /opt/nessus_agent/sbin/nessus-service -q -D
However, I can't see the process that I expect to see:
$ ps faux | grep nessus
root 28565 0.0 0.0 12940 936 pts/0 S+ 06:54 0:00 \_ grep --color=auto nessus
If I run the last command manually, I can see it:
$ /opt/nessus_agent/sbin/nessus-service -q -D
$ ps faux | grep nessus
root 28959 0.0 0.0 12940 1016 pts/0 S+ 07:00 0:00 \_ grep --color=auto nessus
root 28952 0.0 0.0 6536 116 ? S 07:00 0:00 /opt/nessus_agent/sbin/nessus-service -q -D
root 28953 0.2 0.0 69440 9996 pts/0 Sl 07:00 0:00 \_ nessusd -q
What is it that I'm missing here?
I eventually figured out that this was because of the extra -D option in the last command. Removing the -D option fixed the issue. Running the process in daemon mode inside a service manager is not the way to go: we need to run it in the foreground and let the service manager handle it.
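For reference, a sketch of how the end of the script looks with the flag dropped; nessus-service then stays in the foreground and systemd (Type=simple) tracks it as the unit's main process (the exec is optional but makes it the main PID directly):
# run in the foreground and let systemd do the supervising
exec /opt/nessus_agent/sbin/nessus-service -q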

Using Supervisord to manage mongos process

background
I am trying to automate restarting the mongos process used in a MongoDB sharded setup, in case of a crash or reboot.
Case 1 : using direct command, with mongod user
supervisord config
[program:mongos_router]
command=/usr/bin/mongos -f /etc/mongos.conf --pidfilepath=/var/run/mongodb/mongos.pid
user=mongod
autostart=true
autorestart=true
startretries=10
Result
supervisord log
INFO spawned: 'mongos_router' with pid 19535
INFO exited: mongos_router (exit status 0; not expected)
INFO gave up: mongos_router entered FATAL state, too many start retries too quickly
mongodb log
2018-05-01T21:08:23.745+0000 I SHARDING [Balancer] balancer id: ip-address:27017 started
2018-05-01T21:08:23.745+0000 E NETWORK [mongosMain] listen(): bind() failed errno:98 Address already in use for socket: 0.0.0.0:27017
2018-05-01T21:08:23.745+0000 E NETWORK [mongosMain] addr already in use
2018-05-01T21:08:23.745+0000 I - [mongosMain] Invariant failure inShutdown() src/mongo/db/auth/user_cache_invalidator_job.cpp 114
2018-05-01T21:08:23.745+0000 I - [mongosMain]
***aborting after invariant() failure
2018-05-01T21:08:23.748+0000 F - [mongosMain] Got signal: 6 (Aborted).
The process is seen running, but if killed it does not restart automatically.
Case 2 : Using init script
Here the slight change in the scenario is that some ulimit commands and the creation of pid files must be done as root, and then the actual process should be started as the mongod user.
mongos script
start()
{
# Make sure the default pidfile directory exists
if [ ! -d $PID_PATH ]; then
install -d -m 0755 -o $MONGO_USER -g $MONGO_GROUP $PIDDIR
fi
# Make sure the pidfile does not exist
if [ -f $PID_FILE ]; then
echo "Error starting mongos. $PID_FILE exists."
RETVAL=1
return
fi
ulimit -f unlimited
ulimit -t unlimited
ulimit -v unlimited
ulimit -n 64000
ulimit -m unlimited
ulimit -u 64000
ulimit -l unlimited
echo -n $"Starting mongos: "
#daemon --user "$MONGO_USER" --pidfile $PID_FILE $MONGO_BIN $OPTIONS --pidfilepath=$PID_FILE
#su $MONGO_USER -c "$MONGO_BIN -f $CONFIGFILE --pidfilepath=$PID_FILE >> /home/mav/startup_log"
su - mongod -c "/usr/bin/mongos -f /etc/mongos.conf --pidfilepath=/var/run/mongodb/mongos.pid"
RETVAL=$?
echo -n "Return value : "$RETVAL
echo
[ $RETVAL -eq 0 ] && touch $MONGO_LOCK_FILE
}
The commented-out daemon command represents the original script, but daemonizing under supervisord is not logical, so I use the su command to run the process in the foreground(?)
supervisord config
[program:mongos_router_script]
command=/etc/init.d/mongos start
user=root
autostart=true
autorestart=true
startretries=10
Result
supervisord log
INFO spawned: 'mongos_router_script' with pid 20367
INFO exited: mongos_router_script (exit status 1; not expected)
INFO gave up: mongos_router_script entered FATAL state, too many start retries too quickly
mongodb log
Nothing indicating an error; normal logs.
The process is seen running, but if killed it does not restart automatically.
Problem
How do I correctly configure the script / no-script options for running mongos under supervisord?
EDIT 1
Modified Command
sudo su -c "/usr/bin/mongos -f /etc/mongos.conf --pidfilepath=/var/run/mongodb/mongos.pid" -s /bin/bash mongod
This works if run individually on the command line as well as from the script, but not with supervisord.
EDIT 2
I added the following option to the mongos config file to force it to run in the foreground:
processManagement:
fork: false # fork and run in background
Now both the command line and the script properly run it in the foreground, but supervisord fails to launch it. At the same time, three processes show up when it is run from the command line or the script:
root sudo su -c /usr/bin/mongos -f /etc/mongos.conf --pidfilepath=/var/run/mongodb/mongos.pid -s /bin/bash mongod
root su -c /usr/bin/mongos -f /etc/mongos.conf --pidfilepath=/var/run/mongodb/mongos.pid -s /bin/bash mongod
mongod /usr/bin/mongos -f /etc/mongos.conf --pidfilepath=/var/run/mongodb/mongos.pid
EDIT 3
With the following supervisord config things are working fine. But I would still like to execute the script if possible, in order to set the ulimits:
[program:mongos_router]
command=/usr/bin/mongos -f /etc/mongos.conf --pidfilepath=/var/run/mongodb/mongos.pid
user=mongod
autostart=true
autorestart=true
startretries=10
numprocs=1
For mongos to run in the foreground, set the following option:
#how the process runs
processManagement:
fork: false # fork and run in background
With that and the supervisord config above, mongos is launched and stays under supervisord's control.
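If the only reason for keeping the init script is the ulimit calls, one option is a small wrapper that raises the limits and then runs mongos in the foreground, mirroring the su invocation from EDIT 1. A minimal sketch, assuming it is saved as /usr/local/bin/start-mongos.sh (a hypothetical path), marked executable, and referenced by command= with user=root so the ulimit calls can succeed (fork: false stays set in /etc/mongos.conf):
#!/bin/bash
# raise limits as root before dropping privileges
ulimit -f unlimited
ulimit -t unlimited
ulimit -v unlimited
ulimit -n 64000
ulimit -m unlimited
ulimit -u 64000
ulimit -l unlimited
# drop to the mongod user; the inner exec replaces the shell with mongos
# so it keeps running in the foreground
exec su -c "exec /usr/bin/mongos -f /etc/mongos.conf --pidfilepath=/var/run/mongodb/mongos.pid" -s /bin/bash mongod
If stopping through supervisord is unreliable with the extra su process in between, adding stopasgroup=true and killasgroup=true to the program section makes the stop signal reach mongos as well.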

Stopping supervisord: Shut down

I tried to start supervisord but I'm getting an error. Can anyone help? Thanks.
My /etc/init.d/supervisord file:
SUPERVISORD=/usr/local/bin/supervisord
SUPERVISORCTL=/usr/local/bin/supervisorctl
case $1 in
start)
echo -n "Starting supervisord: "
$SUPERVISORD
echo
;;
stop)
echo -n "Stopping supervisord: "
$SUPERVISORCTL shutdown
echo
;;
restart)
echo -n "Stopping supervisord: "
$SUPERVISORCTL shutdown
echo
echo -n "Starting supervisord: "
$SUPERVISORD
echo
;;
esac
Then I run these:
sudo chmod +x /etc/init.d/supervisord
sudo update-rc.d supervisord defaults
sudo /etc/init.d/supervisord start
And I get this:
Stopping supervisord: Shut down
Starting supervisord: /usr/local/lib/python2.7/dist-packages/supervisor/options.py:286: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
Error: Another program is already listening on a port that one of our HTTP servers is configured to use. Shut this program down first before starting supervisord.
For help, use /usr/local/bin/supervisord -h
Conf file (located at /etc/supervisord.conf):
[unix_http_server]
file=/tmp/supervisor.sock; (the path to the socket file)
[supervisord]
logfile=/tmp/supervisord.log ; (main log file;default $CWD/supervisord.log)
logfile_maxbytes=50MB ; (max main logfile bytes b4 rotation;default 50MB)
logfile_backups=10 ; (num of main logfile rotation backups;default 10)
loglevel=info ; (log level;default info; others: debug,warn,trace)
pidfile=/tmp/supervisord.pid ; (supervisord pidfile;default supervisord.pid)
nodaemon=false ; (start in foreground if true;default false)
minfds=1024 ; (min. avail startup file descriptors;default 1024)
minprocs=200 ; (min. avail process descriptors;default 200)
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=unix:///tmp/supervisor.sock; use a unix:// URL for a unix socket
[program:myproject]
command=/home/richard/envs/myproject_stage/bin/python /home/richard/webapps/myproject/manage.py run_gunicorn -b 127.0.0.1:8002 --log-file=/tmp/myproject_stage_gunicorn.log
directory=/home/richard/webapps/myproject/
user=www-data
autostart=true
autorestart=true
stdout_logfile=/tmp/myproject_stage_supervisord.log
redirect_stderr=true
First of all, type this in your console or terminal:
ps -ef | grep supervisord
You will get the pid of supervisord, something like this:
root 2641 12938 0 04:52 pts/1 00:00:00 grep --color=auto supervisord
root 29646 1 0 04:45 ? 00:00:00 /usr/bin/python /usr/local/bin/supervisord
If you get output like that, your pid is on the second line. Then, if you want to shut down your supervisord, you can do this:
kill -s SIGTERM 29646
Hope it's helpful. Ref: http://supervisord.org/running.html#signals
sudo unlink /tmp/supervisor.sock
This .sock file is defined by the file value in the [unix_http_server] section of /etc/supervisord.conf (the default is /tmp/supervisor.sock).
$ ps aux | grep supervisor
alexamil 54253 0.0 0.0 2506960 6440 ?? Ss 10:09PM 0:00.26 /usr/bin/python /usr/local/bin/supervisord -c supervisord.conf
so we can use:
$ pkill -f supervisord # kill it
This is what works for me. Run the following in the terminal (for Linux machines).
To check if the process is running:
sudo systemctl status supervisor
To stop the process:
sudo systemctl stop supervisor
Try running these commands
sudo unlink /run/supervisor.sock
and
sudo /etc/init.d/supervisor start
As of version 3.0a11, you could do this one-liner:
sudo kill -s SIGTERM $(sudo supervisorctl pid)
which hops on the back of the supervisorctl pid function.
There are many answers already available. I shall present a cleaner way to shut down supervisord.
By default, supervisord creates a file named supervisord.pid in the directory where the supervisord.conf file lives. That file contains the pid of the supervisord daemon. Read the pid from the file and kill the supervisord process.
However, you can configure where the supervisord.pid file should be created. Refer to this link to configure it: http://supervisord.org/configuration.html
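For example, with the conf shown earlier in this thread the pidfile is explicitly set to /tmp/supervisord.pid, so shutting down amounts to (adjust the path if your pidfile option differs):
sudo kill -s SIGTERM "$(cat /tmp/supervisord.pid)"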

memcached restart starts a new memcached and doesn't kill the old one

I'm running my rails app in production mode and in staging mode on the same server, in different folders. They both use memcache-client which requires memcached to be running.
As yet I haven't set up a deploy script, so I just deploy manually by sshing onto the server, going to the appropriate directory, updating the code, restarting memcached and then restarting unicorn (the processes which actually run the rails app). I restart memcached like this:
sudo /etc/init.d/memcached restart &
This starts a new memcached, but it doesn't kill the old one: check it out:
ip-<an-ip>:test.millionaire[subjects]$ ps afx | grep memcache
11176 pts/2 S+ 0:00 | \_ grep --color=auto memcache
10939 pts/3 R 8:13 \_ sudo /etc/init.d/memcached restart
7453 ? Sl 0:00 /usr/bin/memcached -m 64 -p 11211 -u nobody -l 127.0.0.1
ip-<an-ip>:test.millionaire[subjects]$ sudo /etc/init.d/memcached restart &
[1] 11187
ip-<an-ip>:test.millionaire[subjects]$ ps afx | grep memcache
11187 pts/2 T 0:00 | \_ sudo /etc/init.d/memcached restart
11199 pts/2 S+ 0:00 | \_ grep --color=auto memcache
10939 pts/3 R 8:36 \_ sudo /etc/init.d/memcached restart
7453 ? Sl 0:00 /usr/bin/memcached -m 64 -p 11211 -u nobody -l 127.0.0.1
[1]+ Stopped sudo /etc/init.d/memcached restart
ip-<an-ip>:test.millionaire[subjects]$ sudo /etc/init.d/memcached restart &
[2] 11208
ip-<an-ip>:test.millionaire[subjects]$ ps afx | grep memcache
11187 pts/2 T 0:00 | \_ sudo /etc/init.d/memcached restart
11208 pts/2 R 0:01 | \_ sudo /etc/init.d/memcached restart
11218 pts/2 S+ 0:00 | \_ grep --color=auto memcache
10939 pts/3 R 8:42 \_ sudo /etc/init.d/memcached restart
7453 ? Sl 0:00 /usr/bin/memcached -m 64 -p 11211 -u nobody -l 127.0.0.1
What might be causing it is that there's another memcached running - see the bottom line. I'm mystified as to where this one came from, and my instinct is to kill it, but I thought I'd better check with someone who actually knows more about memcached than I do.
Grateful for any advice - max
EDIT - solution
I figured this out after a bit of detective work with a colleague. In the rails console I typed CACHE.stats, which prints out a hash of values, including "pid", which I could see was set to the instance of memcached that wasn't started with memcached restart, i.e. this process:
7453 ? Sl 0:00 /usr/bin/memcached -m 64 -p 11211 -u nobody -l 127.0.0.1
The memcached control script (ie that defines the start, stop and restart commands), is in /etc/init.d/memcached
A line in this says
# Edit /etc/default/memcached to change this.
ENABLE_MEMCACHED=no
So I looked in /etc/default/memcached, which was also set to ENABLE_MEMCACHED=no.
So this was basically preventing memcached from being stopped and started. I changed it to ENABLE_MEMCACHED=yes, and then it would stop and start fine. Now when I stop and start memcached, it's the above process, the in-use memcached, that's stopped and started.
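In other words, the whole fix boils down to flipping that flag and restarting; a minimal sketch (the sed edit is just one way to change the line, editing the file by hand works the same):
sudo sed -i 's/^ENABLE_MEMCACHED=no/ENABLE_MEMCACHED=yes/' /etc/default/memcached
sudo /etc/init.d/memcached restart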
Try using:
killall memcached