supervisord autorestart max tries?

http://supervisord.org/configuration.html#program-x-section-values says you can use autorestart=true to restart a program on exit, but doesn't say how to set a maximum number of restarts (within startsecs) before giving up. Is there a way to do this? Note: I'm not talking about the first startup, but about the case where a program crashes after, say, running fine for 10 days.
According to the docs, autorestart doesn't care about startretries:
autorestart controls whether supervisord will autorestart a program if
it exits after it has successfully started up (the process is in the
RUNNING state).
supervisord has a different restart mechanism for when the process is
starting up (the process is in the STARTING state). Retries during
process startup are controlled by startsecs and startretries.
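To make the distinction concrete, here is a minimal sketch of how the relevant options sit together in a program section (the program name and command are illustrative, not from the question):
[program:example]
command=/usr/bin/example
; RUNNING state: restart the program whenever it exits after a successful start
autorestart=true
; STARTING state: the process must stay up this long to count as started
startsecs=5
; STARTING state: give up (enter FATAL) after this many failed start attempts
startretries=3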

You should use startretries as well. Example program configuration:
[program:consumer_example]
command=command example
process_name=%(program_name)s_%(process_num)02d
numprocs=1
autostart=true
autorestart=true
startretries=10
user=USERNAME
As you can see, I set startretries to 10; if you don't specify it in the program section, supervisord uses the default value (3).

I think what you need is the startretries parameter.
http://supervisord.org/configuration.html?highlight=startretries#program-x-section-example
Best regards

Related

Supervisor kills Prefect agent with SIGTERM unexpectedly

I'm using a Raspberry Pi 4, Debian v10 (buster).
I installed supervisor per the instructions here: http://supervisord.org/installing.html
Except I changed "pip" to "pip3" because I want to monitor running things that use the Python 3 kernel.
I'm using Prefect, and the supervisord.conf is running the program with command=/home/pi/.local/bin/prefect "agent local start" (I tried this with and without double quotes)
Looking at the supervisord.log file it seems like the Prefect Agent does start; I see the ASCII art that normally shows up when I start it from the command line. But then it shows it was terminated by SIGTERM; not expected, WARN received SIGTERM indicating exit request.
I saw this post: Supervisor gets a SIGTERM for some reason, quits and stops all its processes but I don't even have that 10Periodic file it references.
Anyone know why/how Supervisor processes are getting killed by sigterm?
It could be that your process exits immediately because you don’t have an API key in your command and this is required to connect your agent to the Prefect Cloud API. Additionally, it’s a best practice to always assign a unique label to your agents, below is an example with “raspberry” as a label.
You can also check the logs/status:
supervisorctl status
Here is a command you can try; you can also specify a directory in your supervisor config (I'm not sure whether the environment variables are needed, but I saw them used by another Raspberry Pi supervisor user):
[program:prefect-agent]
command=prefect agent local start -l raspberry -k YOUR_API_KEY --no-hostname-label
directory=/home/pi/.local/bin/prefect
user=pi
environment=HOME="/home/pi/.local/bin/prefect",USER="pi"
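After editing the config, you can reload it and watch the program (the log path below is the Debian/Raspberry Pi OS default and may differ on your system):
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl status prefect-agent
sudo tail -f /var/log/supervisor/supervisord.log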

How to set PIDFile for systemd when main process start multiple child and exit?

Environment: Ubuntu 16.04, daemon programmed in C, using systemd for process management.
So I have the unit file as:
[Unit]
Description=Fantastic Service
After=network.target
[Service]
Restart=always
Type=forking
ExecStart=/opt/fan/tastic
[Install]
WantedBy=multi-user.target
And in my tastic.c code, it basically fork()s X children, each listening with SO_REUSEPORT, and then the main process exits, leaving the children to handle requests.
With the above setup it works fine, and I get the expected behavior.
However, if I put PIDFile in the service unit file, I get an error that the PID provided by my application is non-existent, which it is, since my main process exits after starting the requested number of children.
Now the systemd documentation clearly states that if you use Type=forking you should provide PIDFile, but how am I supposed to provide a single PID file when there are multiple children and the main parent process exits once the children start?
Am I missing something?
As you found, the system works fine without PIDFile= in your case. The docs recommend the use of PIDFile=, but I believe that's for the case when there is a single main process, which doesn't apply to your case.
Also see man systemd.kill which explains how processes will be killed. The default is "control-group", which kills "all remaining processes in the control group".
So by default, systemd is going to clean up all those child processes at "stop" time for you, which is what you want.
For a service that does keep a main process, KillMode=process might be wanted, and in that case setting PIDFile= may help with that, but this does not apply to your case.
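For illustration, a hypothetical variant of the unit for a daemon that does keep a single main process and writes its PID to a file (the PID file path and the daemon behavior are assumptions, not from the question):
[Unit]
Description=Fantastic Service (single main process variant)
After=network.target
[Service]
Restart=always
Type=forking
ExecStart=/opt/fan/tastic
# The daemon must write the PID of its surviving main process here
PIDFile=/run/tastic.pid
# Only the main process receives SIGTERM at stop time
KillMode=process
[Install]
WantedBy=multi-user.target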

systemd `systemctl stop` aggressively kills subprocesses

I've a daemon-like process that starts two subprocesses (and one of the subprocesses starts ~10 others). When I systemctl stop my process, the child subprocesses appear to be 'aggressively' killed by systemd, which doesn't give my process a chance to clean up.
How do I get systemctl stop to quit the aggressive kill and allow my process to orchestrate an orderly clean-up?
I tried TimeoutSec=30 to no avail.
KillMode= defaults to control-group. That means every process of your service is killed with SIGTERM.
You have two options:
Handle SIGTERM in each of your processes and shut down within TimeoutStopSec (which defaults to 90 seconds)
If you really want to delegate the shutdown to your main process, set KillMode=mixed. SIGTERM will be sent to the main process only. Then, again, shut down within TimeoutStopSec. If you do not shut down within TimeoutStopSec, systemd will send SIGKILL to all your processes.
Note: I suggest using KillMode=mixed in option 2 instead of KillMode=process, as the latter would send the final SIGKILL only to your main process, which means your sub-processes would not be killed if they've locked up.
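A minimal sketch of option 2, assuming a hypothetical daemon at /usr/local/bin/mydaemon that terminates its children itself when it receives SIGTERM:
[Service]
ExecStart=/usr/local/bin/mydaemon
# Send SIGTERM only to the main process, letting it stop its own children
KillMode=mixed
# Time allowed for an orderly shutdown before systemd sends SIGKILL to everything
TimeoutStopSec=60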
A late (possible) answer, but since I googled for weeks with a similar issue and found nothing, I figured I'd add my solution.
My mistake was that I ran the systemd unit as root and switched (using sudo) to "the correct" user in the start script (inherited from a SysVinit script).
That starts the processes in the user.slice, which is killed mercilessly on shutdown. When I changed the unit file to run as the correct user (User=myuser) and removed sudo from the start script, the processes start in the system.slice and are handled properly on shutdown.

mongod main process killed by KILL signal

One of the mongo nodes in the replica set went down today. I couldn't find out what happened, but when I checked the logs on the server, I saw the message 'mongod main process killed by KILL signal'. I tried googling for more information but failed. Basically, I'd like to know what the KILL signal is, who triggered it, and possible causes/fixes.
Mongo version 3.2.10 on Ubuntu.
The KILL signal (SIGKILL) means that the app is killed instantly and the process has no chance to exit cleanly. It is issued by the system when something goes very wrong.
If this is the only log line left, the process was killed abruptly. Most likely your system ran out of memory and the kernel's OOM killer terminated mongod (I've had this problem with other processes before). You could check whether swap is configured on your machine (using swapon -s), but perhaps you should consider adding more memory to your server; swap would just keep things from breaking outright, as it is very slow.
Another thing worth looking at is the free disk space left and the syslog (/var/log/syslog).
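If the OOM killer is the suspect, these standard commands can help confirm it (the syslog path is the Ubuntu default):
# look for OOM killer activity in the kernel log
dmesg | grep -i 'killed process'
# the syslog usually records the same event
grep -i 'out of memory' /var/log/syslog
# check configured swap and free disk space
swapon -s
df -h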

How to force supervisord to stop a process in BACKOFF status

When you start a process using supervisord, it is in the STARTING state; if it runs into trouble, it goes into the BACKOFF state (when autorestart is set to true).
I don't want to wait for all the startretries to be attempted; I want to stop the restarting process manually using supervisorctl. The only way I have found to do so is to stop the entire supervisord service and start it again (every process goes into the STOPPED state if autostart is off).
Is there a better way to force the STOPPED state from the BACKOFF state? I have other processes managed by supervisord that I don't want to stop.
If I try to stop it with
supervisorctl stop process
I get
FAILED: attempted to kill process with sig SIGTERM but it wasn't running
If I try to start it with
supervisorctl start process
I get
process: ERROR (already started)
Of course I could disable autorestart, but it can be useful; a workaround is to limit startretries. Is there a better solution?
Hey, this may help you:
When an autorestarting process is in the BACKOFF state, it will be
automatically restarted by supervisord. It will switch between
STARTING and BACKOFF states until it becomes evident that it cannot be started because the number of startretries has exceeded
the maximum, at which point it will transition to the FATAL state.
Each start retry will take progressively more time.
So you don't need to stop the BACKOFF process manually. If you do not want to wait too long, it is better to set startretries to a small number.
see more info here: http://supervisord.org/subprocess.html
Good luck!
Use the following command to force supervisord to stop a process in the BACKOFF state:
supervisorctl stop <gname>:*
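For example, assuming a hypothetical program section named [program:myworker] (the group name defaults to the program name):
supervisorctl stop myworker:*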