mongod main process killed by KILL signal - mongodb

One of the mongo nodes in the replica set went down today. I couldn't find out what happened, but when I checked the logs on the server I saw this message: 'mongod main process killed by KILL signal'. I tried googling for more information but found nothing useful. Basically, I'd like to know what the KILL signal is, who triggered it, and the possible causes and fixes.
Mongo version 3.2.10 on Ubuntu.

The KILL signal (SIGKILL) terminates the process immediately. It cannot be caught or ignored, so the process gets no chance to exit cleanly. It is sent either by the kernel (typically the out-of-memory killer) or by something with sufficient privileges, such as an administrator running kill -9.
If this is the only log entry left, the process was killed abruptly. Most likely your system ran out of memory (I've had this problem with other processes before). You could check whether swap is configured on your machine (using swapon -s), but you should consider adding more memory to the server: swap is very slow, so it only keeps the process from being killed rather than solving the problem.
It is also worth checking the free disk space and the syslog (/var/log/syslog).
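To confirm the out-of-memory theory, you can search the syslog for the kernel's OOM-killer messages. The following is a minimal sketch, not a definitive tool: the function name find_oom_kills is made up for illustration, and the regular expression assumes the common "Kill process <pid> (<name>)" / "Killed process <pid> (<name>)" wording, which can differ between kernel versions.

```python
import re

# Assumed message shapes (they vary by kernel version):
#   "Out of memory: Kill process 1234 (mongod) score 900 or sacrifice child"
#   "Killed process 1234 (mongod) total-vm:..."
OOM_PATTERN = re.compile(r"Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines, process_name=None):
    """Return unique (pid, name) pairs for OOM kills found in the log lines.

    If process_name is given, only kills of that process are returned.
    """
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if not match:
            continue
        entry = (int(match.group(1)), match.group(2))
        if process_name is not None and entry[1] != process_name:
            continue
        # Older kernels log two lines per kill, so skip duplicates.
        if entry not in hits:
            hits.append(entry)
    return hits
```

For example, `find_oom_kills(open("/var/log/syslog", errors="replace"), "mongod")` would list the mongod processes the kernel killed, provided the log has not been rotated since the incident.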

Related

how to check if postgres gracefully shutdown

I have some code that performs a pg_rewind if it detects that the standby is out of sync. However, I saw in the documentation that it requires the database to have been shut down gracefully before the command is run. I would like to be able to detect whether postgres did not shut down gracefully so I can:
start and stop postgres ahead of time so pg_rewind will work
know if I should run some checks on my data to see if it is ok
I'm assuming that a non-graceful shutdown means either that postgres crashed, that the server crashed, or that it was told to shut down immediately, so it would be nice to know if something bad happened and I should do something like run pg_checksums.
pg_rewind and pg_checksums both require a cleanly shut down server.
You could probably try and replicate PostgreSQL's own checks that normally lead to a "database system was not properly shut down" entry that appears in the server log after startup. But you can just as well simply use those - and since they require a startup, you can also let it attempt to recover and perform a fresh, clean, graceful smart shutdown. To avoid any immediate connection attempts, you could use a different port for that cycle.
For a regular server, checking pg_is_in_recovery(); after startup would be some indication of a non-graceful shutdown taking place earlier, which causes the db to enter recovery mode on the next startup. However, by design a standby always stays in recovery mode until promoted, so that won't mean the same thing here.
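If you do want to check the state without starting the server, one option is to read the "Database cluster state" line that pg_controldata prints for a stopped data directory. The sketch below is only an illustration under assumptions: the helper names are invented, the output is assumed to be in English (an untranslated locale), and "shut down" and "shut down in recovery" are treated as the clean states.

```python
import subprocess

# States assumed to indicate a clean shutdown (primary or standby).
CLEAN_STATES = {"shut down", "shut down in recovery"}


def parse_cluster_state(controldata_output):
    """Extract the 'Database cluster state' value from pg_controldata output."""
    for line in controldata_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Database cluster state":
            return value.strip()
    return None


def was_cleanly_shut_down(data_dir):
    """Run pg_controldata on a stopped cluster and check its recorded state."""
    result = subprocess.run(
        ["pg_controldata", data_dir],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_cluster_state(result.stdout) in CLEAN_STATES
```

A state such as "in production" or "in archive recovery" on a server that is not running would then suggest a crash or an immediate shutdown.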

Recurring linux process consuming cpu

On my openSUSE server, I keep seeing this process come up.
I've tried kill -9 and it comes back with a new process id within 30 seconds.
htop lists it as "bash", while top lists it as "xs".
The attached screenshot is what I could get from ps.
It stays after multiple reboots.
It doesn't seem like a normal zombie process to me.
Wondering if anyone has any advice?
Thanks

pg_create_logical_replication_slot hanging indefinitely due to old walsender process

I am testing logical replication between two PostgreSQL 11 databases for use in production (I was able to set it up thanks to this answer: PostgreSQL logical replication - create subscription hangs), and it worked well.
Now I am testing the scripts and procedure that would set it up automatically on the production databases, but I am facing a strange problem with logical replication slots.
I had to restart the logical replica because of some configuration changes that required a restart, which could of course also happen on replicas in the future. But the logical replication slot on the master did not disconnect, and it is still active for a certain PID.
I dropped the subscription on master (I am still only testing) and tried to repeat the whole process with a new logical replication slot, but ran into a strange situation.
I cannot create a new logical replication slot with a new name. The process using the old logical replication slot is still active and shows wait_event_type=Lock and wait_event=transaction.
When I use pg_create_logical_replication_slot to create a new logical replication slot, I get a similar situation. The new slot is created (I can see it in pg_catalog), but it is marked as active for the PID of the session that issued the command, and the command hangs indefinitely. When I check the processes, I can see this command active with the same wait values, Lock/transaction.
I tried setting the "lock_timeout" parameter in postgresql.conf and reloading the configuration, but it did not help.
Killing the old hanging process would most likely bring down the whole postgres because it is a "walsender" process. It is still visible in the process list with the IP of the replica and the status "idle waiting".
I tried to find some parameter(s) that could force postgres to stop this walsender, but changing wal_keep_segments or wal_sender_timeout made no difference. I even stopped the replica for a longer time, with no effect.
Is there a way to resolve this situation without restarting the whole postgres? For example, forcing a timeout for the walsender or for the transaction lock?
If something like this happens in production, I will not be able to restart or use any other "brute force". Thanks...
UPDATE:
The "walsender" process "died out" after some time, but the log does not show anything about it, so I do not know exactly when it happened. I can only guess that it depends on the tcp_keepalives_* parameters. The default on Debian 9 is to keep an idle connection for 2 hours. So I set these parameters in postgresql.conf and will see what happens in the following tests.
Strangely enough, today everything works without any problems, and no matter how I try to simulate yesterday's problems, I cannot. Maybe network communication problems in the cloud datacenter were involved; we experienced occasional connection timeouts to other databases too.
So I really do not know the answer except "wait until the walsender process on master dies", which can most likely be influenced by the tcp_keepalives_* settings. I therefore recommend setting them to reasonable values in postgresql.conf, because the OS defaults are usually too large.
We actually use them on our big analytical databases (set both in PostgreSQL and in the OS) because of similar problems. Golang and nodejs programs calculating statistics occasionally failed to recognize that a database session had ended or died, and hung until the OS closed the connection after 2 hours (the default on Debian). All of this always seemed to be connected with network communication problems. With proper tcp_keepalives_* settings, the reaction to such problems is much quicker.
After the old walsender process dies on master, you can repeat all the steps and it should work. So it looks like I just had bad luck yesterday...
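To get a feeling for what "reasonable values" means, here is a rough back-of-the-envelope sketch of how long a dead peer can go unnoticed for a given keepalive configuration. The function name is invented for illustration, and the formula is the usual approximation (idle time plus one interval per unanswered probe), not an exact model of the TCP stack. The three arguments correspond to tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count.

```python
def keepalive_detection_seconds(idle, interval, count):
    """Approximate worst-case time to detect a dead peer on an idle connection.

    idle:     seconds of inactivity before the first probe is sent
    interval: seconds between unanswered probes
    count:    number of unanswered probes before the connection is dropped
    """
    return idle + interval * count


# Common Linux defaults (7200 s idle, 75 s interval, 9 probes):
print(keepalive_detection_seconds(7200, 75, 9))  # 7875, a bit over 2 hours

# Hypothetical tighter settings, chosen only as an example:
print(keepalive_detection_seconds(60, 10, 6))  # 120
```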

cf stop command does not perform graceful shutdown on bluemix

I have a node app in bluemix which holds some transaction cache in memory and I would like to flush this cache to DB before the application goes down. So I have the appropriate event handlers to intercept SIGTERM/SIGINT signals and all works fine from my laptop, however, it seems like the cf stop command does not perform graceful shutdown.
Unfortunately, there is no clear documentation on this topic. In one place in the Cloud Foundry app-lifecycle doc they do mention that SIGTERM is issued first, followed by a 10-second wait, etc., but I'm not seeing this happen. Probably a bug on their side. https://docs.cloudfoundry.org/devguide/deploy-apps/app-lifecycle.html
Has anyone noticed this issue, and is there a workaround?
CF is sending the SIGTERM first but because of how the app is started by other processes, it's not being correctly propagated to your app.
As a workaround, disable App Management by setting the CF environment variable BLUEMIX_APP_MGMT_INSTALL=false and prefix your app's start command in your package.json file with 'exec' (e.g. exec node app.js).

systemd `systemctl stop` aggressively kills subprocesses

I've a daemon-like process that starts two subprocesses (and one of the subprocesses starts ~10 others). When I systemctl stop my process the child subprocesses appear to be 'aggressively' killed by systemctl - which doesn't give my process a chance to clean up.
How do I get systemctl stop to quit the aggressive kill and thus to allow my process to orchestrate an orderly clean up?
I tried timeoutSec=30 to no avail.
KillMode= defaults to control-group. That means every process of your service is killed with SIGTERM.
You have two options:
Handle SIGTERM in each of your processes and shut down within TimeoutStopSec (which defaults to 90 seconds).
If you really want to delegate the shutdown to your main process, set KillMode=mixed. SIGTERM will be sent to the main process only, which must then shut everything down within TimeoutStopSec. If it does not, systemd will send SIGKILL to all your processes.
Note: I suggest using KillMode=mixed in option 2 instead of KillMode=process, as the latter would send the final SIGKILL only to your main process, which means your sub-processes would not be killed if they've locked up.
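For the second option, the main process has to catch SIGTERM itself and then stop its children in an orderly way. The following Python sketch shows one possible shape of that; the Supervisor class and its method names are hypothetical, the real daemon may be written in another language, and the timeout passed to shutdown() is assumed to be comfortably below TimeoutStopSec.

```python
import signal
import subprocess


class Supervisor:
    """Hypothetical main process that owns a set of child processes."""

    def __init__(self):
        self.children = []
        self.stopping = False

    def spawn(self, argv):
        child = subprocess.Popen(argv)
        self.children.append(child)
        return child

    def handle_sigterm(self, signum, frame):
        # Only record the request here; the main loop does the real work.
        self.stopping = True

    def install(self):
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def shutdown(self, timeout):
        """Ask every child to stop, then force-kill any that ignore the request."""
        for child in self.children:
            if child.poll() is None:
                child.terminate()  # sends SIGTERM on POSIX
        for child in self.children:
            try:
                child.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                child.kill()  # last resort: SIGKILL
                child.wait()
```

The main loop would call install() once at startup, check the stopping flag periodically, and call shutdown() before exiting.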
A late (possible) answer, but since I googled for weeks about a similar issue and found nothing, I figured I'd add my solution.
My error was that I ran the systemd unit as root and switched (using sudo) to "the correct" user in the start script (inherited from a SysVinit script).
That starts the processes in user.slice, which is killed mercilessly on shutdown. When I changed the unit file to run as the correct user (User=myuser) and removed sudo from the start script, the processes started in system.slice and were handled properly on shutdown.
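To check which slice a running process actually ended up in, you can look at /proc/<pid>/cgroup. The sketch below parses that file's "hierarchy-ID:controller-list:path" lines; the function name is made up for illustration, and it assumes the slice shows up either in the cgroup v2 entry (empty controller list) or in systemd's own v1 hierarchy (name=systemd).

```python
def top_level_slice(cgroup_text):
    """Return the first *.slice component of a process's systemd cgroup path.

    cgroup_text is the content of /proc/<pid>/cgroup.
    Returns None if no slice can be determined.
    """
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        _hierarchy_id, controllers, path = parts
        # cgroup v2 uses an empty controller list; on v1, systemd tracks
        # services in the hierarchy named "name=systemd".
        if controllers not in ("", "name=systemd"):
            continue
        for component in path.strip("/").split("/"):
            if component.endswith(".slice"):
                return component
    return None
```

For a given PID, `top_level_slice(open(f"/proc/{pid}/cgroup").read())` returning "user.slice" instead of "system.slice" would point to the problem described above.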