Restart nodes in state down - centos

after a power outage my nodes went to state down
sinfo -a
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
partMain up infinite 4 down* node[001-004]
part1* up infinite 3 down* node[002-004]
part2 up infinite 1 down* node001
I do these commands
/etc/init.d/slurm stop
/etc/init.d/slurm start
sinfo -a
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
partMain up infinite 4 down node[001-004]
part1* up infinite 3 down node[002-004]
part2 up infinite 1 down node001
how could I restart my nodes ?
sinfo -R
REASON USER TIMESTAMP NODELIST
Not responding root 2019-07-23T08:40:25 node[001-004]
$ scontrol update nodename=node001 state=idle
$ scontrol update nodename=node[001-004] state=resume
# the state changes to idle* but for a few seconds then returns to down*
$service --status-all | grep 'slurm'
slurmctld (pid 24000) is running... slurmdbd (pid 4113) is running...
$systemctl status -l slurm
● slurm.service - LSB: slurm daemon management
Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2019-07-24 13:45:38 CEST; 257ms ago
Docs: man:systemd-sysv-generator(8)
Process: 30094 ExecStop=/etc/rc.d/init.d/slurm stop (code=exited, status=1/FAILURE)
Process: 30061 ExecStart=/etc/rc.d/init.d/slurm start (code=exited, status=0/SUCCESS)
Main PID: 30069 (code=exited, status=1/FAILURE)

Try with this after initiating the daemons:
scontrol update nodename=node001 state=idle

See the reason why they are marked as down with sinfo -R. Most probably, they will be listed as "unexpectedly rebooted". You can resume them with
scontrol update nodename=node[001-004] state=resume
The ReturnToService parameter of slurm.conf controls whether or not the compute nodes are active when they wake up from an unexpected reboot.

Related

Mongod Process starts but Systemctl status shows failed

I have created a mongod configuration in /home/cluster1.conf with the same port I used previously for etc/mongod.conf.
I used to run etc/mongod.conf as a sudo user but expect to run cluster1.conf as user.
when I run cluster1.conf as user, the process starts. I accessed it from mongo shell and did operations. But when I'm trying to access it from another vm from the same VPC, it failed.
I checked for systemctl status mongod and it shows service is not running.
Why does systemctl status shows it not running when it really does? How can I resolve this to create a replica set??
"when I run cluster1.conf as user, the process starts." does not make much sense. cluster1.conf is a configuration file, it does not start anything.
When you run systemctl status mongod, then it typically shows
● mongod.service - MongoDB Database Server
Loaded: loaded (/etc/systemd/system/mongod.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2021-08-27 09:44:26 CEST; 2 weeks 3 days ago
Docs: https://docs.mongodb.org/manual
Main PID: 38710 (mongod)
Tasks: 28
Memory: 270.8M
CGroup: /system.slice/mongod.service
└─38710 /usr/bin/mongod -f /etc/mongod.conf
Check the service file /etc/systemd/system/mongod.service there you see entry like
Environment="OPTIONS=-f /etc/mongod.conf"
EnvironmentFile=-/etc/sysconfig/mongod
ExecStart=/usr/bin/mongod $OPTIONS
Which means, config file /etc/mongod.conf is used (rather than /home/cluster1.conf)

systemd service not starting on boot, starts when i restart it

I have made this service file to start a python script when my raspberry pi (4) boots up:
/etc/systemd/system/plants.service
[Unit]
Description=plant-sender
After=network.target
[Service]
Type=simple
User=root
Group=root
WorkingDirectory=/home/theo/Repos/plants-monitor/remote
ExecStart=/usr/bin/python main.py
Restart=on-failure
[Install]
WantedBy=multi-user.target
However, once the pi is on, I run sudo systemctl status plants, and get:
* plants.service - plant-sender
Loaded: loaded (/etc/systemd/system/plants.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2020-03-30 20:22:43 EDT; 1min 45s ago
Process: 323 ExecStart=/usr/bin/python main.py (code=exited, status=1/FAILURE)
Main PID: 323 (code=exited, status=1/FAILURE)
Mar 30 20:22:43 arpi systemd[1]: plants.service: Scheduled restart job, restart counter is at 5.
Mar 30 20:22:43 arpi systemd[1]: Stopped plant-sender.
Mar 30 20:22:43 arpi systemd[1]: plants.service: Start request repeated too quickly.
Mar 30 20:22:43 arpi systemd[1]: plants.service: Failed with result 'exit-code'.
Mar 30 20:22:43 arpi systemd[1]: Failed to start plant-sender.
But, after running sudo systemctl restart plants, the service starts up and everything is fine.
If it doesn't start on boot but does on systemctl restart, I'd be looking at whether /home/theo/Repos/plants-monitor/remote is mounted at that point.
There may be something automounting or home-mounting your home directory when you log in.
If so, you could change the working directory to something that exists always, even if only a test.
Additionally, using journalctl -n 9999 -u plants will get you more log messages, so you can see why it's failing, rather than just seeing the "tried too many times, giving up" messages.

Fatal error when starting orion context broker

My orion context broker does not start and when I enter the command
/etc/init.d/contextBroker start I get this message
[root#context-broker ~]# /etc/init.d/contextBroker start
Starting contextBroker (via systemctl): Job for contextBroker.service failed because the control process exited with error code. See "systemctl status contextBroker.service" and "journalctl -xe" for details.
[FAILED]
The systemctl status contextBroker.service commannd gives this message
[root#context-broker ~]# systemctl status contextBroker.service
● contextBroker.service - LSB: run contextBroker
Loaded: loaded (/etc/rc.d/init.d/contextBroker; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2019-05-24 11:38:50 UTC; 1min 11s ago
Docs: man:systemd-sysv-generator(8)
Process: 9782 ExecStart=/etc/rc.d/init.d/contextBroker start (code=exited, status=1/FAILURE)
May 24 11:38:47 context-broker.novalocal systemd[1]: Starting LSB: run contextBroker...
May 24 11:38:48 context-broker.novalocal contextBroker[9782]: contextBroker is stopped
May 24 11:38:48 context-broker.novalocal contextBroker[9782]: Starting...
May 24 11:38:48 context-broker.novalocal su[9788]: (to orion) root on none
May 24 11:38:50 context-broker.novalocal contextBroker[9782]: Starting contextBroker... cat: /var/run/contextBroker/contextBroker.pid...irectory
May 24 11:38:50 context-broker.novalocal systemd[1]: contextBroker.service: control process exited, code=exited status=1
May 24 11:38:50 context-broker.novalocal contextBroker[9782]: pidfile not found[FAILED]
May 24 11:38:50 context-broker.novalocal systemd[1]: Failed to start LSB: run contextBroker.
May 24 11:38:50 context-broker.novalocal systemd[1]: Unit contextBroker.service entered failed state.
May 24 11:38:50 context-broker.novalocal systemd[1]: contextBroker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
Also the /tmp/contextBroker.log file looks like this
time=2019-05-24T11:41:12.971Z | lvl=FATAL | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=rest.cpp[1753]:restStart | msg=Fatal Error (error starting REST interface)
I checked if mongodb is running and it is running correctly.
UPDATE
With some searching I realised I had to kill the pid of the process and after I did that the service successfully starts according to the messaags but I find it doesnt actually work. When I ask for the status I get the following:
[root#context-broker centos]# /etc/init.d/contextBroker status
● contextBroker.service - LSB: run contextBroker
Loaded: loaded (/etc/rc.d/init.d/contextBroker; bad; vendor preset: disabled)
Active: active (exited) since Sun 2019-05-26 18:34:49 UTC; 4min 56s ago
Docs: man:systemd-sysv-generator(8)
Process: 16295 ExecStop=/etc/rc.d/init.d/contextBroker stop (code=exited, status=0/SUCCESS)
Process: 16319 ExecStart=/etc/rc.d/init.d/contextBroker start (code=exited, status=0/SUCCESS)
May 26 18:34:47 context-broker.novalocal systemd[1]: Starting LSB: run contextBroker...
May 26 18:34:47 context-broker.novalocal contextBroker[16319]: contextBroker is stopped
May 26 18:34:47 context-broker.novalocal contextBroker[16319]: Starting...
May 26 18:34:47 context-broker.novalocal su[16325]: (to orion) root on none
May 26 18:34:49 context-broker.novalocal systemd[1]: Started LSB: run contextBroker.
May 26 18:34:49 context-broker.novalocal contextBroker[16319]: Starting contextBroker... [ OK ]
The log file has the same message as previously.
With some searching again I believe the cause is that the service doesnt have a daemon(??). So if that is the case how do I add one?
Normally when getting the error starting REST interface, it's because there is already a broker running, which means the port is already taken. Make sure there is no broker already running.

Should systemctl show output when start/stop fails?

I've googled every variation of this question I can think of, but I'm just getting questions about failed services, not about how systemctl treats them. I have a service that I've been running as an init.d script. We're using systemctl now, fine. I created a service file that's a lightly modified version of the file automatically generated by systemd-sysv-generator. For ExecStart and ExecStop, it calls a bash script that returns 0 if the start/stop was successful, and non-zero if it was not.
I understand that there's no output from "systemctl start/stop" if it was successful. But I also don't get any output if either of the calls failed. The return code of the systemctl start/stop command is always 0 even if the return code of the source script is not. It's very clear it did fail because it shows as failed when I run the status command.
Is that expected behavior? Should it not give any indication that something failed unless you run a separate status command? And if that's not how it should behave, how can I make it indicate that a failure occurred?
Service file below.
[Unit]
SourcePath=/my/service/script.sh
[Service]
Type=forking
Restart=no
TimeoutSec=5min
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
ExecStart=/my/service/script.sh start
ExecStop=/my/service/script.sh stop
Working fine here, CentOS 7. Maybe double check that script.sh is really returning non zero?
$pwd
/etc/systemd/system
$cat me.service
[Unit]
[Service]
Type=forking
Restart=no
TimeoutSec=5min
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
ExecStart=/etc/systemd/system/die.sh
$cat die.sh
#!/bin/bash
echo "dying"
exit 1
$sudo systemctl start me
Job for me.service failed because the control process exited with error code. See "systemctl status me.service" and "journalctl -xe" for details.
$sudo systemctl status me.service
● me.service
Loaded: loaded (/etc/systemd/system/me.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2019-09-12 17:55:02 GMT; 8s ago
Process: 19758 ExecStart=/etc/systemd/system/die.sh (code=exited, status=1/FAILURE)
Sep 12 17:55:02 dpsdev-wkr01 systemd[1]: Starting me.service...
Sep 12 17:55:02 dpsdev-wkr01 die.sh[19758]: dying
Sep 12 17:55:02 dpsdev-wkr01 systemd[1]: me.service: control process exited, code=exited status=1
Sep 12 17:55:02 dpsdev-wkr01 systemd[1]: Failed to start me.service.
Sep 12 17:55:02 dpsdev-wkr01 systemd[1]: Unit me.service entered failed state.
Sep 12 17:55:02 dpsdev-wkr01 systemd[1]: me.service failed.

postgresql could not properly start exited with active (exited)

I installed posgresql from digitalocean and in the end of installation prints the below command in terminal
/usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start
I tried to run it with sudo root user and also with switching to postgres user but gives me below error
waiting for server to start..../bin/sh: 1: cannot create logfile:
Permission denied stopped waiting pg_ctl: could not start server
but when i check the status it says
● postgresql.service - PostgreSQL RDBMS Loaded: loaded
(/lib/systemd/system/postgresql.service; enabled; vendor preset:
enabled) Active: active (exited) since Thu 2018-05-31 13:11:18 UTC;
56s ago Main PID: 3698 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 2362) CGroup: /system.slice/postgresql.service
May 31 13:11:18 staging systemd1: Starting PostgreSQL RDBMS... May
31 13:11:18 staging systemd1: Started PostgreSQL RDBMS.
Status is not running except is exited.what the above command do and how can i run it ? I haven't faced it in previous versions
The idea is that you supply your actual log file instead of logfile, but I recommend that you configure logging properly in postgresql.conf and use pg_ctl without the -l option.
Set logging_collector to on.
Set log_filename to postgresql-%a.log.
Set log_rotation_size to 0.
Set log_truncate_on_rotation to on.
Then you'll get the log files in the log subdirectory of your PostgreSQL data directory, and they will be rotated on a weekly basis.