I use Slurm, and I want my slurmd daemon, managed by systemd, to wait for my NFS mount.
This is my slurmd.service:
[Unit]
Description=Slurm node daemon
After=network.target nfs-client.target nfs-client.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
I want my service to start only once my NFS share is fully mounted. The share is mounted at /nfs.
Because my network is slow and the NFS share is large, the mount takes about a minute to complete, and Slurm needs to write files into the /nfs/slurm folder.
Currently, when CentOS boots and the slurmd daemon starts, I get the error "/nfs/slurm: no such file or directory".
I tried the ConditionPathExists parameter and TimeoutStartSec, but neither works: the daemon starts anyway and I get the same error.
Thanks in advance for your help.
systemd has RequiresMountsFor=.
You can add the following line to the [Unit] section of slurmd.service:
RequiresMountsFor=/nfs
Keep in mind that if you are using the suspend and resume features, your ResumeTimeout must be greater than the node resume time plus the NFS mount time.
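With that line in place, the [Unit] section of the service file from the question would look something like this (RequiresMountsFor= implicitly adds Requires= and After= dependencies on the mount units needed to access the listed path):
[Unit]
Description=Slurm node daemon
After=network.target nfs-client.target nfs-client.service
ConditionPathExists=/etc/slurm/slurm.conf
RequiresMountsFor=/nfs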
Write a script that checks whether the NFS mount is available and, if not, sleeps and retries for up to X seconds. Then call this script from ExecStartPre= in the systemd service file.
$ cat /usr/local/bin/checkNFSMount
#!/bin/bash
# Poll for the NFS-backed directory, waiting up to 60 seconds in total.
for attempt in 1 2 3 4 5 6; do
    if [ -d /nfs/slurm ]; then
        exit 0
    fi
    sleep 10
done
# The mount never appeared; fail so the service does not start.
exit 1
In the systemd service file:
ExecStartPre=/usr/local/bin/checkNFSMount
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
Read about ExecStartPre= in the systemd.service(5) man page.
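Also remember to make the script executable, or ExecStartPre= will fail:
chmod +x /usr/local/bin/checkNFSMount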
Related
On a virtual machine running CentOS 7, using root privileges, I want to create a service that launches a Python 3.9 web service and restarts it every hour.
Here's what I did:
creating a directory /etc/systemd/system/dlweb_doc_webservice.service.d
creating a file dlweb_db_generate_report.service like this:
[Unit]
Description=dlweb_db_generate_report_webservice
After=network-online.target
Wants=network-online.target systemd-networkd-wait-online.service
[Service]
Type=simple
ExecStart=/usr/local/bin/python3.9 /opt/scripts/dlweb_db_generate_report_webservice/dlweb_db_generate_report.py
Restart=on-failure
RestartSec=5s
StartLimitIntervalSec=3600
StartLimitBurst=5
[Install]
WantedBy=multi-user.target
running systemctl daemon-reload
running systemctl --type=service --all doesn't display my service
and
running systemctl enable dlweb_db_generate_report_webservice returned Failed to execute operation: No such file or directory
What did I miss or do wrong?
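One thing that stands out (a guess, based on how systemd resolves unit names): a *.service.d/ directory is only a drop-in override directory for an already-existing unit; the unit file itself must sit directly under /etc/systemd/system/, and the name given to systemctl enable must match the file name. A sketch of commands that would line those up, assuming the unit is meant to be called dlweb_db_generate_report_webservice:
mv /etc/systemd/system/dlweb_doc_webservice.service.d/dlweb_db_generate_report.service /etc/systemd/system/dlweb_db_generate_report_webservice.service
systemctl daemon-reload
systemctl enable dlweb_db_generate_report_webservice.service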
I have a Kubernetes cluster, and telegraf is running on each node. Telegraf collects data and stores it in InfluxDB. Now I want to run a second instance of telegraf that will use one of the pods' network namespaces, collect stats from an Apache server running inside that pod, and store them in the same InfluxDB.
I followed this link (https://community.influxdata.com/t/multiple-telegraf-configs/245/6) but couldn't figure out how to implement this in my setup.
I am using Debian GNU/Linux 9 (stretch) and telegraf_1.12.5-1.
I created two service files as follows:
cat /usr/lib/telegraf/scripts/telegraf.service
[Unit]
Description=The plugin-driven server agent for reporting metrics into InfluxDB
Documentation=https://github.com/influxdata/telegraf
After=network.target
[Service]
EnvironmentFile=-/etc/default/telegraf
User=telegraf
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartForceExitStatus=SIGPIPE
KillMode=control-group
[Install]
WantedBy=multi-user.target
cat /usr/lib/telegraf/scripts/telegraf_xyz.service
[Unit]
Description=The plugin-driven server agent for reporting metrics into InfluxDB
Documentation=https://github.com/influxdata/telegraf
After=network.target
[Service]
EnvironmentFile=-/etc/default/telegraf_xyz
User=telegraf
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartForceExitStatus=SIGPIPE
KillMode=control-group
[Install]
WantedBy=multi-user.target
But when I try to run the second instance, it gives this error:
Failed to start telegraf_xyz.service: Unit telegraf_xyz.service not found.
What other changes do I need to make? I see the telegraf.service file in several other locations (/sys/), and I'm not sure where else I need to configure the second telegraf instance. I am very new to this.
Is there a better way to implement this in my setup?
NOTE: I have created the two service files and am able to run them on my host. Now the real challenge is running the second instance in another network namespace. Can anyone help me implement this?
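One possible approach to the namespace part, offered only as a sketch (nsenter and the pod PID are my assumptions, not something from the question or the linked thread): find the PID of a process inside the target pod and have the second unit join its network namespace before launching telegraf, e.g.:
ExecStart=/usr/bin/nsenter -t <pod-pid> -n /usr/bin/telegraf -config /etc/telegraf/telegraf_xyz.conf
Here <pod-pid> is a placeholder for a container process ID and telegraf_xyz.conf a hypothetical per-instance config.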
Please put your service file in /etc/systemd/system/, then reload systemd with systemctl daemon-reload. Your service should then be found.
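A sketch of the concrete steps, using the paths from the question:
cp /usr/lib/telegraf/scripts/telegraf_xyz.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now telegraf_xyz.service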
I've created the following systemd unit in the cloud-init file:
- path: /etc/systemd/system/multi-user.target.wants/docker-compose.service
  owner: root:root
  permissions: '0755'
  content: |
    [Unit]
    Description=Docker Compose Boot Up
    Requires=docker.service
    After=docker.service
    [Service]
    Type=simple
    ExecStart=/usr/local/bin/docker-compose -f /opt/data/docker-compose.yml up -d
    Restart=always
    RestartSec=30
    [Install]
    WantedBy=multi-user.target
When I try to run
sudo systemctl enable docker-compose.service
to create the symlink, I get this:
Failed to execute operation: No such file or directory
However, I'm sure that the file is under /etc/systemd/system/multi-user.target.wants.
I had the same need, but I was working from a recipe that said to create /etc/systemd/system/unit.service and then do systemctl enable --now unit.
So I created the unit file with write_files and did the reload and enable in a text/x-shellscript part, and that worked fine. (User scripts run last and in order, while I don't think there are any guarantees about when the write_files key in the user-data is processed. I found out the hard way that it is processed before the users key, so you can't set ownership to users that cloud-init creates.)
I think runcmd entries are converted to user scripts and run in list order (either before or after the other user scripts), so if you don't like text/x-shellscript parts you can do the reload and enable that way. /var/log/cloud-init.log is where I check the order; there is probably a config file too.
Full disclosure: I forgot the systemctl daemon-reload command but it still worked. Note that there is a caveat against systemd manipulations from cloud-init: cloud-init itself runs under systemd, and some systemd commands may wait for cloud-init to finish -- deadlock!
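A minimal sketch of that recipe as cloud-init user-data, assuming the unit is the docker-compose.service from the question (the unit body is elided):
#cloud-config
write_files:
  - path: /etc/systemd/system/docker-compose.service
    content: |
      [Unit]
      Description=Docker Compose Boot Up
      ...
runcmd:
  - systemctl daemon-reload
  - systemctl enable --now docker-compose.service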
After creating the unit file, but before any manipulation of it, systemd should be notified about the change:
systemctl daemon-reload
So the cloud-init YAML block creating the docker-compose.service file should be followed by:
runcmd:
  - systemctl daemon-reload
Check that every file involved is present and valid:
ls -l /etc/systemd/system/multi-user.target.wants/docker-compose.service
ls -l /usr/local/bin/docker-compose
ls -l /opt/data/docker-compose.yml
systemd-analyze verify /etc/systemd/system/multi-user.target.wants/docker-compose.service
Also consider the timing. Even if the files exist once the machine has fully booted, would /etc/systemd/system/multi-user.target.wants/ already exist at the point when cloud-init runs?
I am provisioning a cluster of CoreOS machines, but I am having trouble downloading the Kubernetes tarball because of its size (~450 MB). I have used this same technique to download the latest etcd2, fleet, and flannel, but when downloading a file as big as Kubernetes, my service fails or stops without any stack trace. I think it is related to the fact that systemd is neither waiting for nor restarting the service as I would expect. This is my service file:
[Unit]
Description=updates kubernetes v1.2
[Service]
Type=oneshot
User=root
WorkingDirectory=/home/core
ExecStart=/usr/bin/mkdir -p /opt/bin
ExecStart=/usr/bin/mkdir -p /home/core/kubernetes
ExecStart=/bin/wget https://github.com/kubernetes/kubernetes/releases/download/v1.2.0/kubernetes.tar.gz
ExecStart=/usr/bin/tar zxf /home/core/kubernetes.tar.gz -C /home/core/kubernetes --strip-components=1
ExecStart=/usr/bin/mv kubernetes/platforms/linux/amd64/kubectl /opt/bin/kubectl
ExecStart=/usr/bin/tar zxf kubernetes/server/kubernetes-server-linux-amd64.tar.gz
ExecStart=/usr/bin/chmod a+x kubernetes/server/bin/*
ExecStart=/usr/bin/mv kubernetes/server/bin/* /opt/bin
ExecStart=/usr/bin/rm -f /home/core/kubernetes.tar.gz
I bet you need to set or increase the TimeoutStartSec= parameter, which is probably defaulting to 30 seconds or something like that.
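For example, something like this in the [Service] section (the value is illustrative; pick one generous enough for your download):
[Service]
Type=oneshot
TimeoutStartSec=900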
I'm having trouble getting a Docker container to stay up when it's started by systemd. When I start it manually with sudo docker start containername, it stays up without trouble, but when it's started via systemd with sudo systemctl start containername, it stays up for 10 seconds and then mysteriously dies, leaving messages like the following in syslog:
Mar 13 14:01:09 hostname docker[329]: time="2015-03-13T14:01:09Z" level="info" msg="POST /v1.17/containers/containername/stop?t=10"
Mar 13 14:01:09 hostname docker[329]: time="2015-03-13T14:01:09Z" level="info" msg="+job stop(containername)"
I am making the assumption that it's systemd killing the process, but I can't work out why it might be happening. The systemd unit file (/etc/systemd/system/containername.service) is pretty simple, as follows:
[Unit]
Description=MyContainer
After=docker.service
Requires=docker.service
[Service]
ExecStart=/usr/bin/docker start containername
ExecStop=/usr/bin/docker stop containername
[Install]
WantedBy=multi-user.target
Docker starts fine on boot, and it even seems to start the container, but whether started on boot or manually, the container then quits after exactly 10 seconds. Help gratefully received!
Solution: the start command seems to need the -a (attach) parameter, as described in the documentation, when used in a systemd unit. I assume this is because docker start by default returns immediately instead of staying attached in the foreground, so systemd sees the main process exit and runs ExecStop; the container presumably then dies when docker stop's default 10-second grace period expires, which would explain the exact 10-second delay. systemd's option for daemonizing services (Type=forking) doesn't appear to fix the issue.
from the docker-start manpage:
-a, --attach=true|false
Attach container's STDOUT and STDERR and forward all signals to the process. The default is false.
The whole systemd unit then becomes:
[Unit]
Description=MyContainer
After=docker.service
Requires=docker.service
[Service]
ExecStart=/usr/bin/docker start -a containername
ExecStop=/usr/bin/docker stop containername
[Install]
WantedBy=multi-user.target
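After updating the unit file, reload systemd and restart the service so the change takes effect:
systemctl daemon-reload
systemctl restart containername.service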