Slurm AccountingStorageEnforce=associations without effect - hpc

I'm managing a cluster with Slurm and slurmdbd (with MySQL).
I have set the following option in /etc/slurm/slurm.conf:
AccountingStorageEnforce=associations
I have reloaded the config:
scontrol reconfig
I have configured some associations in sacctmgr.
The problem is that I am still able to submit a job as a user that does not exist in Slurm. My understanding of AccountingStorageEnforce=associations is that a nonexistent user should not be allowed to submit a job.

It seems that for this kind of modification to take effect, you need to do a full restart of the Slurm daemon; a reload is not sufficient.

The slurm.conf man page states:
AccountingStorageEnforce
This controls what level of association-based enforcement to impose on job submissions. Valid options are any combination of associations, limits, nojobs, nosteps, qos, safe, and wckeys, or all for all things.
[...]
When AccountingStorageEnforce is changed, a restart of the slurmctld daemon is required (not just a "scontrol reconfig").
So make sure to restart the Slurm controller daemon (and the secondary controller, if one is configured).
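After restarting slurmctld, it can help to confirm that the running controller actually reports the new value. The following is a minimal Python sketch, assuming that `scontrol show config` prints one `Key = value` pair per line; the helper names are hypothetical and the exact output layout may differ between Slurm versions.

```python
import subprocess


def enforce_settings(config_text: str) -> set[str]:
    """Extract the AccountingStorageEnforce values from `scontrol show config` output.

    Assumes one "Key = value" pair per line (an assumption about the output layout).
    """
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "AccountingStorageEnforce":
            return {v.strip().lower() for v in value.split(",") if v.strip()}
    return set()


def running_enforce_settings() -> set[str]:
    """Ask the running controller for its configuration (needs scontrol on PATH)."""
    result = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    )
    return enforce_settings(result.stdout)
```

On a cluster node, `"associations" in running_enforce_settings()` should be true once the controller has been restarted with the new setting.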


Is it recommended to set MARKLOGIC_USER from the default daemon to a named user?

After a default MarkLogic installation on CentOS, ML runs as the daemon user.
Everything works fine, except that I could not make a DB backup.
After some research, I found the documentation below.
https://docs.marklogic.com/guide/installation/procedures#id_32108
I wonder whether it is recommended to always set up MARKLOGIC_USER to a named user for Linux Installation.
I guess that when running ML in production, ease of upgrading ML should be important.
Whether to run the MarkLogic process as the default daemon user or as a different, specified user is a matter of preference. That said, it is generally considered best practice to run applications and services as a specified user.
https://refspecs.linuxbase.org/LSB_3.1.1/LSB-Core-generic/LSB-Core-generic/usernames.html
The daemon User ID/Group ID was used as an unprivileged User ID/Group ID for daemons to execute under in order to limit their access to the system. Generally daemons should now run under individual User ID/Group IDs in order to further partition daemons from one another.
That said, the daemon user is provided by default. If you configure MarkLogic to run as a different user, you need to ensure that the user is created and provisioned properly.
The error that you encountered when running the backup was because the daemon user didn't have permission to create the backup directory.
You can address that by adjusting the filesystem permissions and continue to run the MarkLogic process as daemon. If you choose to run the process as a different user, you still need to ensure that the chosen user has the necessary permissions to create files and directories in order to perform a backup.

Why does systemd remove my SHM file but not PostgreSQL's?

I'm developing a daemon application on Ubuntu Server that is managed by systemd.
I create a SHM file in /dev/shm/ using shm_open, and close the file descriptor after calling mmap. At first the file exists, but it disappears after a while, possibly when I log out of the server.
Perhaps this is controlled by the option RemoveIPC=yes in /etc/systemd/logind.conf.
My questions are:
Why does systemd clean up my shm file, but not the one created by PostgreSQL?
How can I modify my app to behave like PostgreSQL, so that we reduce the management/maintenance work in production?
I found that the shared memory is still usable after the file has been cleaned up by systemd. Does this mean that I can ignore the removal and continue to use the memory without recreating it?
I think your suspicion is right; see the documentation for details:
If systemd is in use, some care must be taken that IPC resources (including shared memory) are not prematurely removed by the operating system. This is especially of concern when installing PostgreSQL from source. Users of distribution packages of PostgreSQL are less likely to be affected, as the postgres user is then normally created as a system user.
The setting RemoveIPC in logind.conf controls whether IPC objects are removed when a user fully logs out. System users are exempt. This setting defaults to on in stock systemd, but some operating system distributions default it to off.
[...]
A “user logging out” might happen as part of a maintenance job or manually when an administrator logs in as the postgres user or something similar, so it is hard to prevent in general.
What is a “system user” is determined at systemd compile time from the SYS_UID_MAX setting in /etc/login.defs.
Packaging and deployment scripts should be careful to create the postgres user as a system user by using useradd -r, adduser --system, or equivalent.
Alternatively, if the user account was created incorrectly or cannot be changed, it is recommended to set
RemoveIPC=no
in /etc/systemd/logind.conf or another appropriate configuration file.
While this is talking about PostgreSQL, the same applies to your software. So take one of the recommended measures.
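To check which side of the "system user" boundary an account falls on, one could compare its UID with SYS_UID_MAX. The following Python sketch is only an approximation: as the quoted documentation says, systemd fixes the threshold at compile time, so the value currently in /etc/login.defs may not match. The function names and the fallback of 999 are assumptions made for illustration.

```python
import re


def parse_sys_uid_max(login_defs_text: str, default: int = 999) -> int:
    """Return SYS_UID_MAX from the text of a login.defs file.

    Falls back to `default` (an assumed value) when the setting is absent
    or commented out.
    """
    for line in login_defs_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = re.match(r"\s*SYS_UID_MAX\s+(\d+)", line)
        if match:
            return int(match.group(1))
    return default


def is_system_uid(uid: int, sys_uid_max: int) -> bool:
    """A UID at or below SYS_UID_MAX is treated as a system user."""
    return uid <= sys_uid_max
```

On the server, the inputs would be the contents of /etc/login.defs and the UID the daemon runs under (for example from `os.getuid()`). If the check returns False, the account is a regular user and its IPC objects are subject to RemoveIPC.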

Statefulset - Possible to Skip creation of pod 0 when it fails and proceed with the next one?

I currently have a problem with a StatefulSet under the following conditions:
I have a Percona SQL cluster running with persistent storage and 2 nodes.
Now I force both pods to fail:
First I force pod-0 to fail.
Afterwards I force pod-1 to fail.
Now the cluster cannot recover without manual intervention and possible data loss.
Why:
The statefulset is trying to bring pod-0 up first, however this one will not be brought online because of the following message:
[ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1
What I could do alternatively, but don't really like:
I could change ".spec.podManagementPolicy" to "Parallel", but this could lead to race conditions when forming the cluster. I would like to avoid that; I basically like the idea of starting the nodes one after another.
What I would like to have:
The possibility to keep ".spec.podManagementPolicy":"OrderedReady" but adjust the order somehow.
The ability to put specific pods into an "inactive" mode so they are ignored until I enable them again.
Is something like that available? Does anyone have other ideas?
Unfortunately, nothing like that is available in standard functions of Kubernetes.
I see only 2 options here:
Use InitContainers to check the current state on relaunch.
That lets you run any code before the primary container starts, so you can use a custom script to try to resolve the problem.
Modify the database startup script so that it waits for an environment variable or a flag file, and use a PostStart hook to check the state before the database runs.
In both cases, though, you have to write your own startup-order logic.
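As a starting point for such a script, an init container could read the grastate.dat file named in the error message and report whether the node is marked safe to bootstrap. This is a minimal Python sketch, assuming the file consists of `key: value` lines with `#` comments; the function names are hypothetical, and deciding what to do with the result (wait, skip, or bootstrap) is still your own logic.

```python
def parse_grastate(text: str) -> dict[str, str]:
    """Parse the contents of a grastate.dat file into a dict.

    Assumes "key: value" lines; blank lines and '#' comment lines are skipped.
    """
    state: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    return state


def safe_to_bootstrap(text: str) -> bool:
    """True only if the file explicitly sets safe_to_bootstrap to 1."""
    return parse_grastate(text).get("safe_to_bootstrap") == "1"
```

The init container would read the file from the pod's persistent volume (the mount path depends on your chart or manifest) and exit non-zero or wait when `safe_to_bootstrap` returns False.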

When marathon restarts process, possible to pass different command-line flag?

I notice that when I run a process under Marathon and restart it, the process automatically starts back up. The way the process works, if it is restarted, it enters a recovery mode in which it tries to replay its state. Recovery mode is entered when a command-line flag such as "-r" is seen. I want to append this flag to the cmd that was initially used at startup in Marathon. Is there an option in Marathon for this?
I solved my issue by using an event subscriber in Marathon. By using PUT with curl rather than POST, you can modify an existing deployment instead of creating a brand new one.
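The PUT request described here can be sketched in Python as follows. It targets Marathon's /v2/apps/{appId} endpoint and changes only the cmd field; the host name, app id and command in the usage line are hypothetical placeholders, and the function only builds the request without sending it.

```python
import json
import urllib.request


def build_cmd_update(marathon_url: str, app_id: str, cmd: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that replaces an app's cmd."""
    body = json.dumps({"cmd": cmd}).encode("utf-8")
    url = f"{marathon_url.rstrip('/')}/v2/apps/{app_id.strip('/')}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

For example, `build_cmd_update("http://marathon.example:8080", "/my-service", "./server -r")` produces the request, and passing it to `urllib.request.urlopen` would submit it. An event subscriber could issue this request when it sees the app's task fail, so that the next launch carries the recovery flag.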

Persistent storage for Apache Mesos

Recently I discovered Apache Mesos.
It all looks amazing in the demos and examples. I can easily imagine how one would run stateless jobs; that fits the whole idea naturally.
But how does one deal with long-running jobs that are stateful?
Say, I have a cluster that consists of N machines (and that is scheduled via Marathon). And I want to run a postgresql server there.
That's it - at first I don't even want it to be highly available, but just simply a single job (actually Dockerized) that hosts a postgresql server.
1- How would one organize it? Constrain a server to a particular cluster node? Use some distributed FS?
2- DRBD, MooseFS, GlusterFS, NFS, CephFS: which of those play well with Mesos and services like postgres? (I'm thinking here of the possibility that Mesos/Marathon could relocate the service if it goes down.)
3- Please tell me if my approach is wrong in terms of philosophy (DFS for data servers and some kind of switchover for servers like postgres on top of Mesos).
Question largely copied from Persistent storage for Apache Mesos, asked by zerkms on Programmers Stack Exchange.
Excellent question. Here are a few upcoming features in Mesos to improve support for stateful services, and corresponding current workarounds.
Persistent volumes (0.23): When launching a task, you can create a volume that exists outside of the task's sandbox and will persist on the node even after the task dies/completes. When the task exits, its resources -- including the persistent volume -- can be offered back to the framework, so that the framework can launch the same task again, launch a recovery task, or launch a new task that consumes the previous task's output as its input.
Current workaround: Persist your state in some known location outside the sandbox, and have your tasks try to recover it manually. Maybe persist it in a distributed filesystem/database, so that it can be accessed from any node.
Disk Isolation (0.22): Enforce disk quota limits on sandboxes as well as persistent volumes. This ensures that your storage-heavy framework won't be able to clog up the disk and prevent other tasks from running.
Current workaround: Monitor disk usage out of band, and run periodic cleanup jobs.
Dynamic Reservations (0.23): Upon launching a task, you can reserve the resources your task uses (including persistent volumes) to guarantee that they are offered back to you upon task exit, instead of going to whichever framework is furthest below its fair share.
Current workaround: Use the slave's --resources flag to statically reserve resources for your framework upon slave startup.
As for your specific use case and questions:
1a) How would one organize it? You could do this with Marathon, perhaps creating a separate Marathon instance for your stateful services, so that you can create static reservations for the 'stateful' role, such that only the stateful Marathon will be guaranteed those resources.
1b) Constrain a server to a particular cluster node? You can do this easily in Marathon, constraining an application to a specific hostname, or any node with a specific attribute value (e.g. NFS_Access=true). See Marathon Constraints. If you only wanted to run your tasks on a specific set of nodes, you would only need to create the static reservations on those nodes. And if you need discoverability of those nodes, you should check out Mesos-DNS and/or Marathon's HAProxy integration.
1c) Use some distributed FS? The data replication provided by many distributed filesystems would guarantee that your data can survive the failure of any single node. Persisting to a DFS would also provide more flexibility in where you can schedule your tasks, although at the cost of the difference in latency between network and local disk. Mesos has built-in support for fetching binaries from HDFS uris, and many customers use HDFS for passing executor binaries, config files, and input data to the slaves where their tasks will run.
2) DRBD, MooseFS, GlusterFS, NFS, CephFS? I've heard of customers using CephFS, HDFS, and MapRFS with Mesos. NFS would seem an easy fit too. It really doesn't matter to Mesos what you use as long as your task knows how to access it from whatever node where it's placed.
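The constraints mentioned in 1b) are expressed in the Marathon app definition as [field, operator, value] triples. Here is a minimal Python sketch that builds such a definition with the CLUSTER operator; the app id, command and host name in the usage line are hypothetical, and a real definition would also need resource settings such as cpus and mem.

```python
import json


def pinned_app(app_id: str, cmd: str, field: str, value: str) -> dict:
    """Return a Marathon app definition restricted to matching nodes.

    `field` is either "hostname" or the name of an agent attribute;
    the CLUSTER operator restricts tasks to nodes where it equals `value`.
    """
    return {
        "id": app_id,
        "cmd": cmd,
        "instances": 1,
        "constraints": [[field, "CLUSTER", value]],
    }


def to_json(app: dict) -> str:
    """Serialize the definition for submission to Marathon."""
    return json.dumps(app, indent=2)
```

`pinned_app("/postgres", "postgres", "hostname", "node1.example.com")` pins the app to one host, while `pinned_app("/postgres", "postgres", "NFS_Access", "true")` allows any node that advertises that attribute.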
Hope that helps!