Cannot enforce memory limits in SLURM - hpc

I am using Slurm on a single node (it is both the control and compute node) and I cannot seem to correctly limit memory. The script calls sbatch with a small memory request (3G), but I see values in top that exceed 25G. squeue reports the requested values correctly:
squeue -o "%C %m"
CPUS MIN_MEMORY
2 3G
This is my slurm.conf:
#
SlurmctldHost=schopenhauer
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
#ProctrackType=proctrack/cgroup
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300000
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm/slurm_jobacct.log
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
JobCompLoc=/var/log/slurm/slurm_jobcomp.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=schopenhauer CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 State=UNKNOWN
PartitionName=short Nodes=schopenhauer Default=YES MaxTime=INFINITE State=UP
Did I misunderstand something? Why does squeue report the memory as MIN_MEMORY when I want that value to act as both the minimum and the maximum?
EDIT: I just noticed, by setting the required memory to a larger value, that this does not work as a minimum either, i.e. many tasks were started even though there was enough RAM for only 12 of them (I requested 40G each and the node has 500G). Is this the same problem?

Slurm controls memory through the Linux cgroup functionality. You need to set TaskPlugin=task/cgroup in slurm.conf (cf. https://slurm.schedmd.com/cgroups.html) and ConstrainRAMSpace=yes in cgroup.conf (cf. https://slurm.schedmd.com/cgroup.conf.html). Then the memory requested by jobs with --mem or --mem-per-cpu effectively becomes a hard limit in addition to being a resource request.
The -m option of squeue shows the memory requested by the job. As a request, it is considered a minimum requirement, but once cgroup enforcement is configured it effectively becomes the maximum as well.
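As a sketch of what those changes look like (parameter names are taken from the pages linked above; this assumes cgroup support is available on the node):
In slurm.conf, replacing the current proctrack/linuxproc line:
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
In cgroup.conf (created next to slurm.conf):
ConstrainRAMSpace=yes
After restarting slurmctld and slurmd, a step that grows past its --mem request is confined by the kernel's memory cgroup instead of running unchecked.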

I don't think Slurm enforces memory or CPU usage; the request is just there as an indication of what you think your job's usage will be. To set a binding memory limit you could use ulimit at the beginning of your script, e.g. ulimit -v with the limit given in kilobytes (bash's ulimit -v does not accept a suffix like 3G).
Just know that this will likely cause problems if your program actually requires the amount of memory it requests, because it won't be able to get it and won't finish successfully.
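For what it's worth, a minimal batch-script sketch of that ulimit approach (the program name is a placeholder; note that ulimit -v limits virtual address space, which is not quite the same thing as resident memory):
#!/bin/bash
#SBATCH --mem=3G
# 3G expressed in kilobytes, since bash's ulimit -v does not accept a G suffix
ulimit -v $((3 * 1024 * 1024))
./my_program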

Related

what is `__GI_memset`? why does it cost so much CPU resource?

I'm new to perf and I'm trying to use it to analyse my program.
This is what I got when running perf top:
PerfTop: 296 irqs/sec kernel:62.8% exact: 0.0% [1000Hz cycles:ppp], (all, 6 CPUs)
-----------------------------------------------------------------------------------------------------------------------
65.43% libc-2.23.so [.] __GI_memset
1.55% libopencv_imgcodecs.so.4.4.0 [.] cv::icvCvt_BGR2RGB_8u_C3R
1.54% libc-2.23.so [.] malloc
1.32% libc-2.23.so [.] _int_free
0.92% [kernel] [k] clear_page
0.91% libjpeg.so.8.0.2 [.] 0x000000000001b828
0.90% libc-2.23.so [.] memcpy
So I just wonder what is costing 65% of my CPU time. Is it really just memset in libc?
If it is, how can it cost this much?
what is __GI_memset?
It's an internal alias for memset.
why does it cost so much CPU resource?
Because you call it a lot, or because you give it a lot of memory to set to some value.
Judging by your next most expensive symbol, cv::icvCvt_BGR2RGB_8u_C3R, you are doing some kind of image processing, and possibly are allocating cleared (zero-initialised) images.
One common mistake is to allocate a cleared image and immediately overwrite it with something else (thus wasting the time spent clearing it), but there is not enough information here to deduce whether that is what you are doing.
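If you want to find out, recording call graphs lets perf attribute the memset samples to their callers; a minimal sketch (the binary name is a placeholder):
perf record -g ./my_program     # sample with call-graph collection
perf report -g                  # expand __GI_memset to see the call chains leading into it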

SB37 JCL error - too small a size?

The following space allocation is giving me an SB37 JCL error. The COBOL record size of the output file is 100 bytes and the LRECL is 100 bytes. What do you think is causing this error? I have tried increasing the space to (500,100) and still get the same error.
Code:
//OUTPUT1 DD DSN=A.B.C,DISP=(NEW,CATLG,DELETE),
// DCB=(LRECL=100,BLKSIZE=,RECFM=FBM),
// SPACE=(CYL,(10,5),RLSE)
Try to increase not only the space, but the volume count as well.
Include VOL=(,,,#) in your DD, where # is the number of volumes you want to allow.
Ex: SPACE=(CYL,(10,5),RLSE),VOL=(,,,3) allows 3 volumes.
Additionally, you can increase the size, but try to stay within reasonable limits :)
The documentation for B37 says the application programmer should respond as indicated for message IEC030I. The documentation for IEC030I says, in part...
Probable user error. For all cases, allocate as many units as volumes
required.
...as noted in another answer. However, be advised that the documentation for the VOL parameter of the DD statement says...
If you omit the volume count or if you specify 1 through 5, the system
allows up to five volumes; if you specify 6 through 20, the system
allows 20 volumes; if you specify a count greater than 20, the system
allows 5 plus a multiple of 15 volumes. You can override the maximum
volume count in data class by using the volume-count subparameter. The
maximum volume count for an SMS-managed mountable tape data set or a
Non-managed tape data set is 255.
...so for DASD allocations you are best served specifying a volume count greater than 5 (at least).
//OUTPUT1 DD DSN=A.B.C,DISP=(NEW,CATLG,DELETE),
// DCB=(LRECL=100,BLKSIZE=,RECFM=FBM),
// SPACE=(CYL,(10,5),RLSE)
Try this instead (see the JCL below). Notice that the large secondary can take advantage of DSNTYPE=LARGE; without that parameter, the largest secondary allocation that makes any sense is < 300. Also, if this is indeed written from a COBOL program, make sure that the FD says "BLOCK 0". If it isn't "BLOCK 0", you might not even need to change your JCL, because the data set wasn't really fixed blocked: it was merely fixed and unblocked, so the space would almost never be enough. You may also wish to revisit why you have the M in the RECFM to begin with. Finally, notice that I took out the LRECL, the BLKSIZE and the RECFM: the FD in the COBOL program is all you need, and putting them in the JCL is not only redundant but dangerous, because any change now has to be made in multiple places.
//OUTPUT1 DD DSN=A.B.C,DISP=(NEW,CATLG,DELETE),
// DSNTYPE=LARGE,UNIT=(SYSALLDA,59),
// SPACE=(CYL,(10,1000),RLSE)
There is a limit of 65,535 tracks per volume for a basic-format data set. So if you specify a SPACE that exceeds that limit, the system will simply ignore it.
You can raise this limit to 16,777,215 tracks by adding the DSNTYPE=LARGE parameter.
Or you can specify that your dataset is multi-volume by adding VOL=(,,,3).
You can also use the DATACLAS=xxxx parameter here, but first you need to find a suitable one. The easy way is to contact your local storage team and ask for one. Or, if you are familiar with ISPF navigation, you can enter the ISMF;4 command to open a panel and
use the parameters below before hitting Enter:
CDS Name . . . . . . 'ACTIVE'
Data Class Name . . *
It should produce a list of all available data classes. Find the one that suits you (has a large enough volume count and does not limit primary and secondary space).

How does Solaris SMF determine if something is to be in maintenance or to be restarted?

I have a daemon process that I wrote, executed by SMF. The problem is that when an error occurs the code fails, and the daemon then needs to restart from scratch. Right now it exits with sys.exit(0) (Python), but SMF keeps throwing it into maintenance mode.
I've worked with SMF enough to know that it sometimes auto-restarts certain services (and lets others fail and have you deal with them like this). How do I classify this process as one that needs to auto-restart? Is it an SMF setting, a method of failing, what?
Manpage
Solaris uses a combination of startd/critical_failure_count and startd/critical_failure_period as described in the svc.startd manpage:
startd/critical_failure_count
startd/critical_failure_period
The critical_failure_count and critical_failure_period properties together specify the maximum number of service failures allowed in a given time interval before svc.startd transitions the service to maintenance. If the number of failures exceeds critical_failure_count in any period of critical_failure_period seconds, svc.startd will transition the service to maintenance.
Defaults in the source code
The defaults can be found in the source; the value depends on whether the service is "wait style":
if (instance_is_wait_style(inst))
critical_failure_period = RINST_WT_SVC_FAILURE_RATE_NS;
else
critical_failure_period = RINST_FAILURE_RATE_NS;
The defaults are either 5 failures/10 minutes or 5 failures/second:
#define RINST_START_TIMES 5 /* failures to consider */
#define RINST_FAILURE_RATE_NS 600000000000LL /* 1 failure/10 minutes */
#define RINST_WT_SVC_FAILURE_RATE_NS NANOSEC /* 1 failure/second */
These variables can be set in the SMF as properties:
<service_bundle type="manifest" name="npm2es">
<service name="site/npm2es" type="service" version="1">
...
<property_group name="startd" type="framework">
<propval name='critical_failure_count' type='integer' value='10'/>
<propval name='critical_failure_period' type='integer' value='30'/>
<propval name="ignore_error" type="astring" value="core,signal" />
</property_group>
...
</service>
</service_bundle>
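If you would rather not edit and re-import the manifest, the same properties can be set on the live service with svccfg (a sketch using the example service above; adjust the FMRI to your own service):
svccfg -s site/npm2es addpg startd framework    # only needed if the property group does not already exist
svccfg -s site/npm2es setprop startd/critical_failure_count = integer: 10
svccfg -s site/npm2es setprop startd/critical_failure_period = integer: 30
svcadm refresh site/npm2es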
TL;DR
After checking against the startd values: if the service is "wait style", it will be throttled to a maximum of one restart per second until it no longer exits with a non-configuration error. If the service is not "wait style", it will be put into maintenance mode.
Presuming a normal service manifest, I would suspect that you're dropping into maintenance because SMF is restarting you "too quickly" (which is a bit arbitrarily defined). svcs -xv should tell you if that is the case. If it is, SMF is restarting you, you're exiting again rapidly, and it has decided to give up until the problem is fixed (and you've manually run svcadm clear).
I'd wondered if exiting 0 (and indicating success) may cause further confusion, but it doesn't appear that it will.
I don't think Oracle Solaris allows you to tune what SMF considers "too quickly".
You have to create a service manifest; this is somewhat involved. The white paper below contains example manifests and documents the manifest structure.
http://www.oracle.com/technetwork/server-storage/solaris/solaris-smf-manifest-wp-167902.pdf
As it turns out, I had two pkills in a row to make sure everything was terminated correctly. The second one, naturally, was exiting with something other than 0. Changing this to include an exit 0 at the end of the script solved the problem.
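In other words, the fix amounts to making the method script always report success to SMF, roughly like this (a sketch; the process names are placeholders, and pkill exits non-zero when it finds nothing to kill):
pkill -f mydaemon.py
pkill -f mydaemon_helper.py
exit 0    # report success even if the second pkill matched nothing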

Getting the IO count

I am using the Xen hypervisor and I am trying to get the I/O count of the VMs running on top of it. Can someone suggest a way or a tool to get the I/O count? I tried using xenmon and virt-top: virt-top doesn't give any value and xenmon always shows 0. Any suggestions for getting the number of read or write calls made by a VM, or the read and write (block I/O) bandwidth of a particular VM? Thanks!
You can read this directly from sysfs on most systems. You want to open the following directory:
/sys/devices/xen-backend
And look for directories starting with vbd-
The nomenclature is:
vbd-{domain_id}-{vbd_id}/statistics
Inside, you'll find what you need, which is:
br_req - Number of barrier requests
oo_req - Number of 'out of' requests (no room left in the ring to service a given request)
rd_req - Number of read requests
rd_sect - Number of sectors read
wr_req - Number of write requests
wr_sect - Number of sectors written
The br_req value is an aggregate count of things like write barriers, aborts, etc., rather than ordinary reads or writes.
Note, for this to work, the kernel has to be told to export Xen attributes via sysfs, but most Xen packages have this enabled. Additionally, the location in sysfs might be different with earlier versions of Xen.
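A quick way to dump those counters for every virtual block device is a small loop like this (a sketch; as noted above, the sysfs location may differ on other Xen versions):
for d in /sys/devices/xen-backend/vbd-*/statistics; do
    echo "== $d"
    for f in rd_req rd_sect wr_req wr_sect oo_req; do
        [ -f "$d/$f" ] && echo "  $f: $(cat "$d/$f")"
    done
done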
Have you tried xentop?
There is also bwm-ng (check your distro). It shows block utilization per disk (real/virtual). If you know the name of the virtual disk attached to the VM, then you can use bwm-ng to get those stats.

Max number of socket on Linux

It seems that the server is limited to ~32720 sockets...
I have tried every known variable change to raise this limit.
But the server stays limited to 32720 open sockets, even though there are still 4 GB of free memory and 80% idle CPU...
Here's the configuration
~# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 63931
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 798621
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 2048
cpu time (seconds, -t) unlimited
max user processes (-u) 63931
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
net.netfilter.nf_conntrack_max = 999999
net.ipv4.netfilter.ip_conntrack_max = 999999
net.nf_conntrack_max = 999999
Any thoughts?
If you're dealing with openssl and threads, go check your /proc/sys/vm/max_map_count and try to raise it.
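For example (a sketch; 262144 is just an arbitrary higher value, not a recommendation):
cat /proc/sys/vm/max_map_count        # the default on many kernels is 65530
sysctl -w vm.max_map_count=262144     # raise it for the running kernel (run as root)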
In IPv4, the TCP layer has 16 bits for the destination port, and 16 bits for the source port.
See http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Seeing that your limit is ~32K, I would expect that you are actually seeing the limit on outbound TCP connections you can make. You should be able to get a maximum of 65K sockets (this would be the protocol limit); that is the limit on the total number of named connections. Fortunately, binding a port for incoming connections only uses one. But if you are trying to test the number of connections from the same machine, you can only have 65K total outgoing connections (for TCP). To test the number of incoming connections, you will need multiple computers.
Note: you can call socket(AF_INET,...) up to the number of file descriptors available, but
you cannot bind them without increasing the number of ports available. To increase the range, do this:
echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range
(cat it to see what you currently have; the default is 32768 to 61000)
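To make the wider port range survive a reboot, the equivalent /etc/sysctl.conf entry is:
net.ipv4.ip_local_port_range = 1024 65535
Run sysctl -p afterwards to apply it without rebooting.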
Perhaps it is time for a new TCP-like protocol that allows 32 bits for the source and destination ports? But how many applications really need more than 65 thousand outbound connections?
The following will allow 100,000 incoming connections on Linux Mint 16 (64-bit).
(You must run it as root to set the limits.)
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

void ShowLimit()
{
    struct rlimit lim;
    int err = getrlimit(RLIMIT_NOFILE, &lim);
    printf("%1d limit: %1ld,%1ld\n", err, (long)lim.rlim_cur, (long)lim.rlim_max);
}

int main()
{
    ShowLimit();

    /* Raise the open-file limit to 100,000 (requires root). */
    struct rlimit lim;
    lim.rlim_cur = 100000;
    lim.rlim_max = 100000;
    int err = setrlimit(RLIMIT_NOFILE, &lim);
    printf("set returned %1d\n", err);
    ShowLimit();

    /* Listen on TCP port 80 and count accepted connections. */
    int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    struct sockaddr_in maddr;
    maddr.sin_family = AF_INET;
    maddr.sin_port = htons(80);
    maddr.sin_addr.s_addr = INADDR_ANY;
    err = bind(sock, (struct sockaddr *)&maddr, sizeof(maddr));
    err = listen(sock, 1024);

    int sockets = 0;
    while (1)
    {
        struct sockaddr_in raddr;
        socklen_t rlen = sizeof(raddr);
        err = accept(sock, (struct sockaddr *)&raddr, &rlen);
        if (err >= 0)
        {
            ++sockets;
            printf("%1d sockets accepted\n", sockets);
        }
    }
}
Which server are you talking about? It might be that it has a hardcoded maximum, or that it runs into other limits (max threads, out of address space, etc.).
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1 has some tuning to needed to achieve a lot of connection, but it doesn't help if the server application limits it in some way or another.
Check the real limits of the running process with:
cat /proc/{pid}/limits
The maximum number of open files (nofile) is determined by the kernel; the following, run as root, would increase the maximum to 100,000 "files", i.e. roughly 100k concurrent connections:
echo 100000 > /proc/sys/fs/file-max
To make it permanent edit /etc/sysctl.conf
fs.file-max = 100000
You then need the server to ask for more open files; this is configured differently for each server. In nginx, for example, you set
worker_rlimit_nofile 100000;
Restart nginx and check /proc/{pid}/limits.
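For example, to confirm the new limit actually reached the worker processes (the process-name pattern assumes a standard nginx setup):
grep "Max open files" /proc/$(pgrep -f "nginx: worker" | head -n1)/limits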
To test this you need 100,000 sockets in your client; in that testing you are limited by the number of TCP ports available per IP address.
To increase the local port range to its maximum...
echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range
This gives you ~64000 ports to test with.
If that is not enough, you need more IP addresses. When testing on localhost you can bind the source/client to an IP other than 127.0.0.1 / localhost.
For example you can bind your test clients to IPs randomly selected from 127.0.0.1 to 127.0.0.5
Using apache-bench you would set
-B 127.0.0.x
Nodejs sockets would use
localAddress
/etc/security/limits.conf configures PAM limits: it's usually irrelevant for a server.
If the server is proxying requests using TCP, using upstream or mod_proxy for example, the server is limited by ip_local_port_range. This could easily be the 32,000 limit.
If you're considering an application where you believe you need to open thousands of sockets, you will definitely want to read about The C10k Problem. That page discusses many of the issues you will face as you scale up your number of client connections to a single server.
On GNU/Linux, the maximum is what you wrote. This number is (probably) stated somewhere in the networking standards. I doubt you really need so many sockets; you should optimize the way you are using sockets instead of creating dozens all the time.
In net/socket.c the fd is allocated in sock_alloc_fd(), which calls get_unused_fd().
Looking at linux/fs/file.c, the only limit to the number of fd's is sysctl_nr_open, which is limited to
int sysctl_nr_open_max = 1024 * 1024; /* raised later */
/// later...
sysctl_nr_open_max = min((size_t)INT_MAX, ~(size_t)0/sizeof(void *)) &
-BITS_PER_LONG;
and can be read using sysctl fs.nr_open which gives 1M by default here. So the fd's are probably not your problem.
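For reference, you can inspect and raise that bound with sysctl (run as root; the new value is arbitrary):
sysctl fs.nr_open                  # upper bound for a process's RLIMIT_NOFILE, 1048576 by default
sysctl -w fs.nr_open=2097152       # raise it if you ever do hit it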
Edit: you then probably checked this as well, but would you care to share the output of
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main() {
    struct rlimit limit;
    getrlimit(RLIMIT_NOFILE, &limit);
    printf("cur: %ld, max: %ld\n", (long)limit.rlim_cur, (long)limit.rlim_max);
    return 0;
}
with us?
Generally, having too many live connections is a bad thing. However, everything depends on the application and the patterns with which it communicates with its clients.
I suppose there is a pattern where clients have to be permanently and asynchronously connected, and where that is the only way a distributed solution can work.
Assuming there are no bottlenecks in memory/CPU/network for the current load, and keeping in mind that leaving idle connections open is the only way a distributed application consumes fewer resources (for example, connection setup time and overall/peak memory), overall OS network performance might be higher than when following the best practices we all know.
It is a good question and it needs a solution. The problem is that nobody can answer it from here. I would suggest using a divide-and-conquer technique, and when the bottleneck is found, coming back to us.
Please take apart your application on a testbed and you will find the bottleneck.