Cannot allocate GPU in Slurm - distributed-computing

I have a problem allocating GPU resources on a Slurm cluster.
When I specify 1 GPU and run as shown below, it fails with an invalid generic resource (gres) specification error. The result is the same if I request more than one.
$ srun --gres=gpu:1 --pty bash
srun: error: Unable to create step for job 73: Invalid generic resource (gres) specification
The compute nodes' gres information seems to come out correctly, as shown below:
$ sinfo -o "%20N %10c %10m %25f %10G "
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
gpu_svr[1-4]         72         515484     (null)                    gpu:8
The node configuration in slurm.conf is as below:
/etc/slurm/slurm.conf
GresTypes=gpu
NodeName=gpu_svr1 NodeAddr=x.x.x.1 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
NodeName=gpu_svr2 NodeAddr=x.x.x.2 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
NodeName=gpu_svr3 NodeAddr=x.x.x.3 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
NodeName=gpu_svr4 NodeAddr=x.x.x.4 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
PartitionName=v100 Nodes=ALL Default=YES MaxTime=INFINITE State=UP
Here is gres.conf on the compute nodes:
gres.conf
NodeName=gpu_svr[1-4] Name=gpu File=/dev/nvidia[0-7]
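For reference, what the controller has actually registered for a node can also be cross-checked with scontrol (just a sanity check; the Gres= line in its output should match the slurm.conf entry):
$ scontrol show node gpu_svr1 | grep -i gres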

Solved.
The following options need to be set in slurm.conf:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
JobAcctGatherType=jobacct_gather/cgroup
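Note that changing SelectType generally requires restarting the Slurm daemons rather than a mere reconfigure. A minimal sketch of applying and re-testing, assuming systemd-managed services:
$ sudo systemctl restart slurmctld    # on the controller
$ sudo systemctl restart slurmd       # on each gpu_svr node
$ srun --gres=gpu:1 --pty bash        # re-test the allocation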

Related

containerd - cannot update memory of a running container lower than its current memory

I am using the 'crictl' tool to work with containerd runtime containers (under Kubernetes) in a managed cluster.
I'm trying to set the memory limit (in bytes) to 16MB with the command:
crictl -r unix:///run/containerd/containerd.sock update --memory 16777216 c60df9ef3381e
And I get the following error:
E1219 11:10:11.616194 1241 remote_runtime.go:640] "UpdateContainerResources from runtime service failed" err=<
rpc error: code = Unknown desc = failed to update resources: failed to update resources: /usr/bin/runc did not terminate successfully: exit status 1: unable to set memory limit to 16777216 (current usage: 97058816, peak usage: 126517248)
: unknown
> containerID="c60df9ef3381e"
FATA[0000] updating container resources for "c60df9ef3381e": rpc error: code = Unknown desc = failed to update resources: failed to update resources: /usr/bin/runc did not terminate successfully: exit status 1: unable to set memory limit to 16777216 (current usage: 97058816, peak usage: 126517248)
: unknown
At first I thought that maybe I cannot directly set a memory limit on a running container lower than the limit that appears in the Kubernetes YAML.
Here are the limits from K8s:
Requests:{"cpu":"100m","memory":"64Mi"} Limits:{"cpu":"200m","memory":"128Mi"}
But no: even setting a memory limit above the K8s request (e.g. 65MB) gives the same error!
This works on the Docker runtime: I'm able to lower the container's memory limit. Yes, the container might crash, but the operation works.
Then I tried giving a memory limit higher than the current usage, and it succeeded...
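For reference, a sketch of that working sequence: read the current usage first, then pick a limit above it (the 134217728 below is just the pod's existing 128Mi limit expressed in bytes, chosen for illustration):
crictl -r unix:///run/containerd/containerd.sock stats c60df9ef3381e
crictl -r unix:///run/containerd/containerd.sock update --memory 134217728 c60df9ef3381e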
Can anyone help me understand this error and what might be causing it on the containerd runtime? Is this indeed a limitation, i.e. that I cannot set a limit lower than the memory the container is currently using? Is there a way to overcome it?
Thanks a lot for your time!!!

Enable custom kubernetes scheduler for a namespace

I have a k8s Job that brings up multiple pods. This Job is used for load testing, so all the pods need to come up at the same time; the Job shouldn't start until enough nodes are available for all the pods to be scheduled.
I came across kube-batch (https://github.com/kubernetes-sigs/kube-batch) for this kind of scheduling. I have a couple of questions:
1. How do I enable kube-batch for only one namespace in a cluster?
2. I installed kube-batch by following the tutorial, but the pods fail on startup with the error below. How do I resolve it?
I1204 20:07:55.911393 1 allocate.go:96] Queue <default> is overused, ignore it.
I1204 20:07:55.911399 1 allocate.go:194] Leaving Allocate ...
I1204 20:07:55.911407 1 backfill.go:41] Enter Backfill ...
I1204 20:07:55.911413 1 backfill.go:71] Leaving Backfill ...
E1204 20:07:55.911521 1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework/session.go:368
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/plugins/gang/gang.go:154
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework/framework.go:58
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:102
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:85
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1333
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x148 pc=0x10ab979]
I'm not sure that what you are trying to achieve is doable. In my opinion, what you can do is modify the pods' Dockerfile to include Supervisord. Then, in supervisord, specify the commands you want to run when the pods come into the running state, using supervisord priorities.
Example
[program:api]
directory=/usr/local
command=go run main.go
priority=100
autostart=true
autorestart=true
stderr_logfile=/var/log/api.err.log
stdout_logfile=/var/log/api.out.log
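For completeness, a sketch of wiring this into the image (the config path and file names here are assumptions, not from the original setup):
# Dockerfile fragment: install the config and run supervisord in the foreground
COPY supervisord.conf /etc/supervisor/conf.d/api.conf
CMD ["supervisord", "-n"]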

Enable FTRACE in ARM to trace realtime characteristic of system

I followed this article to enable FTRACE:
https://lwn.net/Articles/365835/
to test a realtime system. My system uses an ARM Cortex-A15 (description: https://mp.renesas.com/en-us/rzg/marketplace/board/RZGB000003.html)
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_STACK_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
But it didn't work; it caused the system to hang when starting the kernel.
I even referred to How to Enable or configure ftrace module.
I would like to test latency in the realtime system with cyclictest (the -b option triggers the ftrace tracer):
cyclictest -a -t -n -p99 -f -b100
It generated this dump message:
INFO: debugfs mountpoint: /sys/kernel/debug/tracing/
WARN: tracing_enabled or tracing_on not found
debug fs not mounted, TRACERs not configured?
could not set ftrace_enabled to 0
FATAL: Can't open /sys/kernel/debug/tracing/available_tracers for reading
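As an aside, a missing debugfs mount alone can be ruled out with a manual check (a sketch, assuming debugfs support is compiled in; in my case the tracers simply weren't built into the kernel):
mount -t debugfs nodev /sys/kernel/debug
cat /sys/kernel/debug/tracing/available_tracers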
I then repeated the step with a larger group of tracer configs:
CONFIG_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FUNCTION_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_PREEMPT_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_ENABLE_DEFAULT_TRACERS=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_TRACEPOINT_BENCHMARK=y
CONFIG_BACKTRACE_SELF_TEST=y
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_LL=y
The result was still the same: the kernel hung and didn't show anything.
Can anyone who deals with realtime systems and ftrace help? Thanks.
I resolved my problem. Below is part of my defconfig file.
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACEPOINTS=y
CONFIG_STACKTRACE=y
CONFIG_NOP_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_GENERIC_TRACER=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
CONFIG_STACK_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
After enabling ftrace, the culprit can then be found in the trace output at /sys/kernel/debug/tracing/trace: the kernel function that executed just before a latency of more than 100 microseconds was detected is marked with an exclamation mark.
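A minimal sketch of that workflow, reusing the cyclictest invocation from above (in function_graph output, '+' marks durations over 10 microseconds and '!' over 100):
cyclictest -a -t -n -p99 -f -b100
grep '!' /sys/kernel/debug/tracing/trace | head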

Red Hat OSP10 deploy fails on node profile tag, even though it is configured

I am trying to deploy RHOSP10, and when I get to the "openstack overcloud deploy" phase, I get these errors:
Error: only 0 of 1 requested ironic nodes are tagged to profile control (for flavor control)
Recommendation: tag more nodes using ironic node-update <NODE ID> replace properties/capabilities=profile:control,boot_option:local
Error: only 0 of 5 requested ironic nodes are tagged to profile compute (for flavor compute)
Recommendation: tag more nodes using ironic node-update <NODE ID> replace properties/capabilities=profile:compute,boot_option:local
Not enough nodes - available: 0, requested: 6
Configuration has 3 errors, fix them before proceeding. Ignoring these errors is likely to lead to a failed deploy.
However, I configured 1 node to use the control profile and 5 to use the compute profile. For example:
[stack@rhosp-1-director ~]$ openstack baremetal node show 4e153e0a-4c7b-4ee9-afb5-9036e263949b|grep prop
| properties | {u'cpu_arch': u'x86_64', u'root_device': {u'serial': u'600508b1001c7b0731bc32edbb3a8369'}, u'cpus': u'48', u'capabilities': u'profile:control,boot_option:local', u'memory_mb': u'131072', u'local_gb': u'744'} |
[stack@rhosp-1-director ~]$ openstack baremetal node show 4989038d-de10-4365-8051-44fd42fd0ec7|grep prop
| properties | {u'cpu_arch': u'x86_64', u'root_device': {u'serial': u'600508b1001c73b9fa55f385cd1a4008'}, u'cpus': u'48', u'capabilities': u'profile:compute,boot_option:local', u'memory_mb': u'131072', u'local_gb': u'744'} |
Another thing is that the following command yields no output:
openstack overcloud profiles list
I am following their manual from https://access.redhat.com/documentation/en/red-hat-openstack-platform/10-beta/single/director-installation-and-usage/#sect-Registering_Nodes_for_the_Overcloud step by step, so I don't know what I'm doing wrong.
The problem ended up being ironic automated cleaning. Introspection never completed OK; I'm not sure why, but disabling automated cleaning in ironic.conf right after the undercloud install, followed by a reboot (for all ironic services to pick up this property), followed by the next steps, was successful (including introspection).
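For reference, a sketch of that change (assuming it goes in the [conductor] section of the undercloud's /etc/ironic/ironic.conf; the option name is from the ironic documentation):
# /etc/ironic/ironic.conf
[conductor]
automated_clean = false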

Link aggregation and status of network interfaces in "ipadm" command

I am again rephrasing the issue that we are facing:
We are creating link aggregations [dlmp groups] over the interfaces net0, net5, and net2:
# dladm create-aggr -m dlmp -l net0 -l net5 -l net2 aggr1
Setting probe targets for aggr1:
# dladm set-linkprop -p probe-ip=+ aggr1
Setting failure detection time:
# dladm set-linkprop -p probe-fdt=15 aggr1
After this, we create an IP interface over the aggregation as follows:
# ipadm create-ip aggr1
And assign an address to it:
# ipadm create-addr -T static -a x.x.x.x/y aggr1/addr
Then we check the status using dladm and ipadm; everything seems up and running.
Then we tested a scenario where we detached the cables from the above network interfaces, but what we got is as follows:
# dladm show-aggr -x
LINK       PORT       SPEED    DUPLEX    STATE    ADDRESS             PORTSTATE
traf0      --         100Mb    unknown   up       0:10:e0:5b:69:1     --
           net0       100Mb    unknown   down     0:10:e0:5b:69:1     attached
           net5       100Mb    unknown   down     a0:36:9f:45:de:9d   attached
The first issue is that we are getting the state of link "traf0" as up in the above command output; secondly, in the output of "ipadm":
traf0        ip      ok    --    --
traf0/addr   static   ok    --    7.8.0.199/16
We are getting the status of traf0 as ok.
So here I have a query: can't we have a configuration where we would get the right status of traf0 in both the dladm and ipadm output?
[One more thing to add here: when we don't assign any IP to this traf0 aggregation, then on detaching the cables we get the right output from the dladm command.]
Apart from this configuration, we are using these aggregations as VNICs in zones. There, too, we are getting the status of these links as up in the ipadm command output [after detaching the cables].
A small update:
We have set the value of the "TRACK_INTERFACES_ONLY_WITH_GROUPS" parameter in /etc/default/mpathd to no, and we now get the state of "traf0" in the ipadm command as failed, but we still get traf0/addr as ok:
traf0        ip      failed    --    --
traf0/addr   static   ok        --    7.8.0.199/16
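For reference, a sketch of that mpathd change (in.mpathd rereads /etc/default/mpathd on SIGHUP; using pkill to deliver the signal is an assumption about applying it without a reboot):
# grep TRACK /etc/default/mpathd
TRACK_INTERFACES_ONLY_WITH_GROUPS=no
# pkill -HUP in.mpathd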