Slurm: how to use all cores available to the node?

I'm working with a large computing cluster with a SLURM workload manager that has four different subsections: we'll call them C1, C2, C3, and C4. Nodes in C1 and C2 have 28 cores, whereas those in C3 and C4 have 40 and 52 cores, respectively. I would like to be able to use all cores per node, but when I submit a job to the queue, I have no idea to which subsection it will be assigned and therefore don't know how many cores will be available. Is there a variable in SLURM to plug into --ntasks-per-node that will tell it to use all available cores on the node?

If you request a full node, with --nodes=1 --exclusive, you will get access to all CPUs (which you can check with cat /proc/$$/status|grep Cpus). The number of CPUs available will be given by the SLURM_JOB_CPUS_PER_NODE environment variable.
But the number of tasks will be one, so you might have to adjust how you start your program and set the number of CPUs explicitly, for instance with an Open MPI program a.out:
mpirun -np $SLURM_JOB_CPUS_PER_NODE ./a.out
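Put together, a minimal submission script (a sketch only, assuming the Open MPI binary ./a.out from above) could look like this:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
# SLURM_JOB_CPUS_PER_NODE reflects whichever node was actually allocated,
# so the same script adapts to the 28-, 40- and 52-core nodes.
mpirun -np "$SLURM_JOB_CPUS_PER_NODE" ./a.out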

Related

Parallel execution with .NET TPL in Kubernetes pods

The .NET Task Parallel Library (TPL) makes it possible to parallelize the execution of routines on multicore CPUs.
So, a collection of independent workloads can be completed with a significant performance improvement if it is coded as a parallel execution instead of a sequential iteration.
A machine with 2 cores can execute 2 computations at the same time.
What happens when the code runs in kubernetes?
Let's suppose, for instance, that we spawn a collection of workflows in parallel using TPL (e.g. Parallel.For) in 2 pods with a 1500m CPU allocation each:
in a cluster of two 2-core nodes, 1 pod per node, or
in a single 4-core node
How many parallel instances will be running the workflow in the 2 scenarios?
According to the documentation, Environment.ProcessorCount would return 2 in the pods, so I could assume that K8s exposes the rounded number of cores (2 in this case) but splits the allocation (so each pod would get 2 cores worth 750m each of the node's cores). This would end up running 4 parallel executions (2 per node in scenario 1, and all on the single node in scenario 2) at 750m each, for an aggregated computing power of 3 cores (1500m + 1500m as per the configuration).
Is my assumption correct?
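For reference, one way to check what the runtime actually sees inside such a pod is to inspect the CPU cgroup limits from a shell in the container (a sketch; the paths assume cgroup v1 and that the 1500m figure is set as a CPU limit, and they differ under cgroup v2):

# logical CPUs visible inside the pod (typically the node's cores, not the limit)
nproc
# CFS quota and period derived from the CPU limit; 150000 / 100000 corresponds to 1500m
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us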

Pin Kubernetes pods/deployments/replica sets/daemon sets to run on specific cpu only

I need to restrict an app/deployment to run on specific CPUs only (say 0-3, or just 1 or 2, etc.). I found out about the CPU Manager and tried to implement it with the static policy, but I am not able to achieve what I intend to.
I tried the following so far:
Enabled cpu manager static policy on kubelet and verified that it is enabled
Reserved the cpu with --reserved-cpus=0-3 option in the kubelet
Ran a sample nginx deployment with limits equal to requests and an integer CPU value, i.e. the Guaranteed QoS class is ensured, and I was able to validate the CPU affinity with taskset -c -p $(pidof nginx)
So this restricts my nginx app to run on all CPUs other than the reserved ones (0-3), i.e. if my machine has 32 CPUs, the app can run on any of CPUs 4-31, and so can any other apps/deployments that will run. As I understand it, the reserved CPUs 0-3 are kept for system daemons, OS daemons, etc.
My questions-
Using the Kubernetes CPU Manager features, is it possible to pin a certain CPU to an app/pod (in this case, my nginx app) so that it runs on a specific CPU only (say 2, or 3, or 4-5)? If yes, how?
If point number 1 is possible, can we perform the pinning at the container level too? Say Pod A has two containers, Container B and Container D. Is it possible to pin CPUs 0-3 to Container B and CPU 4 to Container D?
If none of this is possible using the Kubernetes CPU Manager, what alternatives are available at this point in time, if any?
As I understand your question, you want to dedicate a specific set of CPUs to each app/pod.
From what I have searched, I was only able to find some documentation that might help, and a GitHub topic that I think describes a workaround for your problem.
As a disclaimer: based on what I have read, searched and understood, there is no direct solution for this issue, only workarounds. I am still searching further.
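One such workaround (a sketch only; the CPU list 2-3 is an example, and Kubernetes is not aware of the change, so it may be undone when the pod is restarted or rescheduled) is to pin the container's process manually from the node with taskset:

# main PID of the containerized nginx on the node (pidof may list several; take the first)
PID=$(pidof nginx | awk '{print $1}')
# restrict that process and all of its threads to CPUs 2-3
sudo taskset -a -c -p 2-3 "$PID"
# verify the new affinity
taskset -c -p "$PID"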

How do you run as many tasks as will fit in memory, each running on all cores, in Windows HPC?

I'm using Microsoft HPC Pack 2012 to run video processing jobs on a Windows cluster. A run is organized as a single job with hundreds of independent tasks. If a single task is scheduled on a node, it uses all cores, but not at nearly 100%. One way to increase CPU utilization is to run more than one task at a time per node. I believe in my use case, running each task on every core would achieve the best CPU utilization. However, after lots of trying I have not been able to achieve it. Is it possible?
I have been able to run multiple tasks on the same node on separate cores. I achieved this by setting the job UnitType to Node, setting the job and task types to IsExclusive = False, and setting the MaximumNumberOfCores on a job to something less than the number of cores on the machine. For simplicity, I would like to run one task per core, but typically this would exhaust the memory budget. So, I have set EstimatedProcessMemory to the typical memory usage.
This works, but every set of parameters I have tried leaves resources on the table. For instance, let's say I have a machine with 12 cores, 15 GB of free RAM, and each task consumes 2 GB. Then I can run 7 tasks on this machine. If I set the task MaximumNumberOfCores to 1, I only use 7 of my 12 cores. If I set it to 2 and set EstimatedProcessMemory to 2048, HPC interprets this as the memory PER CORE, so I only run 3 tasks on 2 cores and 3 tasks on 1 core, i.e. 9 of my 12 cores. And so on.
Is it possible to simply run as many tasks as will fit in memory, each running on all of the cores? Or to dynamically assign the number of cores per task in a way that doesn't have the shortcomings mentioned above?
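For reference, the memory-bound arithmetic from the example above, written out as a small shell sketch (the numbers are the ones quoted in the question):

CORES=12; RAM_GB=15; TASK_GB=2
# memory allows floor(15 / 2) = 7 concurrent tasks on this node
MAX_TASKS=$(( RAM_GB / TASK_GB ))
# with MaximumNumberOfCores=1 per task, those 7 tasks occupy only 7 cores
echo "$MAX_TASKS tasks fit in RAM, $(( CORES - MAX_TASKS )) cores left idle"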

Does assigning more nodes to a job on a SLURM server increase available RAM?

I am working with a program that needs a lot of RAM. Currently I am running it on a SLURM cluster. Each node has 125 GB of RAM. When submitting the job to a single node, it eventually fails as it runs out of memory. My rather naive question, as I am new to working on servers, is:
Does assigning more nodes with the --nodes flag increase the available RAM for the submitted job?
For example:
When assigning 10 nodes instead of 1, with the command below, the program fails at the same point as with one node.
#SBATCH --nodes=10
Is there some other way to combine RAM from multiple nodes for a single job?
Any and all advice is welcome!
That depends on your program, but most likely no.
To use multiple nodes on a Slurm cluster (or any cluster, for that matter), your program needs to be set up in a very specific way, i.e. it needs inter-node communication. This is usually done via MPI, and the whole program has to be designed around it.
So if your program uses MPI, it may be able to split the workload over several nodes. Even then, that does not guarantee lower memory usage per node, as that is usually not the goal of such parallelization.
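If the program cannot be distributed with MPI, the usual options are a node with more RAM or, at the very least, asking Slurm explicitly for all of the memory on one node, for example (a sketch; ./my_program stands in for the actual executable):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --mem=0
# --mem=0 requests all memory available on the allocated node;
# it does not combine memory from several nodes.
./my_program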

Running multiple containers on the same Service Fabric node

I have a Windows Service Fabric node with 4 cores and I want to host 3 containerized stateless services on it, where each Windows container is allocated 1 core to read a message from a queue and process it. I ran some experiments and got these results:
1 container running on the node: message takes ~18 sec to be processed, avg cpu usage per container: 24.7%, memory usage: 1 GB
2 containers running on the node: message takes ~25 sec to be processed, avg cpu usage per container: 24.4%, memory usage: 1 GB
3 containers running on the node: message takes ~35 sec to be processed, avg cpu usage per container: 24.6%, memory usage: 1 GB
I thought that containers are supposed to be isolated, and I expected the processing time to stay constant at ~18 sec regardless of the number of containers, but in this case it seems that adding a container affects the processing time in the other containers. Each container is set to use 1 core, so they shouldn't be encroaching on each other's resources, and the CPU is not reaching full utilization. Even if the CPU were the bottleneck here, I'd expect that at least 2 containers would be able to run with ~18 sec processing times.
Is there a logical explanation for these results? Is it not possible to run multiple containers on the same Service Fabric host without affecting each other's performance when there are enough compute resources? How big could the Service Fabric overhead possibly be when running multiple containers on the same node?
Thanks!
Your container is not only using CPU, but also memory and I/O (disk, network), which can also become bottlenecks.
To see the overhead of SF, run the containers outside of SF and see if it makes a difference.
Use a machine with more memory, and after that, try using an SSD drive. See if that increases performance.
To avoid per-process overhead, consider using a single container and having multiple threads do the message processing in parallel. Make sure to assign it 3 cores.
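For the first suggestion, a quick comparison could look like this (a sketch; my-worker-image and the CPU counts are placeholders), running the same image directly with Docker on a comparable machine and timing the message processing:

# one container limited to 1 CPU, as in the Service Fabric setup
docker run --rm --cpus=1 my-worker-image
# then three of these side by side, or a single container given 3 CPUs
docker run --rm --cpus=3 my-worker-image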