Questions about HPC on SLURM - hpc

I have a few questions about HPC. I have a code with serial and parallel sections. The parallel sections work on different chunks of memory, and at some point they communicate with each other. For this I used MPI on our cluster, where SLURM is the resource manager. Below are the specifications of a node in the cluster.
Specifications of a node:
Processor: 2x Intel Xeon E5-2690 (16 cores / 32 threads in total)
Memory : 256 GB 1600MHz ECC
Disk : 2 x 600 GB 2.5" SAS (configured with raid 1)
Questions:
1) Do all cores on a node share the same memory (RAM)? If yes, do all cores access memory at the same speed?
2) Consider a case:
--nodes=1
--ntasks-per-node=1
--cpus-per-task=16 (all cores on a node)
If all cores share the same memory (depending on the answer to question 1), will all cores be used, or will 15 of them sit idle since OpenMP (for shared memory) is not used?
3) If the required memory is less than the total memory of a node, isn't it much better to use a single node, use OpenMP to achieve core-level parallelism, and avoid the time lost communicating between nodes? That is, use this
--nodes=1
--ntasks-per-core=1
instead of this:
--nodes=16
--ntasks-per-node=1
The rest of the questions are related to statements in this link.
Use core allocation if your application is CPU bound; the more processors you can throw at it the better!
Does this statement mean that --ntasks-per-core is good when cores don't access RAM too often?
Use socket allocation if memory access is what bottlenecks your application’s performance. Since how much data can come in from memory is what limits the speed of the job, running more tasks on the same memory bus won’t result in speed-up since all of those tasks are fighting over the path to memory.
I just don't get this. As far as I know, all sockets and the cores on those sockets share the same memory. So why is the --ntasks-per-socket option even available?
Use node allocation if some node-wide resource is what bottlenecks your application. This is the case with applications that are relying heavily on access to disk or to networks resources. Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.
Does this mean that, if the memory required is more than the total RAM of a single node, then it's better to use multiple nodes?

In order:
Yes, all cores on a node share the same memory, but not usually at the same speed. Each processor (in your configuration you have 2 processors, or sockets) has memory that is 'closer' to it; this is the NUMA (Non-Uniform Memory Access) layout. The Linux kernel will usually try to allocate memory on the node nearest to the core that asks for it, so this is not something a user application normally has to worry about.
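If you want to see this layout for yourself, two commonly available Linux tools report it directly (just an illustration; the exact output depends on the node):

# Sockets, cores per socket, and NUMA nodes as the kernel sees them
lscpu | grep -iE 'socket|core|numa'

# Each NUMA node's CPUs, memory size, and relative access "distances"
numactl --hardware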
If it is a serial job, then yes, 15 cores will sit idle. If your job uses MPI, then it can use the other cores on the same node; in fact, MPI within a single node is usually much faster than MPI stretched across multiple nodes.
You can use OpenMP or MPI on a single node. I'm not sure about the speed difference, but if you are already familiar with MPI, I would just stick with that; the difference probably isn't that big. The difference between running MPI on a single node and on multiple nodes, however, is going to be large: on a single node the ranks communicate through shared memory, which is significantly faster than going across the network.
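As a rough sketch of that single-node setup under SLURM (the executable name my_app and the time limit are placeholders; adapt partitions and modules to your site):

#!/bin/bash
# Keep all MPI ranks on a single node, one rank per physical core,
# so communication goes through shared memory instead of the network.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00

srun ./my_app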
Use core allocation if your application is CPU bound; the more processors you can throw at it the better!
This is likely targeting OpenMP or single node parallel jobs.
Use socket allocation if memory access is what bottlenecks your application’s performance. Since how much data can come in from memory is what limits the speed of the job, running more tasks on the same memory bus won’t result in speed-up since all of those tasks are fighting over the path to memory.
See the answer to 1. Though it is all the same memory, each socket usually has its own memory controller and bus to its local memory, which is exactly why the --ntasks-per-socket option is useful for memory-bound jobs.
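As a hedged illustration of what the socket option buys you (my_app is again a placeholder), this sketch pins two memory-hungry ranks so that each gets its own socket and memory controller:

#!/bin/bash
# Two memory-bound ranks on one node, one per socket, so they do not
# fight over a single socket's path to memory.
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=8

srun ./my_app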
Use node allocation if some node-wide resource is what bottlenecks your application. This is the case with applications that are relying heavily on access to disk or to networks resources. Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.
If you need more RAM than a single node can provide, then you have no choice but to divide your program and use MPI.
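A hedged sketch of that multi-node variant (node count, time limit and my_app are placeholders) only changes how the ranks are spread out:

#!/bin/bash
# The working set no longer fits in one node's 256 GB, so split the
# 64 MPI ranks across 4 nodes; communication now crosses the interconnect.
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
#SBATCH --time=04:00:00

srun ./my_app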

Related

How many threads for a kubernetes application?

I have a multithreaded application which does a CPU-intensive task that I want to run on Kubernetes.
The node I'm using has 56 cores and I set a request and limit of 2 cores for my pod.
Since it's CPU-intensive, typically there would be no point in having more threads than the # of cores (with hyperthreading, perhaps twice as many threads as cores), so I would allocate 2-4 threads and call it a day.
However, AFAIK, Kubernetes doesn't guarantee core affinity, so in the worst-case scenario my 2 cores' worth of CPU time could be spread evenly across all 56 cores, with each core giving my pod 2/56th of its time in parallel. If this happens and I allocate only 4 threads, then at least 52 out of the 56 cores will be sitting idle as far as my workload is concerned.
If I understand it correctly, this problem is not unique to Kubernetes and applies to any virtualized environment where the hardware resources are shared.
What is the best practice when it comes to dealing with this potential worst-case scenario? Do you ignore it and assume you have high locality, or do you plan for the worst, or something in between?
Thread scheduling, including affinity, is handled by the OS; Kubernetes usually doesn't affect it. If it's important to your workload, you can try the static CPU manager policy in k8s to get exclusive CPU cores for your app.
You may also set a pod CPU limit of 56 and run 56 threads to utilize all of the node's cores. Assuming all other pods on this node set correct CPU requests, this should not affect them negatively.
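A minimal sketch of the exclusive-core route, assuming the node's kubelet runs the static CPU manager policy (the pod name and image below are made up): the pod must be in the Guaranteed QoS class, with a whole-number CPU count and requests equal to limits.

# Guaranteed QoS + integer CPU is what lets the static policy pin cores.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cpu-pinned-worker               # hypothetical name
spec:
  containers:
  - name: worker
    image: example.com/cpu-worker:latest    # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: "2Gi"
      limits:
        cpu: "2"                        # integer, equal to the request
        memory: "2Gi"
EOF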

Multi core machine - cpu load metric

In a multi-core machine, what is the best metric to understand whether the CPU is loaded or not?
I have a web application that sends a POST request to an Apache CGI server. The CGI server loops over the POST data and launches a Perl process for each item in the loop. Since requests from clients end up hitting a single endpoint, I am concerned that I may end up creating more processes than my server can handle. Hence I want to understand which system metric I should check before launching a new process from the loop.
Note: I have a 20 core machine.
The reason the answer isn't easy to find is that it depends on the nature of your processes and on which system constraint is your limiting factor.
For CPU-intensive work, the metric to look at is load average: a measure of the number of processes in a runnable state. Very roughly, if the load average equals the number of cores, you're running your CPUs at maximum.
However, it's increasingly the case that CPU is not the limiting factor. You may have a finite amount of memory, and memory-hungry processes will consume it. 'Spare' memory is used for caching, so filling it all up actually starts to slow things down (because you have a smaller cache), and overshooting the available memory will either cause swapping or invoke the OOM killer.
But since you mention Apache and the web, chances are pretty good that your network pipe is the limiting factor; controlling bandwidth from the local host is actually surprisingly hard.
And then there's disk IO - which may also be a factor - I think that's unlikely for a web server, because your outbound network will usually be a tighter limit.
It all depends on what your processes are doing: whether they're lightweight 'helpers' that are mostly idle, or heavyweight 'grinders' that each introduce noticeable load.
So the best answer I can give is a very vague estimate: if your processes are CPU-intensive, cap them at 2 per core. If your processes are memory-hungry, aim to consume about 50% of your system RAM. If your processes are IO-intensive, aim to consume about 50% of your IO bandwidth (either network or disk).
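As a rough sketch of that 'check before you fork' idea for the CGI loop (the 2-per-core cap is the estimate above; process_item.pl and the $1 argument are hypothetical stand-ins for your Perl worker):

#!/bin/bash
# Only fork another worker while the 1-minute load average stays below
# roughly 2 runnable processes per core.
item="$1"
cores=$(nproc)
max=$((cores * 2))
load=$(cut -d ' ' -f1 /proc/loadavg)   # e.g. "12.34"
load_int=${load%%.*}                   # integer part is enough for a cap check

if [ "$load_int" -lt "$max" ]; then
    ./process_item.pl "$item" &        # hypothetical Perl worker
else
    echo "load $load >= cap $max, deferring this item" >&2
fi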

Mongodb in Docker: numactl --interleave=all explanation

I'm trying to create Dockerfile for in-memory MongoDB based on official repo at https://hub.docker.com/_/mongo/.
In docker-entrypoint.sh I've encountered:
numa='numactl --interleave=all'
if $numa true &> /dev/null; then
    set -- $numa "$@"
fi
Basically it prepends numactl --interleave=all to the original docker command when numactl is available.
But I don't really understand this NUMA policy thing. Can you please explain what NUMA really means, and what --interleave=all stands for?
And why do we need to use it to create MongoDB instance?
The man page mentions:
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
This isn't available on all architectures, which is why issue 14 made sure to call numactl only on NUMA machines.
As explained in "Set default numa policy to “interleave” system wide":
It seems that most applications that recommend explicit numactl definition either make a libnuma library call or incorporate numactl in a wrapper script.
The interleave=all policy alleviates the kind of issue met by apps like Cassandra (a distributed database for managing large amounts of structured data across many commodity servers):
By default, Linux attempts to be smart about memory allocations such that data is close to the NUMA node on which it runs. For big database type of applications, this is not the best thing to do if the priority is to avoid disk I/O. In particular with Cassandra, we're heavily multi-threaded anyway and there is no particular reason to believe that one NUMA node is "better" than another.
Consequences of allocating unevenly among NUMA nodes can include excessive page cache eviction when the kernel tries to allocate memory - such as when restarting the JVM.
For more, see "The MySQL “swap insanity” problem and the effects of the NUMA architecture"
Without NUMA interleaving:
In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward.
The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node.
With NUMA interleaving:
An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:
# numactl --interleave all command
I mentioned in the comments that NUMA enumerates the hardware to understand the physical layout, and then divides the processors (not cores) into "nodes".
With modern PC processors, this means one node per physical processor, regardless of the number of cores present.
That is a bit of an over-simplification, as Hristo Iliev points out:
AMD Opteron CPUs with larger number of cores are actually 2-way NUMA systems on their own with two HT (HyperTransport)-interconnected dies with own memory controllers in a single physical package.
Also, Intel Haswell-EP CPUs with 10 or more cores come with two cache-coherent ring networks and two memory controllers and can be operated in a cluster-on-die mode, which presents itself as a two-way NUMA system.
It is wiser to say that a NUMA node is some cores that can reach some memory directly without going through a HT, QPI (QuickPath_Interconnect), NUMAlink or some other interconnect.
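To check whether the interleave policy is actually in effect for a running mongod, a few commands from the numactl package can help (a sketch; numastat ships alongside numactl on most distributions):

# Hardware layout: nodes, their CPUs, memory sizes and inter-node distances
numactl --hardware

# The memory policy the current shell (and its children) would inherit
numactl --show

# Per-NUMA-node page counts for the mongod process; with --interleave=all
# the allocations should grow roughly evenly across the nodes
numastat -p "$(pgrep -o mongod)"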

Can memcached make full use of multi-core?

Is memcached capable of making full use of multi-core? Or is there any way tuning this?
memcached has a "-t" option:
-t <threads>
    Number of threads to use to process incoming requests. This option is only
    meaningful if memcached was compiled with thread support enabled. It is
    typically not useful to set this higher than the number of CPU cores on
    the memcached server. The default is 4.
So, I believe it can use all your CPU cores, provided it was compiled with the corresponding option.
memcached is multi-threaded by default and has no problem saturating many cores. It's a bit harder to saturate all cores on more massively parallel boxes (e.g. a 256-core CMT box) just because it gets harder to get the data in and out of the network.
If you find areas where some sort of contention is preventing you from saturating cores, file a bug or start a discussion.
Based on this research by Intel, memcached v1.6 beta cannot scale well on a multicore system. Their experiments show that as the core count increases from 1 to 8, maximum throughput (with a median RTT < 1 ms SLA) only doubles.
CAREFUL. This terminology is easy to confuse. The memcached man page says the -t option is only useful up to the number of cores, and it helps to keep threads and processes distinct. Threads share their process's memory and are distinguished by their own instruction pointer and stack; processes share nothing unless sharing is explicitly set up ahead of time, via the OS. On Linux, though, the kernel schedules threads onto cores just as it does processes, so a single multithreaded process can run on several cores at once and show more than 100% CPU usage.
Overall, I want MORE CLARITY from the memcached people about whether their server is multi-process or multi-threaded, and thus whether a single instance can use more than 100% of one CPU.
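A quick way to check this on your own machine (memory size and port below are arbitrary): memcached runs as one multi-threaded process, and you can count its kernel threads directly.

# One memcached process, one worker thread per core
memcached -d -m 1024 -p 11211 -t "$(nproc)"

# NLWP is the number of threads in the process; expect roughly the -t value
# plus a few housekeeping threads, all inside a single PID
ps -C memcached -o pid,nlwp,cmd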

What are the differences between multi-CPU, multi-core and hyper-thread?

Could anyone explain to me the differences between multi-CPU, multi-core, and hyper-thread? I am always confused about these differences, and about the pros/cons of each architecture in different scenarios.
Here is my current understanding after learning online and learning from others' comments.
I think hyper-threading is the most inferior technology among them, but cheap. Its main idea is to duplicate registers to save context-switch time;
Multi-processor is better than hyper-threading, but since the CPUs are on different chips, communication between them has higher latency than in multi-core, and using multiple chips brings more expense and more power consumption than multi-core;
Multi-core integrates all the CPUs on a single chip, so the latency of communication between different CPUs is greatly reduced compared with multi-processor. Since it uses a single chip to contain all the CPUs, it consumes less power and is less expensive than a multi-processor system.
Is this correct?
Multi-CPU was the first version: You'd have one or more mainboards with one or more CPU chips on them. The main problem here was that the CPUs would have to expose some of their internal data to the other CPUs so they wouldn't get in each other's way.
The next step was hyper-threading. One chip on the mainboard but it had some parts twice internally so it could execute two instructions at the same time.
The current development is multi-core. It's basically the original idea (several complete CPUs) but in a single chip. The advantage: Chip designers can easily put the additional wires for the sync signals into the chip (instead of having to route them out on a pin, then over the crowded mainboard and up into a second chip).
Supercomputers today are multi-CPU, multi-core: they have lots of mainboards, usually with 2-4 CPUs on each; each CPU is multi-core and each has its own RAM.
[EDIT] You got that pretty much right. Just a few minor points:
Hyper-threading keeps track of two contexts at once in a single core, exposing more parallelism to the out-of-order CPU core. This keeps the execution units fed with work, even when one thread is stalled on a cache miss, branch mispredict, or waiting for results from high-latency instructions. It's a way to get more total throughput without replicating much hardware, but if anything it slows down each thread individually. See this Q&A for more details, and an explanation of what was wrong with the previous wording of this paragraph.
The main problem with multi-CPU is that code running on them will eventually access the RAM. There are N CPUs but only one bus to access the RAM. So you must have some hardware which makes sure that a) each CPU gets a fair amount of RAM access, b) that accesses to the same part of the RAM don't cause problems and c) most importantly, that CPU 2 will be notified when CPU 1 writes to some memory address which CPU 2 has in its internal cache. If that doesn't happen, CPU 2 will happily use the cached value, oblivious to the fact that it is outdated.
Just imagine you have tasks in a list and you want to spread them to all available CPUs. So CPU 1 will fetch the first element from the list and update the pointers. CPU 2 will do the same. For efficiency reasons, both CPUs will not only copy the few bytes into the cache but a whole "cache line" (whatever that may be). The assumption is that, when you read byte X, you'll soon read X+1, too.
Now both CPUs have a copy of the memory in their cache. CPU 1 will then fetch the next item from the list. Without cache sync, it won't have noticed that CPU 2 has changed the list, too, and it will start to work on the same item as CPU 2.
This is what effectively makes multi-CPU so complicated. Side effects of this can lead to a performance which is worse than what you'd get if the whole code ran only on a single CPU. The solution was multi-core: You can easily add as many wires as you need to synchronize the caches; you could even copy data from one cache to another (updating parts of a cache line without having to flush and reload it), etc. Or the cache logic could make sure that all CPUs get the same cache line when they access the same part of real RAM, simply blocking CPU 2 for a few nanoseconds until CPU 1 has made its changes.
[EDIT2] The main reason why multi-core is simpler than multi-cpu is that on a mainboard, you simply can't run all wires between the two chips which you'd need to make sync effective. Plus a signal only travels 30cm/ns tops (speed of light; in a wire, you usually have much less). And don't forget that, on a multi-layer mainboard, signals start to influence each other (crosstalk). We like to think that 0 is 0V and 1 is 5V but in reality, "0" is something between -0.5V (overdrive when dropping a line from 1->0) and .5V and "1" is anything above 0.8V.
If you have everything inside of a single chip, signals run much faster and you can have as many as you like (well, almost :). Also, signal crosstalk is much easier to control.
You can find some interesting articles about dual CPU, multi-core and hyper-threading on Intel's website or in a short article from Yale University.
I hope you find here all the information you need.
In a nutshell: a multi-CPU or multi-processor system has several processors. A multi-core system is a multi-processor system with several processors on the same die. With hyperthreading, multiple hardware threads can run on the same processor core (the context-switch time between these threads is very small).
Multi-processors have been around for 30 years now, but mostly in labs. Multi-core is the new popular form of multi-processing. Server processors nowadays implement hyperthreading alongside multiple processors.
The wikipedia articles on these topics are quite illustrative.
Hyperthreading is a cheaper and slower alternative to having multiple cores
The Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015, section 8.7 "INTEL HYPER-THREADING TECHNOLOGY ARCHITECTURE", describes HT briefly. It contains a diagram of the architecture.
TODO: how much slower is it, on average, in real applications?
Hyperthreading is possible because modern single CPU cores already execute multiple instructions at once via the instruction pipeline: https://en.wikipedia.org/wiki/Instruction_pipelining
The instruction pipeline is a separation of functions inside of a single core to ensure that each part of the circuit is used at any given time: reading memory, decoding instructions, executing instructions, etc.
Hyperthreading separates functions further by using:
a single backend, which actually runs the instructions with its pipeline.
Dual core has two backends, which explains the greater cost and performance.
two front-ends, which take two streams of instructions and order them in a way to maximize pipelining usage of the single backend by avoiding hazards.
Dual core would also have 2 front-ends, one for each backend.
There are edge cases where instruction reordering produces no benefit, making hyperthreading useless. But on average it produces a significant improvement.
Two hyperthreads in a single core share more cache levels (typically L1 and L2) than two different cores, which share only L3; see:
Multiple threads and CPU cache
How are cache memories shared in multicore Intel CPUs?
The interface that each hyperthread exposes to the operating system is similar to that of an actual core, and both can be controlled separately. Thus cat /proc/cpuinfo shows me 4 processors, even though I only have 2 cores with 2 hyperthreads each.
Operating systems can however take advantage of knowing which hyperthreads are on the same core to run multiple threads of a given program on a single core, which might improve cache usage.
This LinusTechTips video contains a light-hearted non-technical explanation: https://www.youtube.com/watch?v=wnS50lJicXc
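On Linux you can see which logical CPUs are hyperthread siblings on the same physical core (a small sketch; the numbering depends on the machine):

# One row per logical CPU, showing the core, socket and NUMA node it maps to
lscpu --extended=CPU,CORE,SOCKET,NODE

# Logical CPUs listed on the same line here share one physical core
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -u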
Multi-CPU is a bit like multicore, but communication can only happen through RAM, not L3 cache
This means that if possible, you want to partition tasks that use the same memory a lot for each separate CPU.
E.g. the SBI-7228R-T2X blade server contains 4 CPUs, 2 on each node. In the product photo you can see 4 sockets for the CPUs, each covered by a heat sink, with one left open.
I think that across nodes they don't even share RAM and must communicate through some kind of networking, thus representing one further step up in the hyperthread/multi-core/multi-CPU hierarchy. TODO confirm:
https://scicomp.stackexchange.com/questions/7530/difference-between-nodes-and-cpus-when-running-software-on-a-cluster
SLURM nodes, tasks, cores, and cpus
https://www.quora.com/In-High-Performance-Computing-what-exactly-is-the-difference-between-the-terms-%E2%80%9Ccores-%E2%80%9D-%E2%80%9Cprocessors-%E2%80%9D-%E2%80%9Cnodes-%E2%80%9D-and-%E2%80%9Cclusters%E2%80%9D