Minimizing inter-core Communication in a NUMA architecture - multicore

Can anyone highlight ways in which inter-core communication can be reduced in a NUMA multicore architecture? Case study: the Intel Nehalem microarchitecture.

The Nehalem processor uses QuickPath Interconnect (QPI) for inter-processor/node/package communication. In a NUMA system each node has its own local memory, which is shared with the other nodes in the system. When the working set of a program fits in the L1 cache and is read-only, it does not matter much which NUMA node owns the memory. Communication between NUMA nodes is necessary when a core takes a cache miss and the memory is owned by another node.
However, that does not automatically make memory owned by another node slower to access: it depends on whether the other node has the data cached in the cache associated with its local memory, which Intel calls the Last Level Cache (LLC). Access by a core to a memory location local to its own node is faster than access to memory owned by another node only if both nodes miss in their LLC. Hitting in the LLC of another node is actually faster than going all the way to memory on the local node, because memory is so much slower than the CPU and QPI is optimized for this sort of communication.
Most systems don't bother trying to reduce inter-processor communication because, as you can imagine, it is not an easy problem: it requires pinning threads to cores and binding each thread's memory working set to the local memory of that node. You can read more about this in Ulrich Drepper's paper "What Every Programmer Should Know About Memory"; search for NUMA. In that paper Drepper refers to QPI as the Common System Interface (CSI), which was Intel's name for QPI before it was announced.
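In practice, reducing cross-node traffic comes down to exactly that: keep a thread on the cores of one node and keep its working set in that node's local memory. As a minimal sketch of the idea using libnuma on Linux (node 0 is just an example; compile with -lnuma):

#include <numa.h>      /* libnuma */
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int node = 0;                              /* example: pin everything to node 0 */
    numa_run_on_node(node);                    /* restrict this thread to the cores of node 0 */
    size_t len = 64UL * 1024 * 1024;
    char *buf = numa_alloc_onnode(len, node);  /* back the working set with node-0 memory */
    memset(buf, 0, len);                       /* touch the pages so they are actually placed */
    /* ... work on buf from this thread: misses are served by the local memory controller ... */
    numa_free(buf, len);
    return 0;
}

The same effect can be had without code changes by launching the program under numactl, e.g. numactl --cpunodebind=0 --membind=0 ./app.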

What does the TCM connected to the Icache do in this RISC-V core?

In the middle of this page (https://github.com/ultraembedded/riscv) there is a block diagram of the core. I do not really understand what the TCM is doing in the same block as the Icache. Is it an optional component inside the CPU?
Some embedded systems provide dedicated memory for code and/or for data.  On some of these systems, Tightly-Coupled Memory serves as a replacement for the (instruction) cache, while on other such systems this memory is in addition to and alongside a cache, covering a certain portion of the address space.  This dedicated memory may be on the same chip as the processor.
This memory could be some kind of ROM, or other memory that is initialized somehow prior to boot.  In any case, TCM typically isn't backed by main memory, so it doesn't suffer cache misses or need the associated circuitry, and it usually offers performance similar to that of a cache hit.
Some systems refer to this as Instruction Tightly Integrated Memory, ITIM, or Data Tightly Integrated Memory, DTIM.
When a system uses ITIM or DTIM, it performs more like a Harvard architecture than the Modified Harvard architecture of laptops and desktops.
The cache has no address space of its own. The CPU does not ask the cache for data; it simply asks for data at an address, and the memory system first checks whether that data is present in the cache. If it is, the data is fetched from the cache; if not, the request goes on to RAM. All the processor does is ask for data; it does not care where the data came from. In the case of TCM, the CPU can directly read and write it, because the TCM occupies a specific address range. Think of TCM as RAM that sits very close to the CPU.
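To make the "TCM has a specific address" point concrete, here is a small hypothetical C sketch; the base address and size are made up for illustration and would really come from the SoC's datasheet or linker script:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical address map: the real values come from the SoC documentation. */
#define TCM_BASE 0x00010000u
#define TCM_SIZE 0x8000u                /* e.g. 32 KiB of tightly-coupled memory */

static volatile uint32_t * const tcm = (volatile uint32_t *)TCM_BASE;

/* Copy hot data into the TCM with ordinary stores. This works because the TCM
 * occupies real addresses; there is no equivalent way to address a cache line. */
void copy_hot_data_to_tcm(const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words && i < TCM_SIZE / 4; i++)
        tcm[i] = src[i];
}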

Direct Memory Access

I have a basic question about DMA. When the CPU has relinquished the bus so that the DMA controller can carry on fetching/storing data, how does it continue processing?
I mean, even the CPU has to fetch its instructions and store results back to memory/IO through the bus, does it not?
CPUs have cache, so they can do a lot without any actual main-memory accesses. Even low-power systems tend to have caches, because driving signals off-chip costs enough energy that a cache pays for itself in energy saved by cache hits.
More importantly, DMA doesn't "take over" the RAM, or even necessarily saturate memory bandwidth. The CPU doesn't "relinquish the bus"; the memory controller accepts read/write requests from the CPU core(s) and from other system devices. Running a memory-heavy task on the CPU will slow down DMA, and vice versa, as the memory controller or system agent arbitrates access to memory, queuing read and write requests from all sources.
DMA is great for transfers that are still much slower than memory bandwidth. For example, SATA III is 6 Gbit/s, while main memory bandwidth for dual-channel DDR3-1600 is about 25 GByte/s. So programmed I/O would spend most of its time waiting for data from the SATA controller, not even bottlenecked on storing to RAM.
As an example of how the pieces fit together in a modern Intel x86 CPU, see this diagram of Intel Skylake's system architecture (including eDRAM as a memory-side cache). Sorry, I didn't find a simpler diagram showing just the cores and the system agent, but in a system without eDRAM the only thing to the right of the system agent is the memory controller, and everything else stays the same.
The memory controller is on-die, so the only off-chip connection in this diagram is the PCIe bus.
There are two basic DMA usage models. The first is when the CPU waits for the DMA to finish - a SYNC operation or a blocking DMA call. The other is when the CPU makes an ASYNC (non-blocking) DMA request. This lets the CPU continue with its regular control flow: it off-loads the transfer to the DMA engine so it can do something more important in the meantime.
If I understand your question correctly, and as Peter said, when the CPU has made a non-blocking DMA request and the DMA controller is actively doing something on the bus, the CPU can still carry out all of its regular operations, including accessing RAM, because the bus can carry multiplexed traffic. In other words, the bus can handle multiple masters at the same time.
Coherency and consistency, which make things a little more complicated, are generally maintained using the right programming paradigms, based on the hardware support.
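As a rough sketch of those two usage models in C: the dma_* functions below are hypothetical placeholders (stubbed with memcpy so the example is self-contained), not a real driver API; a real system would program the controller's registers or go through an OS framework such as Linux's dmaengine.

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified DMA interface. The stubs "complete" instantly via
 * memcpy; real hardware would be programmed through registers and would signal
 * completion with a status bit or an interrupt. */
static int dma_start_async(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);                 /* stand-in for the DMA engine's work */
    return 0;                              /* transfer handle */
}
static int  dma_is_done(int handle) { (void)handle; return 1; }
static void dma_wait(int handle)    { while (!dma_is_done(handle)) { /* spin or sleep */ } }

static void do_other_useful_work(void) { /* CPU work that mostly hits in cache */ }

int main(void)
{
    char src[64] = "hello", dst[64] = {0};

    /* SYNC / blocking model: the CPU simply waits for the transfer to finish. */
    int h = dma_start_async(dst, src, sizeof src);
    dma_wait(h);

    /* ASYNC / non-blocking model: the CPU overlaps useful work with the transfer. */
    h = dma_start_async(dst, src, sizeof src);
    while (!dma_is_done(h))
        do_other_useful_work();

    printf("%s\n", dst);
    return 0;
}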

Questions about HPC on SLURM

I have a few questions about HPC. I have a code with serial and parallel sections. The parallel sections work on different chunks of memory and at some point they communicate with each other. For this I used MPI on our cluster. SLURM is the resource manager. Below are the specifications of a node in the cluster.
Specifications of a node:
Processor: 2x Intel Xeon E5-2690 (16 cores / 32 threads in total)
Memory: 256 GB 1600 MHz ECC
Disk: 2 x 600 GB 2.5" SAS (configured with RAID 1)
Questions:
1) Do all cores on a node share the same memory (RAM)? If yes, do all cores access memory at the same speed?
2) Consider a case:
--nodes=1
--ntasks-per-node=1
--cpus-per-task=16 (all cores on a node)
If all cores share the same memory (depending on the answer to question 1), will all cores be used, or will 15 of them sit idle since OpenMP (for shared memory) is not used?
3) If the required memory is less than the total memory of a node, isn't it much better to use a single node, use OpenMP to achieve core-level parallelism, and avoid the time lost to communication between nodes? That is, use this:
--nodes=1
--ntasks-per-core=1
instead of this:
--nodes=16
--ntasks-per-node=1
The rest of the questions relate to statements in this link.
Use core allocation if your application is CPU bound; the more processors you can throw at it the better!
Does this statement mean that --ntasks-per-core is good when cores don't access RAM too often?
Use socket allocation if memory access is what bottlenecks your application’s performance. Since how much data can come in from memory is what limits the speed of the job, running more tasks on the same memory bus won’t result in speed-up since all of those tasks are fighting over the path to memory.
I just don't get this. As far as I know, all sockets and the cores on those sockets share the same memory. So why is the --ntasks-per-socket option even available?
Use node allocation if some node-wide resource is what bottlenecks your application. This is the case with applications that are relying heavily on access to disk or to networks resources. Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.
Does this mean that if the required memory is more than the total RAM of a single node, then it's better to use multiple nodes?
In order:
Yes, all cores share the same memory, but not usually at the same speed. Each processor (in your configuration you have 2 processors, or sockets) has memory that is 'closer' to it. The Linux kernel will usually attempt to allocate memory from the nearby node. This is not something a user application usually has to worry about.
If it is a serial job, then yes, 15 cores will sit idle. If your job uses MPI, then it can use the other cores on the same node. In fact, MPI on a single node is usually much faster than MPI stretched across multiple nodes.
You can use either OpenMP or MPI on a single node. I'm not sure about the speed difference, but if you are already familiar with MPI, I would just stick with that; the difference probably isn't that big. The difference between running MPI on a single node and running it across multiple nodes, however, is going to be large: MPI within a single node will be significantly faster.
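If you do want to combine the two on a single node, a minimal hybrid MPI + OpenMP sketch looks like the following (how ranks map to --ntasks and threads to --cpus-per-task depends on your SLURM/MPI configuration, so treat that mapping as an assumption):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI rank spawns OpenMP threads that share that rank's memory. */
    #pragma omp parallel
    printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}

With --nodes=1 --ntasks-per-node=1 --cpus-per-task=16 and OMP_NUM_THREADS=16 you would see one rank with 16 threads; with 16 single-threaded ranks on one node you get the pure-MPI layout discussed above.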
Use core allocation if your application is CPU bound; the more processors you can throw at it the better!
This is likely targeting OpenMP or single node parallel jobs.
Use socket allocation if memory access is what bottlenecks your application’s performance. Since how much data can come in from memory is what limits the speed of the job, running more tasks on the same memory bus won’t result in speed-up since all of those tasks are fighting over the path to memory.
See the answer to 1. Though it is all the same memory, each socket has its own memory controller and its own path to its local memory, which is why spreading tasks across sockets can still help a memory-bound job.
Use node allocation if some node-wide resource is what bottlenecks your application. This is the case with applications that are relying heavily on access to disk or to networks resources. Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.
If you need more RAM than a single node can provide, then you have no choice but to divide your program and use MPI.

Mongodb in Docker: numactl --interleave=all explanation

I'm trying to create Dockerfile for in-memory MongoDB based on official repo at https://hub.docker.com/_/mongo/.
In dockerfile-entrypoint.sh I've encountered:
numa='numactl --interleave=all'
if $numa true &> /dev/null; then
set -- $numa "$@"
fi
Basically it prepends numactl --interleave=all to the original docker command, when numactl exists.
But I don't really understand this NUMA policy thing. Can you please explain what NUMA really means, and what --interleave=all stands for?
And why do we need to use it to create MongoDB instance?
The man page mentions:
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
This isn't available on all architectures, which is why issue 14 made sure to invoke numactl only on NUMA machines.
As explained in "Set default numa policy to “interleave” system wide":
It seems that most applications that recommend explicit numactl definition either make a libnuma library call or incorporate numactl in a wrapper script.
The interleave=all policy alleviates the kind of issue met by applications like Cassandra (a distributed database for managing large amounts of structured data across many commodity servers):
By default, Linux attempts to be smart about memory allocations such that data is close to the NUMA node on which it runs. For big database type of applications, this is not the best thing to do if the priority is to avoid disk I/O. In particular with Cassandra, we're heavily multi-threaded anyway and there is no particular reason to believe that one NUMA node is "better" than another.
Consequences of allocating unevenly among NUMA nodes can include excessive page cache eviction when the kernel tries to allocate memory - such as when restarting the JVM.
For more, see "The MySQL “swap insanity” problem and the effects of the NUMA architecture"
Without numa
In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward.
The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node.
With numa:
An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:
# numactl --interleave all command
I mentioned in the comments that numa enumerates the hardware to understand the physical layout, and then divides the processors (not cores) into “nodes”.
With modern PC processors, this means one node per physical processor, regardless of the number of cores present.
That is a bit of an over-simplification, as Hristo Iliev points out:
AMD Opteron CPUs with a larger number of cores are actually 2-way NUMA systems on their own, with two HyperTransport-interconnected dies, each with its own memory controller, in a single physical package.
Also, Intel Haswell-EP CPUs with 10 or more cores come with two cache-coherent ring networks and two memory controllers and can be operated in a cluster-on-die mode, which presents itself as a two-way NUMA system.
It is wiser to say that a NUMA node is a set of cores that can reach some memory directly, without going through HT (HyperTransport), QPI (QuickPath Interconnect), NUMAlink or some other interconnect.
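What --interleave=all gives a process like mongod is a memory policy where its pages are spread round-robin over all nodes instead of piling up on whichever node the process happened to start on. A process could ask for the same policy itself through libnuma; here is a minimal sketch, assuming Linux with libnuma installed (compile with -lnuma), rather than anything MongoDB actually does in its own code:

#include <numa.h>      /* libnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "not a NUMA machine, nothing to interleave\n");
        return 0;
    }
    size_t len = 1UL << 30;                  /* a 1 GiB "database buffer" */
    char *buf = numa_alloc_interleaved(len); /* pages are placed round-robin across all nodes */
    memset(buf, 0, len);                     /* touching the pages actually places them */
    printf("interleaved across %d node(s)\n", numa_max_node() + 1);
    numa_free(buf, len);
    return 0;
}

The numactl wrapper in the entrypoint achieves the same thing for the whole mongod process without any code changes.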

Memory access in multi-core processors vs. multiple CPUs

I have a question,
Is it possible for a machine with multiple processors to access data from RAM (a single-RAM system)?
For example, if a machine has 2 processors p1 and p2 which are executing in parallel, is it possible for them to access the same RAM for reads and writes (of course, the writes are not to the same location)?
I understand that in multi-core machines this would not be possible, since the data bus is shared.
As long as the RAM is mapped to all cores or processors (such as in a multi-threaded application) it may be accessed from any core or processor.
There is no difference if you're discussing single processor/single core, single processor/multi-core, multi-processor/(each with a) single-core or multi-processor/multi-core. Since they have no system RAM of their own - the RAM in caches is not system RAM - the only RAM available to all of them is the system RAM.
The only difference between multi-processor/single-core (as in older systems) and single-processor/multi-core (modern systems) is that the former needs to coordinate RAM accesses with off-chip logic, whereas for the latter all coordination is on-chip, and sometimes even on-die, which of course results in much faster and more electrically efficient RAM accesses.
In the case of AMD's multi-processor/multi-core solutions, each processor owns part of the system RAM. The processors themselves are interconnected with high-speed (HyperTransport) data channels to facilitate accesses to RAM not owned by the processor accessing it.
In any case it is up to the programmer to decide how the processors/cores access the RAM. Sure they can read and/or write to the same location if that is what the programmer wants.
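To make that concrete, here is a minimal sketch with POSIX threads: the OS is free to schedule the two threads on different cores or even different processors, yet both read and write the same system RAM, just at different locations (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

static int shared[2];                  /* one slot per thread: same RAM, different locations */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    shared[id] = (id + 1) * 100;       /* no race: each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = {0, 1};

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("%d %d\n", shared[0], shared[1]);   /* prints: 100 200 */
    return 0;
}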