NUMA documentation for x86-64 processors?

I have already looked for NUMA documentation for x86-64 processors; unfortunately, I have only found optimization documents for NUMA.
What I want is this: how do I initialize NUMA in a system (this would include discovering the system's memory topology and processor topology)? Does anyone know of good documentation about NUMA for x86-64 AMD and Intel processors?

I know that if you want the system topology, you can get it from the ACPI SLIT (System Locality Information Table) or SRAT (Static Resource Affinity Table). You can read more about these in the ACPI spec (http://www.acpi.info/spec.htm), specifically sections 5.2.16 and 5.2.17.
Basically, you use the SRAT to determine which memory ranges are associated with which CPUs, and you use the SLIT to determine the relative cost of using a particular CPU/memory range. Both of these tables are optional, but in my experience most NUMA systems at least have a useful SRAT.
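To make that concrete, here is a minimal sketch in C of walking SRAT entries, assuming the kernel has already located the table (signature "SRAT") and using the entry layouts from ACPI section 5.2.16; record_cpu_node() and record_mem_node() are hypothetical hooks, not part of any real kernel:

    #include <stdint.h>

    /* Standard ACPI table header (36 bytes), shared by SRAT, SLIT, MADT, ... */
    struct acpi_sdt_header {
        char     signature[4];      /* "SRAT" for the Static Resource Affinity Table */
        uint32_t length;            /* total table length in bytes, header included  */
        uint8_t  revision;
        uint8_t  checksum;
        char     oem_id[6];
        char     oem_table_id[8];
        uint32_t oem_revision;
        uint32_t creator_id;
        uint32_t creator_revision;
    } __attribute__((packed));

    /* SRAT entry type 0: Processor Local APIC/SAPIC Affinity (16 bytes). */
    struct srat_cpu_affinity {
        uint8_t  type, length;
        uint8_t  proximity_domain_lo;   /* bits 7:0 of the NUMA domain */
        uint8_t  apic_id;
        uint32_t flags;                 /* bit 0: entry enabled */
        uint8_t  local_sapic_eid;
        uint8_t  proximity_domain_hi[3];
        uint32_t clock_domain;
    } __attribute__((packed));

    /* SRAT entry type 1: Memory Affinity (40 bytes). */
    struct srat_mem_affinity {
        uint8_t  type, length;
        uint32_t proximity_domain;
        uint16_t reserved1;
        uint32_t base_lo, base_hi;      /* base physical address of the range */
        uint32_t length_lo, length_hi;  /* length of the range */
        uint32_t reserved2;
        uint32_t flags;                 /* bit 0: entry enabled */
        uint64_t reserved3;
    } __attribute__((packed));

    /* Hypothetical kernel hooks that record the discovered topology. */
    void record_cpu_node(uint8_t apic_id, uint32_t node);
    void record_mem_node(uint64_t base, uint64_t len, uint32_t node);

    /* Walk the SRAT and report every enabled CPU / memory affinity entry. */
    void parse_srat(struct acpi_sdt_header *srat)
    {
        uint8_t *p   = (uint8_t *)srat + sizeof(*srat) + 12; /* 12 reserved bytes follow the header */
        uint8_t *end = (uint8_t *)srat + srat->length;

        while (p + 2 <= end) {
            uint8_t type = p[0], len = p[1];
            if (type == 0) {
                struct srat_cpu_affinity *c = (void *)p;
                if (c->flags & 1)
                    record_cpu_node(c->apic_id, c->proximity_domain_lo);
            } else if (type == 1) {
                struct srat_mem_affinity *m = (void *)p;
                if (m->flags & 1)
                    record_mem_node(((uint64_t)m->base_hi << 32) | m->base_lo,
                                    ((uint64_t)m->length_hi << 32) | m->length_lo,
                                    m->proximity_domain);
            }
            p += len;                   /* entries are variable length */
        }
    }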
As far as initialization goes, I don't think I can help much. You might want to look into how processors are brought up in the Linux kernel (or a BSD kernel). You'll probably need to read up on local APICs too, as they are used to start the x86 application processors (APs).
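On the last point: the usual x86 MP bring-up documented in the Intel SDM is an INIT IPI followed by one or two STARTUP IPIs sent through the local APIC's Interrupt Command Register. Here is a hedged sketch of that sequence, assuming the default xAPIC MMIO base at 0xFEE00000, a hypothetical delay_us() helper, and a real-mode trampoline already copied into low memory:

    #include <stdint.h>

    #define LAPIC_BASE   0xFEE00000u          /* default xAPIC MMIO base          */
    #define LAPIC_ICR_LO (LAPIC_BASE + 0x300) /* Interrupt Command Register, low  */
    #define LAPIC_ICR_HI (LAPIC_BASE + 0x310) /* Interrupt Command Register, high */

    static inline void lapic_write(uint32_t reg, uint32_t val)
    {
        *(volatile uint32_t *)(uintptr_t)reg = val;
    }

    /* Hypothetical busy-wait helper provided elsewhere in the kernel. */
    void delay_us(unsigned us);

    /* Wake an application processor: INIT IPI, wait, then STARTUP IPIs.
     * 'vector' is the page number of the 4 KiB real-mode trampoline
     * (e.g. 0x08 for code copied to physical address 0x8000). */
    void start_ap(uint8_t apic_id, uint8_t vector)
    {
        /* Select the target CPU in ICR[63:56]. */
        lapic_write(LAPIC_ICR_HI, (uint32_t)apic_id << 24);
        /* INIT IPI: delivery mode 101b, level assert. */
        lapic_write(LAPIC_ICR_LO, 0x00004500);
        delay_us(10000);                       /* ~10 ms per the MP spec */

        /* STARTUP IPI: delivery mode 110b plus the trampoline vector.
         * The SDM recommends sending it twice. */
        for (int i = 0; i < 2; i++) {
            lapic_write(LAPIC_ICR_HI, (uint32_t)apic_id << 24);
            lapic_write(LAPIC_ICR_LO, 0x00004600 | vector);
            delay_us(200);
        }
        /* A real kernel would also poll the ICR delivery-status bit (bit 12). */
    }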

Related

Is a microkernel possible without an MMU?

In the following link:
https://www.openhub.net/p/f9-kernel
the F9 Microkernel runs on Cortex-M, but the Cortex-M series doesn't have an MMU. My knowledge of MMUs and virtual memory is limited, hence the following questions.
How is visibility of the entire physical memory prevented for each process without an MMU?
Is it possible to achieve isolation with some static memory settings without an MMU (with enough on-chip RAM to run my application and kernel, and just different hard-coded memory regions for my limited set of processes)? But still, I don't understand: will this prevent the access?
ARM Cortex-M processors lack an MMU; an optional memory protection unit (MPU) is present in some implementations, such as STMicroelectronics' STM32F series.
Unlike other L4 kernels, the F9 microkernel is designed for MPU-only environments, optimized for Cortex-M3/M4, where the ARMv7 Protected Memory System Architecture (PMSAv7) model is supported. The system address space of a PMSAv7-compliant system is protected by an MPU. Also, the available RAM is typically small (about 256 KB), but a larger physical address space (up to 32-bit) can be used with the aid of bit-banding.
MPU-protected memory is divided into a set of regions, with the number of supported regions being IMPLEMENTATION DEFINED. For example, the STM32F429 provides 8 separate memory regions. In PMSAv7, the minimum protection region size is 32 bytes and the maximum is up to 4 GB. The MPU provides full support for:
Protection region
Overlapping protection region
Access permissions
Exporting memory attributes to the system
MPU mismatches and permission violations invoke the programmable priority MemManage fault handler.
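To make the region model concrete, here is a hedged C sketch that programs a single ARMv7-M MPU region through the architecturally defined MPU registers; the particular region number, base, size and permissions are only an example, not how F9 itself configures things:

    #include <stdint.h>

    /* ARMv7-M MPU registers in the System Control Space. */
    #define MPU_TYPE  (*(volatile uint32_t *)0xE000ED90) /* number of supported regions  */
    #define MPU_CTRL  (*(volatile uint32_t *)0xE000ED94) /* enable / background region   */
    #define MPU_RNR   (*(volatile uint32_t *)0xE000ED98) /* region number register       */
    #define MPU_RBAR  (*(volatile uint32_t *)0xE000ED9C) /* region base address register */
    #define MPU_RASR  (*(volatile uint32_t *)0xE000EDA0) /* region attribute and size    */

    /* Subset of RASR fields. */
    #define RASR_ENABLE        (1u << 0)
    #define RASR_SIZE(n)       ((uint32_t)(n) << 1)   /* region size = 2^(n+1) bytes      */
    #define RASR_AP_FULL       (3u << 24)             /* privileged and unprivileged RW   */
    #define RASR_XN            (1u << 28)             /* execute never                    */

    /* Example: give a task read/write access to a 32 KiB SRAM window at
     * 0x20000000 using region 2. The base must be aligned to the region size. */
    void mpu_setup_task_region(void)
    {
        MPU_RNR  = 2;                                  /* select region 2            */
        MPU_RBAR = 0x20000000;                         /* 32 KiB-aligned base        */
        MPU_RASR = RASR_AP_FULL | RASR_XN
                 | RASR_SIZE(14)                       /* 2^(14+1) = 32 KiB          */
                 | RASR_ENABLE;

        MPU_CTRL = (1u << 2) | (1u << 0);              /* PRIVDEFENA + MPU enable    */
        __asm__ volatile("dsb sy\n\tisb sy" ::: "memory"); /* make the change visible */
    }

Accesses outside the enabled regions (or violating their permissions) then raise the MemManage fault mentioned above.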
Memory management in the F9 microkernel can be split into three concepts:
memory pool, which represents an area of physical address space with specific attributes (hard-coded in the mem map table).
address space - a sorted list of fpages bound to a particular thread (or threads).
flexible page (fpage) - unlike traditional pages in L4, an fpage is represented by an MPU region instead.
Yes, but...
There is no requirement for an MMU at all; things just get less convenient and flexible. Practically, anything that provides some form of isolation (e.g. an MPU) might be good enough to make a system work, assuming you need isolation at all. If you don't need it for some reason and just want the kernel to do scheduling, then a kernel can do that without an MMU or MPU as well.

Multicore CPUs, Different types of CPUs and operating systems

An operating system should support a CPU architecture, not a specific CPU. For example, if some company has three types of CPUs, all based on the x86 architecture,
one a single-core processor, another a dual-core, and the last one with five cores, the operating system isn't CPU-model based, it's architecture based. So how does the kernel know whether the CPU it is running on supports multi-core processing, or how many cores it even has?
Also, take timer interrupts for example: some members of Intel's i386 processor family use the PIT and others use the APIC timer to generate periodic timed interrupts. How does the operating system recognize that, if it wants, for example, to configure it? (Specifically regarding timers, I know they are usually set up by the BIOS, but the ISR handlers for timer interrupts should also recognize which timer mechanism they are running on in order to disable / enable / modify it when handling an interrupt.)
Is there such a thing as a CPU driver that is relevant to the OS and not the BIOS? Also, if someone could refer me to somewhere I could gain more knowledge about how multi-core processing is triggered / implemented by the kernel in terms of code, that would be great.
The operating system kernel almost always has an abstraction layer called the HAL, which provides an interface above the hardware that the rest of the kernel can easily use. This HAL is architecture-dependent, not model-dependent. The CPU architecture has to define some invocation method that allows the HAL to find out which features are and aren't present in the executing processor.
On the IA-32/Intel 64 architecture, there is an instruction known as CPUID. This raises another question:
Was CPUID present from the beginning?
No, CPUID wasn't present in the earliest CPUs. In fact, it came a lot later; it first appeared in later IA-32 processors (late-model i486 and Pentium). Bit 21 (the ID flag) of the EFLAGS register indicates support for the CPUID instruction, according to Intel Manual Volume 2A: if software can toggle this bit, the processor supports CPUID.
PUSHFD
Using the PUSHFD/POPFD instructions, you can copy the contents of the EFLAGS register onto the stack, flip bit 21, write the value back, and then read EFLAGS again to check whether the change took effect.
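As a rough illustration, here is a hedged sketch of that check for 32-bit x86 using GCC/Clang inline assembly (in 64-bit long mode CPUID is always available, so the check only matters on old 32-bit parts):

    #include <stdint.h>

    /* Returns nonzero if bit 21 (ID) of EFLAGS can be toggled, which per the
     * Intel SDM means the CPUID instruction is supported. 32-bit x86 only. */
    static int cpuid_supported(void)
    {
        uint32_t before, after;

        __asm__ volatile(
            "pushfl\n\t"                /* save the caller's EFLAGS            */
            "pushfl\n\t"
            "popl   %0\n\t"             /* before = current EFLAGS             */
            "movl   %0, %1\n\t"
            "xorl   $0x200000, %1\n\t"  /* flip bit 21 (the ID flag)           */
            "pushl  %1\n\t"
            "popfl\n\t"                 /* try to install the modified value   */
            "pushfl\n\t"
            "popl   %1\n\t"             /* after = what the CPU actually kept  */
            "popfl"                     /* restore the original EFLAGS         */
            : "=&r"(before), "=&r"(after));

        return (before ^ after) & 0x200000;   /* bit changed => CPUID exists */
    }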
How does CPUID return information, if it is just an instruction?
The CPUID instruction returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers. Its output depends on the values put into the EAX and ECX registers before execution.
Each value (valid for CPUID) that can be put in the EAX register is known as a CPUID leaf. Some leaves have sub-leaves, i.e. they also depend on a sub-leaf value in the ECX register.
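For example, from user space with GCC or Clang you can query leaves through the <cpuid.h> helper instead of writing assembly yourself; a small sketch:

    #include <stdio.h>
    #include <string.h>
    #include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 0: maximum supported basic leaf in EAX, vendor string in EBX:EDX:ECX. */
        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID not supported");
            return 1;
        }
        char vendor[13];
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);
        vendor[12] = '\0';
        printf("vendor: %s, max basic leaf: %u\n", vendor, eax);

        /* Leaf 1: feature flags, e.g. EDX bit 28 = Hyper-Threading capable. */
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            printf("HTT flag: %u\n", (edx >> 28) & 1);

        return 0;
    }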
How is multi-core support detected at the OS kernel level?
There is a standard known as ACPI (Advanced Configuration and Power Interface) which defines a set of ACPI tables. These include the MADT (Multiple APIC Description Table). This table contains entries that describe local APICs, I/O APICs, interrupt source overrides, and much more. Each local APIC is associated with exactly one logical processor, as you should know.
Using this table, the kernel can get the APIC ID of each local APIC present in the system (at least of those CPUs that are working properly). The APIC ID itself is divided bit-by-bit into topological IDs, whose bit offsets are obtained using CPUID. This lets the OS know where each logical processor is located: its domain, chip, core, and hyper-threading ID.
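Here is a hedged sketch of what that walk can look like, assuming the kernel has already located the MADT (signature "APIC") through the RSDT/XSDT and using the entry layout from the ACPI spec:

    #include <stdint.h>

    /* Standard ACPI table header (36 bytes). */
    struct acpi_sdt_header {
        char     signature[4];   /* "APIC" for the MADT */
        uint32_t length;
        uint8_t  revision, checksum;
        char     oem_id[6], oem_table_id[8];
        uint32_t oem_revision, creator_id, creator_revision;
    } __attribute__((packed));

    /* MADT = header + local APIC address + flags, then variable-length entries. */
    struct madt {
        struct acpi_sdt_header hdr;
        uint32_t lapic_address;
        uint32_t flags;
        uint8_t  entries[];
    } __attribute__((packed));

    /* Entry type 0: Processor Local APIC. */
    struct madt_lapic {
        uint8_t  type, length;        /* type = 0, length = 8     */
        uint8_t  acpi_processor_id;
        uint8_t  apic_id;
        uint32_t flags;               /* bit 0: processor enabled */
    } __attribute__((packed));

    /* Count usable logical processors by walking the MADT entries. */
    int count_logical_cpus(struct madt *m)
    {
        int      cpus = 0;
        uint8_t *p    = m->entries;
        uint8_t *end  = (uint8_t *)m + m->hdr.length;

        while (p + 2 <= end) {
            uint8_t type = p[0], len = p[1];
            if (type == 0) {          /* local APIC entry => one logical CPU */
                struct madt_lapic *l = (void *)p;
                if (l->flags & 1)     /* only count enabled processors */
                    cpus++;
            }
            p += len;
        }
        return cpus;
    }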

Mongodb in Docker: numactl --interleave=all explanation

I'm trying to create Dockerfile for in-memory MongoDB based on official repo at https://hub.docker.com/_/mongo/.
In dockerfile-entrypoint.sh I've encountered:
numa='numactl --interleave=all'
if $numa true &> /dev/null; then
    set -- $numa "$@"
fi
Basically, it prepends numactl --interleave=all to the original docker command when the numactl test invocation succeeds.
But I don't really understand this NUMA policy thing. Can you please explain what NUMA really means, and what --interleave=all stands for?
And why do we need to use it to create MongoDB instance?
The man page mentions:
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
This isn't available on all architectures, which is why issue 14 made sure to invoke numactl only on NUMA machines.
As explained in "Set default numa policy to “interleave” system wide":
It seems that most applications that recommend explicit numactl definition either make a libnuma library call or incorporate numactl in a wrapper script.
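For illustration, the libnuma call that roughly corresponds to wrapping a process with numactl --interleave=all is numa_set_interleave_mask(); a minimal hedged sketch (compile and link with -lnuma):

    #include <stdio.h>
    #include <numa.h>    /* libnuma, e.g. from the libnuma-dev package */

    int main(void)
    {
        /* numa_available() returns -1 if the kernel has no NUMA support;
         * in that case no other libnuma call may be used. */
        if (numa_available() == -1) {
            puts("no NUMA support on this machine");
            return 0;
        }

        /* Spread future memory allocations round-robin over all nodes,
         * which is what `numactl --interleave=all` arranges for a child process. */
        numa_set_interleave_mask(numa_all_nodes_ptr);

        /* ... allocate and run the memory-hungry workload here ... */
        return 0;
    }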
The interleave=all setting alleviates the kind of issue met by applications like Cassandra (a distributed database for managing large amounts of structured data across many commodity servers):
By default, Linux attempts to be smart about memory allocations such that data is close to the NUMA node on which it runs. For big database type of applications, this is not the best thing to do if the priority is to avoid disk I/O. In particular with Cassandra, we're heavily multi-threaded anyway and there is no particular reason to believe that one NUMA node is "better" than another.
Consequences of allocating unevenly among NUMA nodes can include excessive page cache eviction when the kernel tries to allocate memory - such as when restarting the JVM.
For more, see "The MySQL “swap insanity” problem and the effects of the NUMA architecture"
Without NUMA:
In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward.
The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node.
With NUMA:
An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:
# numactl --interleave all command
I mentioned in the comments that NUMA enumerates the hardware to understand the physical layout, and then divides the processors (not cores) into "nodes".
With modern PC processors, this means one node per physical processor (socket), regardless of the number of cores present.
That is a bit of an over-simplification, as Hristo Iliev points out:
AMD Opteron CPUs with a larger number of cores are actually 2-way NUMA systems on their own, with two HyperTransport-interconnected dies, each with its own memory controller, in a single physical package.
Also, Intel Haswell-EP CPUs with 10 or more cores come with two cache-coherent ring networks and two memory controllers and can be operated in a cluster-on-die mode, which presents itself as a two-way NUMA system.
It is wiser to say that a NUMA node is a set of cores that can reach some memory directly, without going through HyperTransport (HT), QPI (QuickPath Interconnect), NUMAlink, or some other interconnect.
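If you want to inspect those nodes programmatically rather than with numactl --hardware, libnuma can enumerate them; a small hedged sketch (again linked with -lnuma):

    #include <stdio.h>
    #include <numa.h>   /* libnuma, link with -lnuma */

    /* Print each NUMA node the kernel has configured, with its memory size,
     * similar in spirit to the node list shown by `numactl --hardware`. */
    int main(void)
    {
        if (numa_available() == -1) {
            puts("not a NUMA system (or no kernel NUMA support)");
            return 0;
        }
        int max_node = numa_max_node();           /* highest node number */
        for (int n = 0; n <= max_node; n++) {
            long long free_bytes = 0;
            long long size = numa_node_size64(n, &free_bytes);
            if (size >= 0)
                printf("node %d: %lld MiB total, %lld MiB free\n",
                       n, size >> 20, free_bytes >> 20);
        }
        return 0;
    }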

Where is the BIOS code loaded and how much memory does it take?

Could someone tell me where the BIOS code is loaded when the CPU resets, and how much memory it takes on different CPU architectures? I only know about 64 KB.
The BIOS is located in read-only memory (ROM). An x86 CPU starts executing instructions at the reset vector, physical address 0xFFFFFFF0 (4 GB minus 16 bytes), which the chipset maps to the system ROM. For more information about the x86 boot-up process, see http://en.wikipedia.org/wiki/BIOS#System_startup
As for how much memory the BIOS takes, it depends on many things. It isn't only dependent on the CPU architecture; it also depends on the system vendor. Different vendors use different BIOSes that may have different sizes. According to Wikipedia, "BIOS versions now exist with sizes up to 16 megabytes", so perhaps that answers your question about the size of the BIOS.

minimal number of privileged instructions?

Say we want to write an OS with the minimal number of privileged instructions.
I think it should be 1: only the instruction that sets the MMU register. But what about other things, e.g. the mode bit, traps?
Well, you can implement an operating system with everything running in system mode, and then you could argue that there are no "privileged" instructions.
As to whether you could implement an OS with privileged and non-privileged modes using N different privileged instructions:
it would depend on the functionality you aimed to implement,
it would depend on the hardware instruction set, MMU design, etcetera, and
unless you were prepared to put months / years into a theoretical analysis, it would be a matter of debate / opinion as to whether your proposed answer was indeed correct.
An operating system needs to provide security (including memory isolation between different programs) and abstraction (each program shouldn't need to care how much physical memory is available).
To maintain these, you need at least one privileged instruction.
That privileged instruction sets up the Memory Management Unit registers so that you can ensure memory is protected. There should be no I/O instructions; all I/O and interrupt access should be memory mapped.
Use the MMU to ensure that kernel memory, kernel code, the interrupt-access memory, and the devices' memory-mapped I/O interfaces are not mapped into user space, so user processes cannot access them. These regions live in kernel memory.