On x86-64, is the "movnti" instruction atomic? - x86-64

On x86-64 CPUs (either Intel or AMD), is the "movnti" instruction that writes 4/8 bytes to a 32/64-bit aligned address atomic?

Yes, movnti is atomic on naturally-aligned addresses, just like all other naturally-aligned 8/16/32/64-bit stores (and loads) on x86. This applies regardless of memory type (writeback, write-combining, uncacheable, etc.). See Why is integer assignment on a naturally aligned variable atomic on x86? for the wording of the guarantees in Intel's x86 manual.
Note that atomicity is separate from memory ordering. Normal x86 stores are release-store operations, but movnt stores are "relaxed".
Fun fact: 32-bit code can use x87 (fild/fistp) or SSE/MMX movq to do atomic 64-bit loads/stores. gcc's std::atomic implementation actually does this. It's only SSE/AVX accesses larger than 8B (e.g. movaps or movntps 16B/32B/64B vector stores) that are not guaranteed atomic. (Even 16B operations are atomic on some hardware, but there's no standard way to detect this.)

Another answer argues it seems clearly not:
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation such as SFENCE should be used in conjunction with MOVNTI instructions if multiple processors might use different memory types to read/write the memory location.
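
Tying the two answers together: the 8-byte store itself is atomic when naturally aligned, but because movnt stores are weakly ordered you normally pair them with sfence before publishing the data. A minimal sketch using the SSE2 intrinsics (GCC/Clang; _mm_stream_si64 is only available when compiling for x86-64); the helper name nt_store_u64 is just for illustration:

    #include <stdint.h>
    #include <emmintrin.h>   /* _mm_stream_si64 (SSE2), _mm_sfence */

    /* Non-temporal (movnti) store of a 64-bit value to an 8-byte-aligned
     * location, followed by sfence so the store becomes globally visible
     * before any later "data ready" flag store. Keeping dst naturally
     * aligned is the caller's responsibility; that alignment is what
     * makes the store atomic. */
    static void nt_store_u64(uint64_t *dst, uint64_t value)
    {
        _mm_stream_si64((long long *)dst, (long long)value); /* movnti */
        _mm_sfence();                    /* order it before later stores */
    }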

Related

Is the x86_64 architecture continuously being updated?

As we know, ARM updates the ARM architecture continuously, most recently releasing v9 I think.
But is the x86_64 architecture also being continuously updated by Intel or AMD?
x86-64 does extensions by name, with only a de-facto policy (by Intel) of having future CPUs support all the extensions previous CPUs did (i.e. backwards compatibility).
Even that is fragmenting somewhat, with Intel introducing new ISA extensions in server CPUs but not in contemporary desktop CPUs; for example, movbe appeared in Atom significantly before mainstream CPUs (Haswell). Intel also kept selling Pentium / Celeron CPUs without AVX or BMI1/BMI2. (Although Ice Lake and later Pentium / Celeron may finally handle 256-bit vectors with AVX2 and thus decode VEX prefixes, enabling BMI1/BMI2 as well.)
AMD sometimes even drops support for their ISA extensions if Intel never adopts them. (Like XOP introduced in Bulldozer-family, dropped in Zen. And FMA4 again from Bulldozer, officially dropped in Zen but still works in Zen 1, really gone in Zen 2.) See also Agner Fog's blog article Stop the instruction set war.
There unfortunately isn't an agreed-upon mechanism between vendors for architecture versions, so for example atomicity of aligned stores of various widths is specified by Intel in terms of "486 or later", "Pentium and later", "P6-family and later". See Why is integer assignment on a naturally aligned variable atomic on x86?
Note that the common subset of Intel's and AMD's atomicity guarantees for loads/stores to cacheable memory actually comes from AMD in this case: Intel guarantees no tearing for any 2, 4, or 8-byte store that doesn't cross a cache-line boundary. But AMD only guarantees atomicity for those sizes within an aligned 8-byte chunk, and multi-socket K10 truly does tear in transfers between sockets.
Nowhere is there a single document that covers the lowest common denominator of functionality and instruction-set extensions across modern x86-64 CPUs.
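
Since there's no agreed-upon version number, software usually probes individual extensions at run time instead. A small sketch using GCC/Clang's __builtin_cpu_supports, which wraps CPUID:

    #include <stdio.h>

    /* Run-time feature probing: "x86-64" alone doesn't pin down which
     * extensions are present, so check each one you care about. */
    int main(void)
    {
        __builtin_cpu_init();   /* initialize the feature data used below */
        printf("AVX2:   %d\n", __builtin_cpu_supports("avx2") != 0);
        printf("BMI2:   %d\n", __builtin_cpu_supports("bmi2") != 0);
        printf("SSE4.2: %d\n", __builtin_cpu_supports("sse4.2") != 0);
        return 0;
    }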

Does Cache empty itself if idle for a long time?

Does cache memory refresh itself if it doesn't encounter any instruction for a threshold amount of time?
What I mean is that suppose, I have a multi-core machine and I have isolated core on it. Now, for one of the cores, there was no activity for say a few seconds. In this case, will the last instructions from the instruction cache be flushed after a certain amount of time has passed?
I understand this can be architecture dependent but I am looking for general pointers on the concept.
If a cache is power-gated in a particular idle state and if it's implemented using a volatile memory technology (such as SRAM), the cache will lose its contents. In this case, to maintain the architectural state, all dirty lines must be written to some memory structure that will retain its state (such as the next level of the memory hierarchy). Most processors support power-gating idle states. For example, on Intel processors, in the core C6 and deeper states, the core is fully power-gated including all private caches. When the core wakes up from any of these states, the caches will be cold.
It can be useful in an idle state, for the purpose of saving power, to flush a cache but not power-gate it. The ACPI specification defines such a state, called C3, in Section 8.1.4 (of version 6.3):
While in the C3 state, the processor’s caches maintain state but the
processor is not required to snoop bus master or multiprocessor CPU
accesses to memory.
Later in the same section it elaborates that C3 doesn't require preserving the state of the caches, but also doesn't require flushing them. Essentially, a core in ACPI C3 doesn't guarantee cache coherence. In an implementation of ACPI C3, either the system software would be required to manually flush the cache before having a core enter C3, or the hardware would employ some mechanism to ensure coherence (flushing is not the only way). This idle state can potentially save more power compared to shallower states by not having to engage in cache coherence.
To the best of my knowledge, the only processors that implement a non-power-gating version of ACPI C3 are those from Intel, starting with the Pentium II. All existing Intel x86 processors can be categorized according to how they implement ACPI C3:
Intel Core and later and Bonnell and later: The hardware state is called C3. The implementation uses multiple power-reduction mechanisms. The one relevant to the question flushes all the core caches (instruction, data, uop, paging unit), probably by executing a microcode routine on entry to the idle state. That is, all dirty lines are written back to the closest shared level of the memory hierarchy (L2 or L3) and all valid clean lines are invalidated. This is how cache coherency is maintained. The rest of the core state is retained.
Pentium II, Pentium III, Pentium 4, and Pentium M: The hardware state is called Sleep in these processors. In the Sleep state, the processor is fully clock-gated and doesn't respond to snoops (among other things). On-chip caches are not flushed and the hardware doesn't provide an alternative mechanism that protects the valid lines from becoming incoherent. Therefore, the system software is responsible for ensuring cache coherence. Otherwise, Intel specifies that if a snoop request occurs to a processor that is transitioning into or out of Sleep or already in Sleep, the resulting behavior is unpredictable.
All others don't support ACPI C3.
Note that clock-gating saves power by:
Turning off the clock generation logic, which itself consumes power.
Turning off any logic that does something on each clock cycle.
With clock-gating, dynamic power is reduced to essentially zero. But static power is still consumed to maintain state in the volatile memory structures.
Many processors include at least one level of on-chip cache that is shared between multiple cores. The processors branded Core Solo and Core Duo (whether based on the Enhanced Pentium M or Core microarchitectures) introduced an idle state that implements ACPI C3 at the package level, where the shared cache may be gradually power-gated and restored (Intel's package-level states correspond to system-level states in the ACPI specification). This hardware state is called PC7, Enhanced Deeper Sleep State, Deep C4, or other names depending on the processor. The shared cache is much larger than the private caches, and so it would take much more time to fully flush, which can reduce the effectiveness of PC7. Therefore, it's flushed gradually (the last core of the package that enters CC7 performs this operation). In addition, when the package exits PC7, the shared cache is enabled gradually as well, which may reduce the cost of entering PC7 next time. This is the basic idea, but the details depend on the processor. In PC7, significant portions of the package are power-gated.
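
As a side note, on Linux you can list the hardware idle states the kernel exposes (and their descriptions, which on Intel name the C-states discussed above) through the cpuidle sysfs interface. A small sketch, assuming the standard /sys/devices/system/cpu/cpu0/cpuidle/ layout:

    #include <stdio.h>
    #include <string.h>

    /* List the idle states the Linux cpuidle framework exposes for CPU 0.
     * The number of states varies by CPU and idle driver, so we stop at
     * the first missing one. */
    int main(void)
    {
        for (int i = 0; ; i++) {
            char path[128], name[64] = "", desc[128] = "";
            FILE *f;

            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", i);
            f = fopen(path, "r");
            if (!f)
                break;                              /* no more states */
            if (fgets(name, sizeof name, f))
                name[strcspn(name, "\n")] = '\0';   /* e.g. "C6" */
            fclose(f);

            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cpuidle/state%d/desc", i);
            if ((f = fopen(path, "r"))) {
                if (fgets(desc, sizeof desc, f))
                    desc[strcspn(desc, "\n")] = '\0';
                fclose(f);
            }
            printf("state%d: %-8s %s\n", i, name, desc);
        }
        return 0;
    }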
It depends on what you mean by "idle" - specifically whether being "idle" involves the cache being powered or not.
Caches are usually built from SRAM cells, which preserve the data stored in them as long as the cells are powered (in contrast to DRAM, which needs to be periodically refreshed). Peter alluded to this in his comment: if power is cut off, not even an SRAM cell can maintain its state, and the data is lost.

Which component manages or provides instructions to the control unit in a processor?

I am newbie to computer architecture and I have the following questions,
Which unit or component controls operations such as incrementing the program counter, loading the instruction into the IR, and the rest of the fetch-decode-execute-write cycle?
If it is the control unit, how does it know when to perform the operations?
Is the Operating system involved in any of these tasks except scheduling which program to execute?
Why does it matter if the OS is 32 bit or 64 bit? Shouldn't we worry about the compiler or interpreter in this case?
Which unit or component controls operations such as incrementing the
program counter, loading the instruction into the IR, and the rest of
the fetch-decode-execute-write cycle?
A processor can either have a centralized control unit or a distributed control unit. Processors that are not pipelined or that have a two-stage pipeline (i.e., fetch and execute) use a centralized control unit. More sophisticated processors use distributed control where each stage of the pipeline may generate control signals. The term control refers to operations such as fetching instructions, reading data from and writing data to memory, determining the execution unit that can execute a given instruction, and determining dependencies between instructions. This is in contrast to the term datapath, which refers to the part of the CPU that contains the execution units and registers.
Ancient CPUs consisted of two components called the control path (aka control unit) and the datapath. You might have seen these terms in computer architecture textbooks. An example of such CPUs is the Intel 8086. In the 8086, the control unit is called the bus interface unit (BIU) and is responsible for the following tasks:
Calculating the physical address of the next instruction to fetch.
Calculating the physical address of memory or I/O locations to read from or write to (i.e., to perform load and store operations).
Fetching instruction bytes from memory and placing them into a buffer.
Reading and writing to memory or I/O devices.
The datapath of the 8086 is also called the execution unit and is responsible for the following tasks:
Reading the values of the registers specified by the instruction to be executed.
Writing the result of an instruction to the specified register.
Generating branch results or memory or I/O requests to the BIU. At any given cycle, the BIU has to arbitrate between either performing a read/write operation or an instruction fetch operation.
Executing instructions in the ALU.
The 8086 can be described as a pipelined processor with two stages corresponding to the two units. There is basically no decoding; the instruction bytes are hardwired to the ALU and register file to perform the specified operation.
If it is the control unit, how does it know when to perform the
operations?
Every instruction has an opcode, which is just a bunch of bits that identify the operation that needs to be performed by the ALU (on the 8086). The opcode bits can be simply hardwired by design to the ALU so that it does the right thing automatically. Other bits of the instruction may specify the operands or register identifiers, which may be passed to the register file to perform a register read or write operation.
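
As a software analogy of that fetch-decode-execute loop (only an analogy: in hardware the decoding is wiring from the opcode bits to control lines, not a chain of comparisons), here is a toy interpreter; the 4-byte instruction format and the opcode values are invented for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy fetch-decode-execute loop. Each instruction is 4 bytes:
     * opcode, destination register, operand A, operand B. */
    enum { OP_HALT = 0, OP_MOVI = 1, OP_ADD = 2, OP_SUB = 3 };

    int main(void)
    {
        uint8_t regs[4] = {0};
        /* program: r0 = 5; r1 = 3; r2 = r0 + r1; halt */
        uint8_t mem[] = { OP_MOVI,0,5,0,  OP_MOVI,1,3,0,
                          OP_ADD,2,0,1,   OP_HALT,0,0,0 };
        unsigned pc = 0;

        for (;;) {
            /* fetch: read the instruction at PC, then advance PC */
            uint8_t op = mem[pc], d = mem[pc+1], a = mem[pc+2], b = mem[pc+3];
            pc += 4;
            /* decode + execute: the opcode selects the operation */
            if (op == OP_HALT)      break;
            else if (op == OP_MOVI) regs[d] = a;
            else if (op == OP_ADD)  regs[d] = regs[a] + regs[b];
            else if (op == OP_SUB)  regs[d] = regs[a] - regs[b];
        }
        printf("r2 = %u\n", regs[2]);   /* prints 8 */
        return 0;
    }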
Is the Operating system involved in any of these tasks except
scheduling which program to execute?
No.
Why does it matter if the OS is 32 bit or 64 bit? Shouldn't we worry
about the compiler or interpreter in this case?
It depends on whether the processor supports a 32-bit operating mode and a 64-bit operating mode. The number and/or size of registers and the supported features differ between modes, which is why the instruction encoding is different. For example, the 32-bit x86 instruction set defines 8 32-bit general-purpose architectural registers while the 64-bit x86 instruction set (also called x86-64) defines 16 64-bit general-purpose architectural registers. In addition, the size of a virtual memory address is 64-bit on x86-64 and 32-bit on x86, which not only impacts the instruction encoding but also the format of an executable binary (see Executable and Linkable Format). A modern x86 processor generally supports both modes of operation.
Since the instruction encoding is different, the binary executables that contain the instructions are different for different modes and can only run in the mode they were compiled for. A 32-bit OS means that the OS binaries require the processor to operate in 32-bit mode. This also poses a restriction that any application run on that OS must also be 32-bit, because a 32-bit OS doesn't know how to run a 64-bit application (the ABI is different). On the other hand, a 64-bit OS is typically designed to run both 32-bit and 64-bit applications.
The gcc compiler will by default emit 64-bit x86 binaries if the OS it is running on is 64-bit. However, you can override that by specifying the -m32 compiler switch so that it emits a 32-bit x86 binary instead. You can use objdump to observe the many differences between a 32-bit executable and a 64-bit executable.
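For instance, this small program (the file name sizes.c is just a placeholder) makes the ABI difference visible when built both ways on a 64-bit Linux system (the -m32 build needs the 32-bit libraries installed):

    #include <stdio.h>

    /* Build twice and compare:
     *     gcc      sizes.c -o sizes64   # 64-bit by default on a 64-bit OS
     *     gcc -m32 sizes.c -o sizes32   # 32-bit x86 binary
     */
    int main(void)
    {
        printf("sizeof(void *) = %zu\n", sizeof(void *)); /* 8 vs 4 */
        printf("sizeof(long)   = %zu\n", sizeof(long));   /* 8 vs 4 on Linux */
        return 0;
    }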

Multicore CPUs, Different types of CPUs and operating systems

An operating system should support a CPU architecture, not a specific CPU. For example, say some company has three types of CPUs all based on the x86 architecture:
one a single-core processor, another a dual-core, and the last one with five cores. The operating system isn't CPU-model based, it's architecture based, so how would the kernel know whether the CPU it is running on supports multi-core processing, or how many cores it even has?
Also, take timer interrupts for example: some members of Intel's i386 processor family use the PIT and others use the APIC timer to generate periodic timer interrupts. How does the operating system recognize which one is present if it wants, for example, to configure it? (Specifically regarding timers, I know they are usually set up by the BIOS, but the ISR handlers for timer interrupts should also recognize which timer mechanism they are running on in order to disable / enable / modify it when handling some interrupt.)
Is there such a thing as a CPU driver that is relevant to the OS and not the BIOS? Also, if someone could refer me to somewhere I could gain more knowledge about how multi-core processing is triggered / implemented by the kernel in terms of "code", that would be great.
The operating system kernel almost always has an abstraction layer called the HAL, which provides an interface above the hardware that the rest of the kernel can easily use. The HAL is architecture-dependent, not model-dependent. The CPU architecture has to define some invocation method that allows the HAL to discover which features are and aren't present in the executing processor.
On the IA32/64 architecture, there is an instruction known as CPUID. You may then ask:
Was CPUID present from the beginning?
No, CPUID wasn't present in the earliest x86 CPUs. It came a lot later, first appearing in later i486 processors. Bit 21 of the EFLAGS register (the ID flag) indicates support for the CPUID instruction, according to Intel Manual Volume 2A.
Using the PUSHFD instruction, you can copy the contents of the EFLAGS register onto the stack and check whether bit 21 is set.
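
Here is a sketch of that check in GCC/Clang inline assembly: try to flip bit 21 (the ID flag) of EFLAGS, and if the flip sticks, CPUID is supported. In 64-bit mode CPUID is always available, so the check only matters for 32-bit code; the helper name cpuid_supported is illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    static bool cpuid_supported(void)
    {
    #if defined(__x86_64__)
        return true;                    /* CPUID is always present in 64-bit mode */
    #elif defined(__i386__)
        uint32_t before, after;
        __asm__ volatile(
            "pushfl\n\t"                /* save EFLAGS on the stack          */
            "popl   %0\n\t"             /* before = EFLAGS                   */
            "movl   %0, %1\n\t"
            "xorl   $0x200000, %1\n\t"  /* flip bit 21 (the ID flag)         */
            "pushl  %1\n\t"
            "popfl\n\t"                 /* try to write the modified EFLAGS  */
            "pushfl\n\t"
            "popl   %1"                 /* after = EFLAGS                    */
            : "=&r"(before), "=&r"(after)
            :
            : "cc");
        return ((before ^ after) & 0x200000) != 0;
    #else
        return false;
    #endif
    }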
How does CPUID return information, if it is just an instruction?
The CPUID instruction returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers. Its output depends on the values put into the EAX and ECX registers before execution.
Each valid value that can be put in the EAX register is known as a CPUID leaf. Some leaves have subleaves, i.e., their output also depends on a sub-leaf value in the ECX register.
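
A small sketch using the __get_cpuid_count helper from GCC/Clang's <cpuid.h>: leaf 0 returns the maximum supported leaf and the vendor string, and leaf 7 subleaf 0 returns the structured extended feature flags (AVX2 is EBX bit 5):

    #include <cpuid.h>    /* GCC/Clang helper around the CPUID instruction */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;

        /* Leaf 0: EAX = highest supported leaf, EBX/EDX/ECX = vendor string. */
        if (__get_cpuid_count(0, 0, &eax, &ebx, &ecx, &edx)) {
            char vendor[13] = {0};
            memcpy(vendor + 0, &ebx, 4);
            memcpy(vendor + 4, &edx, 4);
            memcpy(vendor + 8, &ecx, 4);
            printf("max leaf %u, vendor %s\n", eax, vendor);
        }

        /* Leaf 7, subleaf 0: structured extended features. */
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            printf("AVX2: %s\n", (ebx & (1u << 5)) ? "yes" : "no");
        return 0;
    }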
How is multi-core support detected at the OS kernel level?
There is a standard known as ACPI (Advanced Configuration and Power Interface) which defines a set of ACPI tables. These include the MADT (Multiple APIC Description Table). This table contains entries with information about local APICs, I/O APICs, interrupt redirections, and much more. Each local APIC is associated with exactly one logical processor, as you should know.
Using this table, the kernel can get the APIC ID of each local APIC present in the system (only those whose CPUs are working properly). The APIC ID itself is divided into topological IDs (bit fields) whose bit offsets are obtained using CPUID. This allows the OS to know where each logical CPU is located: its domain, chip, core, and hyperthreading ID.
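
A sketch of that CPUID-based topology walk using leaf 0x0B (Extended Topology Enumeration, available on Nehalem and later Intel CPUs and recent AMD CPUs): each sub-leaf reports a level type (1 = SMT, 2 = core) and how many x2APIC-ID bits to shift away to reach the next level, and EDX returns the full x2APIC ID of the current logical processor:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        for (unsigned level = 0; ; level++) {
            if (!__get_cpuid_count(0x0B, level, &eax, &ebx, &ecx, &edx))
                break;                         /* leaf 0x0B not supported */
            unsigned type = (ecx >> 8) & 0xFF; /* 1 = SMT, 2 = core */
            if (type == 0)
                break;                         /* type 0 marks the end */
            printf("level %u: type=%u shift=%u x2apic_id=0x%x\n",
                   level, type, eax & 0x1F, edx);
        }
        return 0;
    }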

Are Condition Codes / Flags stored in the processor registers or the main memory?

I'm new to this, so I want to make sure that my comprehension of what I read is correct.
Also, registers are always processor registers, and there are no other registers which are not a part of the processor (like registers in primary/secondary memory), correct?
Most architectures have a dedicated register for storing flags. Modern x86, for example, has one 32-bit register (EFLAGS) that stores all the flags. Storing the flags in main memory would make accessing them incredibly slow compared to a register. Some architectures also support moving the flags to another register or directly onto the stack, and vice versa.
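
For example, on x86-64 the flags register can be copied to a general-purpose register via the stack with pushfq / popq. A sketch using GCC/Clang inline assembly (x86-64 only; ZF happens to be bit 6 of RFLAGS):

    #include <stdint.h>
    #include <stdio.h>

    /* Copy RFLAGS into an ordinary register by pushing it onto the stack
     * and popping it back off. */
    static uint64_t read_flags(void)
    {
        uint64_t flags;
        __asm__ volatile("pushfq\n\t"
                         "popq %0"
                         : "=r"(flags));
        return flags;
    }

    int main(void)
    {
        uint64_t f = read_flags();
        printf("RFLAGS = 0x%llx, ZF = %llu\n",
               (unsigned long long)f, (unsigned long long)((f >> 6) & 1));
        return 0;
    }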
When talking about registers, most people are referring to registers in a processor. That's not to say there aren't any registers in your PC besides the ones in your CPU. GPUs, for example, also have registers. Your memory could have a register to temporarily store read/write addresses or keep track of other information, but when looking at processors you usually won't need to know about those.