Do cache block replacement policies in x86 prefer clean blocks over dirty blocks? - x86-64

Do cache block replacement policies in x86 systems prefer clean blocks over dirty blocks? I am interested in knowing whether cache-line flush instructions such as clwb influence the cache replacement policy, since clwb retains the block in the cache in a clean state.
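For concreteness, here is a minimal sketch of the kind of code I have in mind, assuming a CPU with CLWB support and a compiler exposing the _mm_clwb intrinsic (compiled with something like gcc -mclwb); the function name is just for illustration:

#include <immintrin.h>  /* _mm_clwb, _mm_sfence */
#include <stdint.h>

/* Write a value, then write the containing cache line back to memory.
 * Unlike clflush/clflushopt, clwb may leave the line resident in the cache
 * in a clean state - whether that clean line is then favoured or penalised
 * by the replacement policy is exactly the question above. */
static void store_and_writeback(uint64_t *p, uint64_t v)
{
    *p = v;
    _mm_clwb(p);   /* write back the (now dirty) line, possibly keeping it cached */
    _mm_sfence();  /* order the write-back against later stores */
}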

Suggestions on solving a locking problem with the ARM 7 CCL Lisp compiler on the Raspberry Pi?

Background: CCL aka OpenMCL is a very nice, venerable, lightweight but fairly fast Common Lisp compiler. It's an excellent match for the RPi because it runs on 32-bit models, and isn't too memory intensive. In theory, unlike the heavier SBCL lisp compiler, it supports threads on 32-bit RPi. But it has a long-standing mutex bug.
However, this is an ARM machine language question, not a Lisp question. I'm hoping an ARM expert will read this and have an Aha! moment from a different context.
The problem is that CCL suffers a fatal flaw on the Raspberry Pi 2 and 3 (and probably 4), as well as other ARM boards: when using threading and locking it will fail with memory corruption. The threading failure has been known for years.
I believe I have isolated the issue further, to a locking failure: when a CCL thread grabs a lock (mutex) and checks whether it owns the lock, it sometimes turns out that another thread owns it. Threads seem to steal each other's locks, which would be fatal for garbage collection, among other things. It seems to me that the information that one core has taken control of a lock does not propagate to the other cores before they grab it themselves (a race condition). This bug does not happen on single-core RPis such as the Pi Zero.
I've explored this bug in this GitHub repo. The relevant function is (threadtest2), which spawns threads, takes locks, and checks lock ownership. I initially thought that the locking code might be missing a DMB instruction; DMB "ensures that the exclusive write is synchronized to all processors". So I put DMB instructions all over the locking code (though on closer inspection DMB was already there in a few spots, so the original compiler author had thought of this).
In detail, I put DMBs into just about every locking routine of arm-misc.lisp called from the futex-free version of %get-spin-lock, which is called by %lock-recursive-lock-ptr in ARM/l0-misc.lisp, with no luck. The low-level function in ARM/l0-misc.lisp is %ptr-store-fixnum-conditional. It doesn't use DMB, but uses the LDREX/STREX atomic-update instructions.
[edit] As user coredump points out below, DMB is indeed necessary on multi-core systems according to blogs and ARM docs, though there is some disagreement about how many places it should appear in: only after the STREX, or also before the LDREX.
Obviously, I'm not asking anyone to diagnose this compiler. My question is:
Does this lock-stealing behavior ring a bell? Has anyone else seen this kind of lock-stealing or race condition on ARM in another context, and have they found a solution? Is there something I'm missing about DMB, or is another instruction needed?
As an addendum, here is my annotation of the part where it might be failing, in ARM/l0-misc.lisp, function %ptr-store-fixnum-conditional - this is machine code in Lisp (LAP) format. I inserted some DMBs as shown in the comments below, and it didn't help.
;; this is the function used to grab the mutex, using ldrex/strex
;; to set a memory location atomically
(defarmlapfunction %ptr-store-fixnum-conditional
    ((ptr arg_x) (expected-oldval arg_y) (newval arg_z))
  (let ((address imm2)              ;; define some variables
        (actual-oldval imm1))
    (macptr-ptr address ptr)
    @again
    (DMB)   ;; my new DMB, according to Chen's blog (not ARM manual)
    ;; first, load word from memory with ldrex,
    ;; initializing atomic memory operation
    ;; and claiming control of this memory for this core
    (ldrex actual-oldval (:@ address))
    ;; if the actual-oldval is wrong, then give up on
    ;; this pointer store because the lock is taken,
    ;; (looping higher up in code until free)
    (cmp actual-oldval expected-oldval)
    (bne @done)
    ;;
    ;; 2nd part of exclusive memory access:
    ;; store newval into memory and put a flag into imm0
    (strex imm0 newval (:@ address))
    ;; if the exclusive store failed, another core messed
    ;; with memory, so loop for another ldrex/strex cycle
    (cmp imm0 (:$ 0))
    (bne @again)
    (DMB)   ;; my new DMB after conditional jump
    ;; success: the lock was obtained (and exclusive access
    ;; was cleared by strex)
    (mov arg_z actual-oldval)
    (bx lr)   ;; return to caller in case of good mutex grab
    @done
    ;; clear exclusive access if lock grab failed
    (clrex)
    (mov arg_z actual-oldval)
    (DMB)   ;; my new DMB. Maybe not needed?
    (bx lr)))  ;; return to caller in case of failed mutex grab
Addendum - Once again, I tried putting DMB around every LDREX/STREX, and it didn't help. I also tried putting a DMB into every %SET-xxx function, following the ARM docs on releasing mutexes, but this was harder to trace - I couldn't find where %%set-unsigned-long was defined, even after grepping the whole source tree, so I blindly stuffed a DMB before every STR instruction inside the %SET-xxx functions.
I believe that CCL uses the system-level futex on other platforms, and does its own custom locking only (?) on ARM, if that's another clue. Maybe the whole thing could be fixed by using the OS-supplied futex? Maybe no other system uses custom locks, so ARM is just the first (multi-core) system to show the breakage?
To see whether it helps, you can try adding a DMB instruction before the LDREX and another after the STREX. DMB is a memory barrier instruction; it ensures that the exclusive write is synchronized to all processors. The DMB instruction is described in the ARM Architecture Reference Manual.
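As a point of comparison (this is not CCL's code), the same conditional store can be expressed with C11 atomics; on ARMv7 a compiler will typically lower it to an LDREX/STREX loop plus the DMB barriers implied by the acquire/release ordering, which is essentially the placement being discussed above:

#include <stdatomic.h>
#include <stdint.h>

/* Same idea as %ptr-store-fixnum-conditional: if *ptr still holds
 * expected_oldval, store newval; either way, return the value actually seen.
 * The acq_rel ordering makes the compiler emit the barriers (DMB on ARMv7)
 * that publish the new lock owner to the other cores. */
static intptr_t store_fixnum_conditional(_Atomic intptr_t *ptr,
                                         intptr_t expected_oldval,
                                         intptr_t newval)
{
    intptr_t seen = expected_oldval;
    atomic_compare_exchange_strong_explicit(ptr, &seen, newval,
                                            memory_order_acq_rel,   /* on success */
                                            memory_order_acquire);  /* on failure */
    return seen;  /* caller treats seen == expected_oldval as "lock acquired" */
}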

Is it correct that a single-threaded process can use only a PCB without touching a TCB?

As titled.
Is there any OS implementation in which, when running a single-threaded program, the OS uses only a PCB (Process Control Block) to store all the related information? I heard somewhere that every OS creates a TCB (Thread Control Block) under a PCB even when running a single-threaded program.
Is it correct that a single-threaded process can use only a PCB without touching a TCB?
Four common possibilities are:
a) Threads are executable things, a process is just a container (that contains a virtual address space, at least one thread, file handles, etc). In this case it's likely that you'll have something like a PCB and TCB (even when the process only has one thread). Most modern operating systems are like this.
b) "Tasks" are executable things. Processes, threads and a whole pile of weird stuff that are neither (e.g. tasks that share file handles but don't share virtual address spaces) are emulated on top of the underlying "tasks". In this case PCB and TCB (Thread Control Block) don't make sense (you'd have a "Task Control Block" that's like everything merged together). Modern Linux is like this.
c) From the kernel's perspective, processes are executable things and threads don't exist/aren't supported (but may be emulated in user-space). In this case a TCB doesn't make sense. Old Unix systems (including old versions of Linux) were like this.
d) There's only one process and threads don't exist; and there's no PCB and no TCB. MS-DOS was like this.
Note that there are also uncommon possibilities.
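As an illustration of case (a), here is a hypothetical sketch in C of how such a kernel might lay the two structures out; every name and field is invented for illustration, not taken from any particular OS:

#include <stdint.h>

struct thread_control_block {
    int       tid;
    void     *kernel_stack;                  /* per-thread kernel stack */
    uint64_t  saved_registers[32];           /* per-thread CPU context */
    int       state;                         /* running, ready, blocked, ... */
    struct thread_control_block *next_in_process;
};

struct process_control_block {
    int       pid;
    void     *page_table_root;               /* the shared virtual address space */
    int       open_files[64];                /* shared file handles */
    struct thread_control_block *threads;    /* at least one TCB, even if the
                                                program never spawns a thread */
};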

Do instruction sets like x86 get updated? If so, how is backwards compatibility guaranteed?

How would an older processor know how to decode new instructions it doesn't know about?
New instructions use previously-unused opcodes, or other ways to find more "coding space" (e.g. prefixes that didn't previously mean anything for a given opcode).
How would an older processor know how to decode new instructions it doesn't know about?
It won't. A binary that wants to work on old CPUs as well as new ones has to either limit itself to a baseline feature set, or detect CPU features at run-time and set function pointers to select versions of a few important functions. (aka "runtime dispatching".)
x86 has a good mechanism (the cpuid instruction) for letting code query CPU features without any kernel support needed. Some other ISAs need CPU info hard-coded into the OS or detected via I/O accesses, so the only viable method is for the kernel to export the info to user-space in an OS-specific way.
Or if you're building from source on a machine with a newer CPU and don't care about old CPUs, you can use gcc -O3 -march=native to let GCC use all the ISA extensions the current CPU supports, making a binary that will fault on old CPUs. (e.g. x86 #UD (UnDefined instruction) hardware exception, resulting in the OS delivering a SIGILL or equivalent to the process.)
Or in some cases, a new instruction may decode as something else on old CPUs, e.g. x86 lzcnt decodes as bsr with an ignored REP prefix on older CPUs, because x86 has basically no unused opcodes left (in 32-bit mode). Sometimes this "decode as something else" is actually useful as a graceful fallback to allow transparent use of new instructions, notably pause = rep nop = nop on old CPUs that don't know about it. So code can use it in spin loops without checking CPUID.
-march=native is common for servers where you're setting things up to just run on that server, not making a binary to distribute.
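Here is a hedged sketch of that run-time dispatching pattern; the function names are invented, while __builtin_cpu_init / __builtin_cpu_supports are GCC/Clang extensions that read CPUID for you:

#include <stddef.h>

/* Two versions of the same hot function. In real code the AVX2 one would
 * live in a file compiled with -mavx2; trivial scalar bodies are used here
 * just to keep the sketch self-contained. */
static void sum_baseline(const float *a, size_t n, float *out) {
    float s = 0; for (size_t i = 0; i < n; i++) s += a[i]; *out = s;
}
static void sum_avx2(const float *a, size_t n, float *out) {
    float s = 0; for (size_t i = 0; i < n; i++) s += a[i]; *out = s;
}

/* Function pointer chosen once at startup, then called everywhere. */
static void (*sum_impl)(const float *, size_t, float *) = sum_baseline;

void init_dispatch(void)
{
    __builtin_cpu_init();                 /* populate the feature flags via cpuid */
    if (__builtin_cpu_supports("avx2"))
        sum_impl = sum_avx2;              /* fast path only on CPUs that have it */
}

(GCC can also generate this kind of dispatch boilerplate automatically with the target_clones function attribute.)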
Most of the time, the old processor will raise an "Undefined Instruction" exception, because the instruction is not defined on the old CPU.
In rarer cases, the instruction will execute as a different instruction. This happens when the new instruction is encoded via a mandatory prefix. As an example, PAUSE is encoded as REP NOP, so it executes as a plain NOP on older CPUs.
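A small sketch of that PAUSE case: the _mm_pause intrinsic emits rep nop, so the same binary spins politely on new CPUs and simply executes nop on CPUs that predate PAUSE, with no CPUID check needed. The flag type is just for illustration:

#include <immintrin.h>   /* _mm_pause */
#include <stdatomic.h>

/* Spin until another thread sets the flag. */
void spin_wait(const atomic_int *flag)
{
    while (atomic_load_explicit(flag, memory_order_acquire) == 0)
        _mm_pause();     /* rep nop: a hint to the CPU that this is a spin loop */
}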

Why aren't buffers auto-flushed by default?

I recently had the privilege of setting $| = 1; inside my Perl script to help it talk faster with another application across a pipe.
I'm curious as to why this is not the default setting. In other words, what do I lose out on if my buffer gets flushed straightaway?
Writing to a file descriptor is done via system calls, and system calls are slow.
Buffering a stream and flushing it only once some amount of data has been written is a way to save some system calls.
Benchmark it and you will understand.
Buffering depends on the device type of the output handle: ttys are line-buffered; pipes, sockets, and disk files are block-buffered.
This is just basic programming. It’s not a Perl thing.
The fewer times the I/O buffer is flushed, the faster your code is in general (since it doesn't have to make a system call as often). So your code spends more time waiting for I/O by enabling auto-flush.
In a purely network-I/O-driven application, enabling auto-flush obviously makes more sense. However, in the most common use cases, line-buffered I/O (Perl's default for TTYs) lets the program flush the buffer less often and spend more time doing CPU work. The average user wouldn't notice the difference on a terminal or in a file.
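This isn't Perl-specific, so here is a hedged sketch of the same trade-off in C stdio terms; running it with stdout redirected to a file, with and without an argument, makes the cost of the extra system calls visible in the timing:

#include <stdio.h>

int main(int argc, char **argv)
{
    /* With any argument, make stdout unbuffered - roughly what Perl's $| = 1
     * asks for. setvbuf must be called before any output on the stream. */
    if (argc > 1)
        setvbuf(stdout, NULL, _IONBF, 0);

    /* When stdout is a pipe or a file (block-buffered by default), these
     * writes are coalesced into a handful of write() system calls;
     * unbuffered, each printf costs at least one system call. */
    for (int i = 0; i < 100000; i++)
        printf("line %d\n", i);
    return 0;
}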

What is INT 21h?

Inspired by this question
How can I force GDB to disassemble?
I wondered about INT 21h as a concept. Now, I have some very rusty knowledge of the internals, but not so many details. I remember that on the C64 you had regular interrupts and Non-Maskable Interrupts, but my knowledge stops there. Could you please give me a clue? Is it a DOS-related strategy?
From here:
A multipurpose DOS interrupt used for various functions including reading the keyboard and writing to the console and printer. It was also used to read and write disks using the earlier File Control Block (FCB) method.
DOS can be thought of as a library used to provide a files/directories abstraction for the PC (and a bit more). int 21h is a simple hardware "trick" that makes it easy to call code from this library without knowing in advance where it will be located in memory. Alternatively, you can think of this as the way to utilise the DOS API.
Now, the topic of software interrupts is a complex one, partly because the concepts evolved over time as Intel added features to the x86 family, while trying to remain compatible with old software. A proper explanation would take a few pages, but I'll try to be brief.
The main question is whether you are in real mode or protected mode.
Real mode is the simple, "original" mode of operation for the x86 processor. This is the mode that DOS runs in (when you run DOS programs under Windows, a real mode processor is virtualised, so within it the same rules apply). The currently running program has full control over the processor.
In real mode, there is a vector table that tells the processor which address to jump to for every interrupt from 0 to 255. This table is populated by the BIOS and DOS, as well as device drivers, and sometimes programs with special needs. Some of these interrupts can be generated by hardware (e.g. by a keypress). Others are generated by certain software conditions (e.g. divide by 0). Any of them can be generated by executing the int n instruction.
Programs can set/clear the "enable interrupts" flag; this flag affects hardware interrupts only and does not affect int instructions.
The DOS designers chose to use interrupt number 21h to handle DOS requests - the number is of no real significance: it was just an unused entry at the time. There are many others (number 10h is a BIOS-installed interrupt routine that deals with graphics, for instance). Also note that all this is for IBM PC compatibles only. x86 processors in say embedded systems may have their software and interrupt tables arranged quite differently!
Protected mode is the complex, "security-aware" mode that was introduced in the 286 processor and much extended on the 386. It provides multiple privilege levels. The OS must configure all of this (and if the OS gets it wrong, you have a potential security exploit). User programs are generally confined to a "minimal privilege" mode of operation, where trying to access hardware ports, or changing the interrupt flag, or accessing certain memory regions, halts the program and allows the OS to decide what to do (be it terminate the program or give the program what it seems to want).
Interrupt handling is made more complex. Suffice it to say that generally, if a user program executes a software interrupt, the interrupt number is not used as a vector into the interrupt table. Rather, a general protection exception is generated, and the OS handler for said exception may (if the OS is designed this way) work out what the process wants and service the request. I'm pretty sure Linux and Windows have in the past (if not currently) used this sort of mechanism for their system calls. But there are other ways to achieve this, such as the SYSENTER instruction.
Ralf Brown's Interrupt List contains a lot of information on which interrupt does what. int 21, like all the others, supports a wide range of functionality depending on register values.
A non-HTML version of Ralf Brown's list is also available.
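To make the real-mode vector table concrete: entry n occupies the 4 bytes at linear address n*4, stored as a 16-bit offset followed by a 16-bit segment. A small sketch of that lookup (the struct and function names are mine; in real mode the table itself starts at linear address 0, which is not memory you can touch from a protected-mode OS):

#include <stdint.h>

/* One entry of the real-mode interrupt vector table (IVT), which occupies
 * linear addresses 0000h-03FFh: offset first, then segment. */
struct ivt_entry {
    uint16_t offset;
    uint16_t segment;
};

/* Linear address of the handler installed for interrupt number n.
 * For int 21h this is whatever DOS (or a TSR that hooked it) put there. */
static uint32_t handler_linear_address(const struct ivt_entry *ivt, uint8_t n)
{
    const struct ivt_entry *e = &ivt[n];               /* entry n sits at n*4 */
    return ((uint32_t)e->segment << 4) + e->offset;    /* segment*16 + offset */
}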
The INT instruction is a software interrupt. It causes a jump to a routine pointed to by an interrupt vector, which is a fixed location in memory. The advantage of the INT instruction is that it is only 2 bytes long, as opposed to maybe 6 for a JMP, and that it can easily be re-directed by modifying the contents of the interrupt vector.
Int 0x21 is an x86 software interrupt - basically that means there is an interrupt table at a fixed point in memory listing the addresses of software interrupt functions. When an x86 CPU receives the interrupt opcode (or otherwise decides that a particular software interrupt should be executed), it references that table to execute a call to that point (the function at that point must use iret instead of ret to return).
It is possible to remap Int 0x21 and other software interrupts (even inside DOS, though this can have negative side effects). One interesting software interrupt to map or chain is Int 0x1C (or 0x08 if you are careful), which is the system tick interrupt, called 18.2 times every second. This can be used to create "background" processes, even in single-threaded real mode (the real-mode process will be interrupted 18.2 times a second to call your interrupt function).
On the DOS operating system (or a system providing some DOS emulation, such as the Windows console), Int 0x21 is mapped to what is effectively the DOS operating system's main "API". By providing different values in the AH register, different DOS functions can be executed, such as opening a file (AH=0x3D) or printing to the screen (AH=0x09).
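For flavour, here is roughly what calling that API looks like from a 16-bit DOS C compiler (Turbo C / Borland / Watcom style) via the int86 helper in <dos.h>; function 02h prints a single character, which keeps the sketch free of segment bookkeeping:

#include <dos.h>    /* union REGS, int86 - 16-bit DOS compilers only */

int main(void)
{
    union REGS regs;

    regs.h.ah = 0x02;           /* DOS function 02h: write character to stdout */
    regs.h.dl = 'A';            /* the character goes in DL */
    int86(0x21, &regs, &regs);  /* raise INT 21h; DOS services the request */

    regs.h.ah = 0x4C;           /* DOS function 4Ch: terminate with return code */
    regs.h.al = 0;
    int86(0x21, &regs, &regs);  /* does not return */
    return 0;
}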
This is from the great The Art of Assembly Language Programming about interrupts:
On the 80x86, there are three types of events commonly known as
interrupts: traps, exceptions, and interrupts (hardware interrupts).
This chapter will describe each of these forms and discuss their
support on the 80x86 CPUs and PC compatible machines.
Although the terms trap and exception are often used synonymously, we
will use the term trap to denote a programmer initiated and expected
transfer of control to a special handler routine. In many respects, a
trap is nothing more than a specialized subroutine call. Many texts
refer to traps as software interrupts. The 80x86 int instruction is
the main vehicle for executing a trap. Note that traps are usually
unconditional; that is, when you execute an int instruction, control
always transfers to the procedure associated with the trap. Since
traps execute via an explicit instruction, it is easy to determine
exactly which instructions in a program will invoke a trap handling
routine.
Chapter 17 - Interrupt Structure and Interrupt Service Routines
(Almost) the whole DOS interface was made available as INT 21h commands, with parameters in the various registers. It's a little trick, using a built-in hardware table to jump to the right code. Also, INT 33h was for the mouse.
It's a "software interrupt"; so not a hardware interrupt at all.
When an application invokes a software interrupt, that's essentially the same as making a subroutine call, except that (unlike a subroutine call) the caller doesn't need to know the exact memory address of the code it's invoking.
System software (e.g. DOS and the BIOS) expose their APIs to the application as software interrupts.
The software interrupt is therefore a kind of dynamic-linking.
Actually, there are a lot of concepts here. Let's start with the basics.
An interrupt is a means of requesting attention from the CPU: it interrupts the current program flow, jumps to an interrupt handler (ISR - Interrupt Service Routine), does some work (usually in the OS kernel or a device driver) and then returns.
What are some typical uses for interrupts?
Hardware interrupts: A device requests attention from the CPU by issuing an interrupt request.
CPU Exceptions: If some abnormal CPU condition happens, such as a division by zero, a page fault, ... the CPU jumps to the corresponding interrupt handler so the OS can do whatever it has to do (send a signal to a process, load a page from swap and update the TLB/page table, ...).
Software interrupts: Since an interrupt ends up calling the OS kernel, a simple way to implement system calls is to use interrupts. But you don't need to: on x86 you could use a call instruction through some structure (some kind of TSS, IIRC), and on newer x86 there are the SYSCALL / SYSENTER instructions.
CPUs decide where to jump by looking at a table (exception vectors, interrupt vectors, the IVT in x86 real mode, the IDT in x86 protected mode, ...). Some CPUs have a single vector for hardware interrupts, another one for exceptions and so on, and the ISR has to do some work to identify the originator of the interrupt. Others have lots of vectors, and jump directly to very specific ISRs.
x86 has 256 interrupt vectors. On original PCs, these were divided into several groups:
00-04 CPU exceptions, including NMI. With later CPUs (80186, 286, ...), this range expanded, overlapping with the following ranges.
08-0F These are hardware interrupts, usually referred to as IRQ0-7. The PC-AT added IRQ8-15.
10-1F BIOS calls. Conceptually, these can be considered system calls, since the BIOS is the part of DOS that depends on the concrete machine (that's how it was defined in CP/M).
20-2F DOS calls. Some of these are multiplexed and offer a multitude of functions. The main one is INT 21h, which offers most of the DOS services.
30-FF The rest, for use by external drivers and user programs.
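To tie this back to the system-call remark above: 32-bit Linux historically exposed its kernel entry point as software interrupt 80h, so a user program could invoke write much the way a DOS program invokes INT 21h. A minimal sketch using GCC inline assembly (32-bit x86 Linux only; modern code would go through libc or SYSENTER/SYSCALL instead):

/* eax = system call number, ebx/ecx/edx = the first three arguments. */
static long sys_write_int80(int fd, const void *buf, unsigned long len)
{
    long ret;
    __asm__ volatile ("int $0x80"
                      : "=a" (ret)
                      : "a" (4),       /* __NR_write on 32-bit x86 */
                        "b" (fd),
                        "c" (buf),
                        "d" (len)
                      : "memory");
    return ret;    /* byte count on success, negative errno on failure */
}

int main(void)
{
    sys_write_int80(1, "hello via int 0x80\n", 19);
    return 0;
}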