How is the CR8 register used to prioritize interrupts in an x86-64 CPU?

I'm reading the Intel documentation on control registers, but I'm struggling to understand how the CR8 register is used. To quote the docs (2-18 Vol. 3A):
Task Priority Level (bit 3:0 of CR8) — This sets the threshold value
corresponding to the highest-priority interrupt to be blocked. A
value of 0 means all interrupts are enabled. This field is available
in 64-bit mode. A value of 15 means all interrupts will be disabled.
I have 3 quick questions, if you don't mind:
So bits 3 thru 0 of CR8 make up those 16 levels of priority values. But priority of what? A running "thread", I assume, correct?
But what is that priority value in CR8 compared to when an interrupt is received to see if it has to be blocked or not?
When an interrupt is blocked, what does it mean? Is it "delayed" until later time, or is it just discarded, i.e. lost?

CR8 indicates the current priority of the CPU. When an interrupt is pending, bits 7:4 of the interrupt vector number (its priority class) are compared to CR8. If the vector's priority class is greater, the interrupt is serviced; otherwise it is held pending until CR8 is set to a lower value.
Assuming the APIC is in use, it has an IRR (Interrupt Request Register) with one bit per interrupt vector number. When that bit is set, the interrupt is pending. It can stay that way forever.
When an interrupt arrives, it is ORed into the IRR. If the interrupt is already pending (that is, the IRR bit for that vector is already set), the new interrupt is merged with the prior one. (You could say it is dropped, but I don't think of it that way; instead, I say the two are combined into one.) Because of this merging, interrupt service routines must be designed to process all the work that is ready, rather than expecting a distinct interrupt for each unit of work.
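As a rough illustration of the two points above (the comparison against CR8 and the merging in the IRR), here is a toy Python model. It is only a sketch: the class and method names are made up for this example, and it deliberately ignores the in-service register and everything else a real local APIC does.

```python
class LocalApicModel:
    """Toy model of the CR8 check and IRR merging (not real APIC behaviour)."""

    def __init__(self):
        self.cr8 = 0      # task priority level, 0..15
        self.irr = set()  # pending vectors; hardware keeps one bit per vector

    def raise_interrupt(self, vector):
        # ORing into the IRR: a second arrival of an already-pending
        # vector changes nothing, so the two requests merge into one.
        self.irr.add(vector)

    def service_next(self):
        # A vector is deliverable only if its priority class
        # (bits 7:4 of the vector number) is greater than CR8.
        ready = [v for v in self.irr if (v >> 4) > self.cr8]
        if not ready:
            return None  # everything pending stays pending
        vector = max(ready)  # highest vector is delivered first
        self.irr.discard(vector)
        return vector
```

With CR8 set to 4, raising vector 0x41 twice leaves one pending bit that is not delivered; lowering CR8 to 0 delivers it exactly once.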

Another related point is that Windows (and I assume Linux) tries to keep the IRQ level of a CPU as low as possible at all times. Interrupt service routines do as little work as possible at their elevated hardware interrupt level and then queue a deferred procedure call (DPC) to do the rest of their work at DPC IRQ level. The DPC will normally be serviced immediately unless another IRQ has arrived, because DPCs run at a higher priority than normal processes.
Once a CPU starts executing a DPC, it executes all the DPCs in its per-CPU DPC queue before returning the CPU IRQL to zero to allow normal threads to resume.
The advantage of doing it this way is that an incoming hardware IRQ of any priority can interrupt a DPC and get its own DPC on the queue almost immediately, so it never gets missed.

I should also try to explain (as I understand it 😁) the difference between the IRQ level of a CPU and the priority of an IRQ.
Before Control Register 8 became available with x64, the CPU had no notion of an IRQ level.
The designers of Windows NT decided that every logical processor in a system should have a NOTIONAL IRQ level that would be stored in a data structure called a processor control block (PCB) for each CPU. They decided there should be 32 levels, for no reason I know of 😁.
Software and hardware interrupts are also assigned a level so if they are above the level that the CPU has assigned then they are allowed to continue.
Windows does NOT make use of the interrupt priority assigned by the PIC/APIC hardware, instead it uses the interrupt mask bits in them. The various pins are assigned a vector number and then they get a level.
When Windows raises the IRQL of a CPU in its PCB, it also reprograms the interrupt mask of the PIC/APIC, but not straight away.
Every interrupt that occurs causes the Windows trap dispatcher to execute and compare the IRQ level with the CPU IRQL. If the IRQ level is higher, the interrupt goes ahead; if not, only THEN does Windows reprogram the mask and return to the executing thread.
The reason is that reprogramming the PIC takes time, and if no lower-level IRQ comes in, Windows can save itself a job.
On x64 there IS CR8 and I am still looking at how that works.
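To make the lazy scheme described above (for the pre-CR8 case) a bit more concrete, here is a small Python model. It only mirrors the description in this answer and is not taken from Windows itself; `LazyIrqlCpu`, `hw_mask_level` and the other names are hypothetical.

```python
class LazyIrqlCpu:
    """Toy model of lazy IRQL as described in the answer (names are made up)."""

    def __init__(self):
        self.irql = 0           # notional level kept in the per-CPU block
        self.hw_mask_level = 0  # level actually programmed into the PIC/APIC
        self.deferred = []      # interrupts held back until the IRQL drops

    def raise_irql(self, level):
        # Cheap path: only the software value changes, no hardware access.
        self.irql = level

    def on_interrupt(self, level):
        # The trap dispatcher compares the interrupt's level with the IRQL.
        if level > self.irql:
            return "serviced"
        # Too low: only now pay for reprogramming the hardware mask.
        self.hw_mask_level = self.irql
        self.deferred.append(level)
        return "deferred"
```

In this model, raising the IRQL to 5 leaves the hardware mask untouched until an interrupt at level 5 or below actually shows up.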

Related

How many clock cycles do the stages of a simple 5 stage processor take?

A 5 stage pipelined CPU has the following sequence of stages:
IF – Instruction fetch from instruction memory.
RD – Instruction decode and register read.
EX – Execute: ALU operation for data and address computation.
MA – Data memory access – for write access, the register read at RD state is
used.
WB – Register write back.
Now I know that an instruction fetch, for example, is from memory which can take 4 cycles (L1 cache) or up to ~150 cycles (RAM). However, in every pipelining diagram, I see something like this, where each stage is assigned a single cycle.
Now, I know of course real processors have complex pipelines with over 19 stages and every architecture is different. However, am I missing something here? With memory accesses in IF and MA, can this 5 stage pipeline take dozens of cycles?
Classic 5-stage RISC pipelines are designed around single-cycle latency L1d / L1i, allowing 1 IPC (instruction per clock) in code without cache misses or other stalls. i.e. the hopefully common / good case. Every stage must have a worst-case critical path latency of 1 cycle, or trigger a stall.
Clock speeds were lower back then (even relative to 1 gate delay) so you could get more done in a single cycle, and the caches were simpler, often 8k direct-mapped, single port, sometimes even virtually tagged (VIVT) so TLB lookup wasn't part of the access latency.
First-gen MIPS, the R2000 (and R3000), had on-chip controllers (see footnote 1) for its direct-mapped PIPT split L1i/L1d write-through caches, but the actual tags+data were off-chip, from 4K to 64K. Achieving the required single-cycle latency with this setup limited clock speeds to 15 MHz (R2000) or 33 MHz (R3000) with available SRAM technology. The TLB was fully on-chip.
vs. modern Intel/AMD using 32kiB 8-way VIPT L1d/L1i caches, with at least 2 read + 1 write port for L1d, at such high clock speed that access latency is 4 cycles best-case on Intel SnB-family, or 5 cycles including address-generation. Modern CPUs have larger TLBs, too, which also adds to the latency. This is ok when out-of-order execution and/or other techniques can usually hide that latency, but classic 5-stage RISCs just had one single pipeline, not separately pipelined memory access. See also Cycles/cost for L1 Cache hit vs. Register on x86? for some more links about how performance on modern superscalar out-of-order exec x86 CPUs differs from classic-RISC CPUs.
If you wanted to raise clock speeds for the same transistor performance (gate delay), you'd divide the fetch and mem stages into multiple pipeline stages (i.e. pipeline them more heavily), if cache access was even on the critical path (i.e. if cache access could no longer be done in one clock period). The downside of lengthening the pipeline is raising branch latency (cost of a mispredict, and the amount of latency a correct prediction has to hide), as well as raising total transistor cost.
(Note that classic-RISC pipelines do address-generation in the EX stage, using the ALU there to calculate register + immediate, the only addressing mode supported by most RISC ISAs built around such a pipeline. So load-use latency is effectively 2 cycles for pointer-chasing, due to the load delay for forwarding back to EX.)
On a cache miss, the entire pipeline would just stall: those early pipelines lacked scoreboarding of loads to allow hit-under-miss or miss-under-miss for loads from L1d cache.
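To put numbers on the one-cycle-per-stage picture and the whole-pipeline stall on a miss, here is a minimal Python sketch. It assumes an ideal in-order pipeline where every stall cycle simply adds to the total; `pipeline_cycles` is a name made up for this example.

```python
def pipeline_cycles(instructions, stages=5, stall_cycles=0):
    """Total cycles for an ideal in-order pipeline.

    The first instruction needs `stages` cycles to drain through the
    pipeline; each later one completes one cycle after its predecessor.
    Any stall (e.g. a cache miss freezing every stage) adds directly.
    """
    if instructions == 0:
        return 0
    return stages + (instructions - 1) + stall_cycles
```

For example, 100 instructions with no stalls take 104 cycles, so throughput approaches 1 IPC, while a single miss that freezes the pipeline for 20 cycles pushes that to 124.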
MIPS R2000 did have a 4-entry store buffer to decouple execution from cache-miss stores. (Apparently built from 4 separate R2020 write-buffer chips, according to wikipedia.) The LSI datasheet says the write-buffer chips were optional, but with write-through caches, every store has to go to DRAM and would create a stall without write buffering. Most modern CPUs use write-back caches, allowing multiple writes of the same line without creating DRAM traffic.
Also remember that CPU speed wasn't as high relative to memory for early CPUs like MIPS R2000, and single-core machines didn't need an interconnect between cores and memory controllers. (Although they maybe did have a frontside bus to a memory controller on a separate chip, a "northbridge".) But anyway, back then a cache miss to DRAM cost a lot fewer core clock cycles. It sucks to fully stall on every miss, but it wasn't like modern CPUs where it can be in the 150 to 350 cycles range (70 ns * 5 GHz). DRAM latency hasn't improved nearly as much as bandwidth and CPU clocks. See also http://www.lighterra.com/papers/modernmicroprocessors/ which has a "memory wall" section, and Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? re: why modern CPUs need multi-level caches as the mismatch between CPU speed and memory latency has grown.
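The "70 ns * 5 GHz" figure above is just latency multiplied by clock frequency. A tiny Python helper makes the scaling explicit; `miss_cycles` is a made-up name, and holding the latency at 70 ns for a 15 MHz clock is purely an illustration of the clock effect, not a claim about DRAM of that era.

```python
def miss_cycles(latency_ns, clock_ghz):
    """Core clock cycles spent waiting for one memory access."""
    # ns * (cycles per ns) = cycles; rounded to whole cycles.
    return round(latency_ns * clock_ghz)


modern = miss_cycles(70, 5.0)        # the case quoted in the text
slow_clock = miss_cycles(70, 0.015)  # same latency at a 15 MHz clock
```

Here `modern` comes out at 350 cycles, while the same latency at a 15 MHz clock is about one cycle, which is why stalling on every miss was far less painful on early CPUs.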
Later CPUs allowed progressively more memory-level parallelism by doing things like allowing execution to continue after a non-faulting load (successful TLB lookup), only stalling when you actually read a register that was last written by a load, if the load result isn't ready yet. This allows hiding load latency on a still-short and fairly simple in-order pipeline, with some number of load buffers to track outstanding loads. And with register renaming + OoO exec, the ROB size is basically the "window" over which you can hide cache-miss latency: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
Modern x86 CPUs even have buffers between pipeline stages in the front-end to hide or partially absorb fetch bubbles (caused by L1i misses, decode stalls, low-density code, e.g. a jump to another jump, or even just failure to predict a simple always-taken branch. i.e. only detecting it when it's eventually decoded, after fetching something other than the correct path. That's right, even unconditional branches like jmp foo need some prediction for the fetch stage.)
https://www.realworldtech.com/haswell-cpu/2/ has some good diagrams. Of course, Intel SnB-family and AMD Zen-family use a decoded-uop cache because x86 machine code is hard to decode in parallel, so often they can bypass some of that front-end complexity, effectively shortening the pipeline. (wikichip has block diagrams and microarchitecture details for Zen 2.)
See also Modern Microprocessors: A 90-Minute Guide! re: modern CPUs and the "memory wall": the increasing mismatch between DRAM latency and core clock cycle time. DRAM latency has only dropped a little bit (in absolute nanoseconds) as bandwidth has continued to climb tremendously in recent years.
Footnote 1: MIPS R2000 cache details:
An R2000 datasheet shows the D-cache was write-through, and various other interesting things.
According to a 1992 usenet message from an SGI engineer, the control logic just sends 18 index bits, receiving a word of data + 8 tags bits to determine hit or not. The CPU is oblivious to the cache size; you connect up the right number of index lines to SRAM address lines. (So I guess a line-size of one 4-byte word?)
You have to use at least 10 index bits because the tag is only 20 bits wide, and you need tag+index+2(byte-in-word) to be 32, the physical address-space size. That sets a minimum cache size of 4K.
20 bits of tag for every 32 bits of data is very inefficient. With a larger cache, fewer tag bits are actually needed, since more of the address is used up as part of the index. But Paul Ries posted that R2000/R3000 does not support comparing fewer tag bits. IDK if you could wire up some of the address output lines to the tag input lines, to generate matching bits instead of storing them in SRAMs.
A 32-byte cache line would still only need 20-bit tags (at most), but would have one tag per 8 words, a factor of 8 improvement in tag overhead. CPUs with larger caches, especially L2 caches, would definitely want to use larger line sizes.
But you're probably more likely to get conflict misses with fewer larger lines, especially with a direct-mapped cache. And the memory bus can still be busy filling a previous line when you encounter another miss, even if you have critical-word-first / early-restart so the miss latency wasn't worse if the memory bus was idle to start with.
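The tag arithmetic in this footnote can be checked with a few lines of Python. This is only a sketch of the numbers quoted above (32-bit physical addresses, 20-bit tags, 4-byte words); the function names are invented for the example.

```python
ADDRESS_BITS = 32      # physical address width
TAG_BITS = 20          # fixed tag width on R2000/R3000, per the text
BYTE_OFFSET_BITS = 2   # byte within a 4-byte word


def min_index_bits():
    # tag + index + byte offset must cover the whole address.
    return ADDRESS_BITS - TAG_BITS - BYTE_OFFSET_BITS


def min_cache_bytes():
    # One 4-byte word per index value.
    return (1 << min_index_bits()) * 4


def tag_overhead(line_bytes):
    """Tag bits stored per data bit for a given line size."""
    return TAG_BITS / (line_bytes * 8)
```

`min_cache_bytes()` gives 4096, matching the 4K minimum, and `tag_overhead(4) / tag_overhead(32)` gives the factor of 8 mentioned for 32-byte lines.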

How to run a periodic thread at a high frequency (> 100 kHz) on a Cortex-M3 microcontroller in an RTOS?

I'm implementing a high-frequency (> 100 kHz) data acquisition system with an STM32F107VC microcontroller. It uses the SPI peripheral to communicate with a high-frequency ADC chip. I have to use an RTOS. How can I do this?
I have tried FreeRTOS, but its maximum tick frequency is 1000 Hz, so I can't run a thread, for example, every 1 µs with FreeRTOS. I also tried Keil RTX5, whose tick frequency can be up to 1 MHz, but I read somewhere that it is not recommended to set the tick frequency high because it increases the overall context-switching time. So what should I do?
Thanks.
You do not want to run a task at this frequency. As you mentioned, context switches will kill the performance. This is horribly inefficient.
Instead, you want to use buffering, interrupts and DMA. Since it's a high frequency ADC chip, it probably has an internal buffer of its own. Check the datasheet for this. If the chip has a 16 samples buffer, a 100kHz sampling will only need processing at 6.25kHz. Now don't use a task to process the samples at 6.25kHz. Do the receiving in an interrupt (timer or some signal), and the interrupt should only fill a buffer, and wake up a task for processing when the buffer is full (and switch to another buffer until the task has finished). With this you can have a task that runs only every 10ms or so. An interrupt is not a context switch. On a Cortex-M3 it will have a latency of around 12 cycles, which is low enough to be negligible at 6.25kHz.
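The structure described above (the interrupt only stores samples and hands over a full buffer, then switches to the other one) can be modelled in a few lines of Python. This is a host-side sketch, not firmware: `PingPongBuffer` and `interrupt_rate_hz` are made-up names, and the `ready_blocks` list stands in for whatever RTOS queue or semaphore would wake the task.

```python
def interrupt_rate_hz(sample_rate_hz, adc_fifo_depth):
    # One interrupt drains a whole ADC FIFO worth of samples.
    return sample_rate_hz / adc_fifo_depth


class PingPongBuffer:
    """Two buffers: the ISR fills one while the task processes the other."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.buffers = [[], []]
        self.active = 0
        self.ready_blocks = []  # stand-in for an RTOS queue to the task

    def isr_store(self, sample):
        # The ISR does nothing except store the sample...
        buf = self.buffers[self.active]
        buf.append(sample)
        if len(buf) == self.block_size:
            # ...and, when the block is full, hand it over and swap.
            self.ready_blocks.append(buf)
            self.active ^= 1
            self.buffers[self.active] = []
```

With a 16-sample ADC FIFO, `interrupt_rate_hz(100_000, 16)` gives 6250.0, and a block size of 1000 samples means the processing task is woken every 10 ms.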
If your ADC chip doesn't have a buffer (but I doubt that), you may be ok with a 100kHz interrupt, but put as little code as possible inside.
A better solution is to use a DMA if your MCU supports that. For example, you can setup a DMA to receive from the SPI using a timer as a request generator. Depending on your case it may be impossible or tricky to configure, but a working DMA means that you can receive a large buffer of samples without any code running on your MCU.
"I have to use an RTOS."
No way. If it's a requirement by your boss or client, run away from the project fast. If that's not possible, communicate your concerns in writing now to save your posterior when the reasons of failure will be discussed. If it's your idea, then reconsider now.
The maximum system clock speed of the STM32F107 is 36 MHz (72 if there is an external HSE quartz), meaning that there are only 360 to 720 system clock cycles between the ticks coming at 100 kHz. The RTX5 warning is right, a significant amount of this time would be required for task switching overhead.
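The cycle budget above is a simple division, sketched here in Python. The function names are invented, and the context-switch cost passed to `switch_overhead_fraction` is a hypothetical parameter, not a measured figure for any RTOS.

```python
def cycles_between_ticks(clock_hz, tick_hz):
    """System clock cycles available between two consecutive ticks."""
    return clock_hz // tick_hz


def switch_overhead_fraction(clock_hz, tick_hz, switch_cycles):
    """Share of each tick period eaten by one task switch.

    `switch_cycles` is an assumed cost supplied by the caller.
    """
    return switch_cycles / cycles_between_ticks(clock_hz, tick_hz)
```

`cycles_between_ticks(36_000_000, 100_000)` is 360 and the 72 MHz case gives 720. If a switch cost, say, 180 cycles (a made-up number), a quarter of the 72 MHz budget would go to scheduling alone.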
It is possible to have a timer interrupt at 100 kHz, and do some simple processing in the interrupt handler (don't even think about using HAL), but I'd recommend investigating first whether it's really necessary to run code every 10 µs, or whether it is possible to offload something that it would do to the DMA or timer hardware.
Since you only have a few hundred cycles (instructions) between inputs, the typical solution is to use an interrupt to be alerted that data is available, and then have the interrupt handler put the data somewhere so you can process it at your leisure. Of course, if the data comes in continuously at that rate, you may be in trouble, with no time for actual processing. Depending on how much data is coming in and how frequently, a simple ring buffer may be sufficient. If the amount of data is relatively large (how large is large? Consider that it takes more than one CPU cycle to do a memory access, and it takes 2 memory accesses for each datum that comes in), then using DMA as @Elderbug suggested is a great solution, as that consumes the minimal amount of CPU cycles.
There is no need to set the RTOS tick to match the data acquisition rate - the two are unrelated. And to do so would be a very poor and ill-advised solution.
The STM32 has DMA capability for most peripherals including SPI. You need to configure the DMA and SPI to transfer a sequence of samples directly to memory. The DMA controller has full and half transfer interrupts, and can cycle a provided buffer so that when it is full, it starts again from the beginning. That can be used to "double buffer" the sample blocks.
So for example if you use a DMA buffer of say 256 samples and sample at 100Ksps, you will get a DMA interrupt every 1.28ms independent of the RTOS tick interrupt and scheduling. On the half-transfer interrupt the first 128 samples are ready for processing, on the full-transfer, the second 128 samples can be processed, and in the 1.28ms interval, the processor is free to do useful work.
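The timing of the half-transfer and full-transfer interrupts in that example can be checked with a short Python simulation. It is a sketch of the arithmetic only and does not use any STM32 API; `dma_events` and `interrupt_interval_ms` are hypothetical names.

```python
def interrupt_interval_ms(buffer_len, sample_rate_hz):
    # An interrupt fires after every half buffer of samples.
    return (buffer_len / 2) * 1000 / sample_rate_hz


def dma_events(total_samples, buffer_len=256):
    """List (sample_count, kind, start, end) for a circular DMA buffer.

    `start:end` is the slice of the buffer that is safe to process
    while the DMA keeps writing into the other half.
    """
    half = buffer_len // 2
    events = []
    for n in range(1, total_samples + 1):
        pos = n % buffer_len
        if pos == half:
            events.append((n, "half", 0, half))
        elif pos == 0:
            events.append((n, "full", half, buffer_len))
    return events
```

For 256 samples at 100 ksps the interval comes out at 1.28 ms, and the events alternate between the first and second 128 samples.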
Rather than processing all the block data in the interrupt handler (which would not in any case be possible if the processing were non-deterministic or blocking, such as writing it to a file system), you might for example send the samples in blocks via a message queue to a task context that performs the less deterministic processing.
Note that none of this relies on the RTOS tick - the scheduler will run after any interrupt if that interrupt calls a scheduling function such as posting to a message queue. Synchronising actions to an RTOS clock running asynchronously to the triggering event (i.e. polling) is not a good way to achieve highly deterministic real-time response and is a particularly poor method for signal acquisition, which requires a jitter free sampling interval to avoid false artefacts in the signal from aperiodic sampling.
Your assumption that you need to solve this problem by an inappropriately high RTOS tick rate is to misunderstand the operation of the RTOS, and will probably only work if your processor is doing no other work beyond sampling data - in which case you might not need an RTOS at all, but it would not be a very efficient use of the processor.

STM32 SPI bandwidth evaluation procedure

I'm testing the SPI capabilities of STM32H7. For this I'm using the SPI examples provided in STM32CubeH7 on 2 Nucleo-H743ZI boards. I may not keep this code in my own development; right now the goal is to understand how SPI works and what bandwidth I can get in the different modes (with DMA, with cache enabled or not, etc.).
I'd like to share the figures I've computed, as they don't seem very high. In the example, if I understood correctly, the CPU runs @ 400 MHz and the SPI bus frequency is 100 MHz.
For polling mode I've measured the number of cycles of the call to function HAL_SPI_TransmitReceive.
For DMA I've measured between call to HAL_SPI_TransmitReceive_DMA and call to the transfer complete callback.
Measurements of cycles were made with SysTick clocked on the internal clock. Since there is no low-power usage, it should be accurate.
I've just modified ST's examples to send a buffer of 1KB.
I get around 200,000 CPU cycles in polling mode, which means around 2 MB/s.
And around 3 MB/s in DMA mode.
Since the SPI clock runs at 100 MHz I would have expected much more, especially in DMA mode. What do you think? Is there something wrong in my test procedure?
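For reference, the conversion from measured cycles to throughput can be written out in Python. This sketch assumes that 1KB means 1024 bytes and that the 100 MHz figure really is the SCK rate on the wire (one bit per clock); both are assumptions, and the function names are made up.

```python
def throughput_bytes_per_s(num_bytes, cpu_cycles, cpu_hz):
    """Bytes per second for a transfer that took `cpu_cycles` cycles."""
    return num_bytes * cpu_hz / cpu_cycles


def spi_peak_bytes_per_s(sck_hz, bits_per_byte=8):
    """Upper bound if every SCK cycle carried one data bit."""
    return sck_hz / bits_per_byte


measured = throughput_bytes_per_s(1024, 200_000, 400_000_000)
peak = spi_peak_bytes_per_s(100_000_000)
```

Under those assumptions `measured` is 2,048,000 bytes/s against a `peak` of 12,500,000 bytes/s, so the polling figure is roughly 16% of what the bus could carry.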

Optimal size for a ring buffer with single producer and single consumer

I have a single producer, single consumer problem which (I believe) can be solved using a circular/ring buffer.
I have a microcontroller running an RTOS, with an ISR (Interrupt Service Routine) handling UART (serial port) interrupts. When the UART raises an interrupt, the ISR posts the received characters into the circular buffer. In another RTOS task (called packet_handler), I am reading from this circular buffer and running a state machine to decode the packet. A valid packet has 64 bytes including all the framing bytes.
The UART operates at 115200 baud, and a packet arrives every 10 ms. The packet_handler runs every 10 ms. However, this handler may sometimes get delayed by the scheduler due to another higher-priority task executing.
If I use an arbitrarily large circular buffer, there are no packet drops. But how to determine the optimal circular buffer size in this case? At least theoretically. How is this decision of buffer size made in practice?
Currently, I am detecting overrun of the buffer through some instrumentation functions and then increasing the buffer size to reduce packet loss.
You won't be safe, ever, as you are dealing with a stochastic process (according to your explanation).
Answering your question: you would need an infinite buffer just in case the consumer task stays in the ready state indefinitely. So, you will have to change something in your initial approach:
Increase the priority of the consumer, in order to ensure the 10 ms execution (the smallest-buffer approach, but it may not be possible).
Try to get a better characterization of your model, in order to predict the maximum gap of time in which the consumer task won't be executed (make your system as predictable as possible).
Lose packets with an arbitrarily chosen buffer size (it may not be safe).
I would calculate in this way:
64 bytes received just now
64 bytes still in the ring buffer
+100% to be safe
===================
256-byte buffer
But this is just a guess. You would have to run some worst-case tests with your buffer and then add 100% to be safe.
While all of the above answers are correct and throw light on the issue, this page summarizes all the factors to be considered when choosing the size of a ring buffer.
Some queuing models can be used to theoretically analyze the problem at hand and find out the suitable size of ring buffer.
A more pragmatic approach is to start with a large buffer, then find out the maximum used buffer size in real test case (this process is called watermarking) and use this figure in the final code.
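Watermarking as described above can be sketched as a ring buffer that records its own peak fill level. This Python model is for illustration only: `WatermarkRing` is a made-up name, and real firmware sharing a buffer between an ISR and a task would need to protect or avoid the shared `count` field.

```python
class WatermarkRing:
    """Ring buffer that tracks the highest fill level ever reached."""

    def __init__(self, capacity):
        self.data = [0] * capacity
        self.capacity = capacity
        self.head = 0        # next write index (producer, i.e. the ISR)
        self.tail = 0        # next read index (consumer task)
        self.count = 0
        self.high_water = 0  # peak number of bytes held at once
        self.overruns = 0    # bytes dropped because the buffer was full

    def put(self, byte):
        if self.count == self.capacity:
            self.overruns += 1
            return False
        self.data[self.head] = byte
        self.head = (self.head + 1) % self.capacity
        self.count += 1
        self.high_water = max(self.high_water, self.count)
        return True

    def get(self):
        if self.count == 0:
            return None
        byte = self.data[self.tail]
        self.tail = (self.tail + 1) % self.capacity
        self.count -= 1
        return byte
```

After a long test run with a deliberately oversized buffer, `high_water` (plus whatever margin you are comfortable with) is the figure to use in the final code, and `overruns` should have stayed at zero.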
It is simply a matter of determining the maximum possible delay - the sum of the execution times of all higher priority tasks that can run - and dividing that by the packet arrival interval.
This may not be straightforward, but can be simplified by making only the most deterministic tasks higher priority and moving less deterministic and longer-running tasks to lower priorities according to rate-monotonic principles. Tasks that frequently run for a short time, but sporadically run for longer, are candidates for being split into two tasks (and further queues) to offload the longer execution to a lower priority.
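The sizing rule above (worst-case delay divided by the packet interval) can be sketched in Python as follows. The extra packet added for the one arriving while the backlog is drained, and the 10 bits per byte of 8N1 framing, are assumptions of this sketch rather than something stated in the question; the function names are made up.

```python
import math


def uart_bytes_per_s(baud, bits_per_frame=10):
    # Assumes 8N1: 1 start bit + 8 data bits + 1 stop bit per byte.
    return baud / bits_per_frame


def ring_size_bytes(max_delay_ms, packet_interval_ms=10, packet_bytes=64):
    """Bytes needed to ride out the worst-case consumer delay."""
    # Packets that pile up while the consumer is held off, plus one
    # more for the packet still arriving when it finally runs.
    packets = math.ceil(max_delay_ms / packet_interval_ms) + 1
    return packets * packet_bytes
```

For example, if higher-priority tasks can hold off packet_handler for at most 30 ms, this gives 256 bytes. At 115200 baud the link carries 11,520 bytes/s, so one 64-byte packet occupies the line for about 5.6 ms of each 10 ms period.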

Idle state in RTOS, sleep state or lowest frequency?

In real-time systems using an RTOS, how would the RTOS handle an idle period? Would it run nop instructions at the lowest frequency supported by a Dynamic Voltage Scaling capable processor, or would it switch to a sleep state? Can anyone refer me to actual practical implementations? Thanks
It will depend entirely on the target hardware and possibly the needs and design of the application. For example on ARM Cortex-M you would typically invoke the WFI instruction which shuts down the core until the occurrence of an interrupt.
In many microcontroller/SoC cases, reducing the PLL clock frequency would affect the on-chip peripherals from which hardware interrupts might occur, so that is less likely. It would affect baud rates and timer resolution, and is perhaps hard to manage easily. There is a paper here on a tickless idle power management method on FreeRTOS/Cortex-M3.
In most cases the idle loop source is provided as part of the board-support, so you can customise it to your needs.