Idle state in RTOS, sleep state or lowest frequency?

In real-time systems using an RTOS, how would the RTOS handle an idle period? Would it run NOP instructions at the lowest frequency supported by a Dynamic Voltage Scaling capable processor, or would it enter a sleep state? Can anyone refer me to actual practical implementations? Thanks.

It will depend entirely on the target hardware and possibly the needs and design of the application. For example, on ARM Cortex-M you would typically invoke the WFI instruction, which halts the core until an interrupt occurs.
In many microcontroller/SoC cases, reducing the PLL clock frequency would affect the on-chip peripherals from which hardware interrupts might occur, so that approach is less likely. It would affect baud rates and timer resolution, and is hard to manage cleanly. There is a published paper on a tickless idle power-management method for FreeRTOS on the Cortex-M3.
In most cases the idle loop source is provided as part of the board-support, so you can customise it to your needs.
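As a minimal sketch, assuming FreeRTOS on a Cortex-M part with the CMSIS headers available, the idle hook can simply execute WFI so the core sleeps until the next interrupt; FreeRTOS also offers configUSE_TICKLESS_IDLE to suppress the tick interrupt during long idle periods for deeper savings:

    /* FreeRTOSConfig.h: enable the idle hook (set configUSE_TICKLESS_IDLE to 1
       instead or as well if the tick interrupt should be suppressed while idle). */
    #define configUSE_IDLE_HOOK    1

    /* Called by the FreeRTOS idle task whenever no other task is ready to run.
       __WFI() is the CMSIS intrinsic for the WFI instruction: the core stops
       until any enabled interrupt (SysTick, a peripheral, ...) wakes it again. */
    void vApplicationIdleHook(void)
    {
        __WFI();
    }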

Related

How to run a periodic thread at high frequency (>100 kHz) on a Cortex-M3 microcontroller in an RTOS?

I'm implementing a high-frequency (>100 kHz) data acquisition system with an STM32F107VC microcontroller. It uses the SPI peripheral to communicate with a high-frequency ADC chip. I have to use an RTOS. How can I do this?
I have tried FreeRTOS, but its maximum tick frequency is 1000 Hz, so I can't run a thread every 1 µs, for example, with FreeRTOS. I also tried Keil RTX5, whose tick frequency can be up to 1 MHz, but I read somewhere that it is not recommended to set the tick frequency that high because it increases the overall context-switching overhead. So what should I do?
Thanks.
You do not want to run a task at this frequency. As you mentioned, context switches will kill the performance. This is horribly inefficient.
Instead, you want to use buffering, interrupts and DMA. Since it's a high-frequency ADC chip, it probably has an internal buffer of its own; check the datasheet for this. If the chip has a 16-sample buffer, sampling at 100 kHz only needs servicing at 6.25 kHz. Don't use a task to process the samples at 6.25 kHz either. Do the receiving in an interrupt (from a timer or some signal), have the interrupt only fill a buffer, and wake up a task for processing when the buffer is full (switching to another buffer until the task has finished). With this you can have a task that runs only every 10 ms or so. An interrupt is not a context switch: on a Cortex-M3 it will have a latency of around 12 cycles, which is low enough to be negligible at 6.25 kHz.
If your ADC chip doesn't have a buffer (though I doubt that), you may be OK with a 100 kHz interrupt, but put as little code as possible inside it.
A better solution is to use DMA if your MCU supports that. For example, you can set up a DMA channel to receive from the SPI using a timer as a request generator. Depending on your case it may be impossible or tricky to configure, but a working DMA means that you can receive a large buffer of samples without any code running on your MCU.
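A rough sketch of the interrupt-plus-buffer pattern described above, assuming FreeRTOS; spi_read_sample() and process_block() are hypothetical helpers, and the interrupt does minimal work per sample while the task only wakes once per block:

    #include <stdint.h>
    #include "FreeRTOS.h"
    #include "task.h"

    #define BLOCK_SIZE 128u

    extern uint16_t spi_read_sample(void);                  /* hypothetical: read one ADC sample */
    extern void process_block(uint16_t *buf, uint32_t n);   /* hypothetical: consume one block   */

    static uint16_t buffers[2][BLOCK_SIZE];   /* fill one block while the other is processed */
    static volatile uint32_t fill_index;      /* which buffer the ISR is currently filling   */
    static volatile uint32_t sample_count;
    static TaskHandle_t processing_task;      /* created elsewhere at start-up */

    /* Sample-rate interrupt (timer or SPI RX): store one sample, hand off full blocks. */
    void SAMPLE_IRQHandler(void)
    {
        BaseType_t woken = pdFALSE;

        buffers[fill_index][sample_count++] = spi_read_sample();

        if (sample_count == BLOCK_SIZE) {
            sample_count = 0;
            fill_index ^= 1u;                                 /* switch to the other block  */
            vTaskNotifyGiveFromISR(processing_task, &woken);  /* wake the processing task   */
            portYIELD_FROM_ISR(woken);
        }
    }

    /* Task context: runs once per block (e.g. every 1.28 ms at 100 ksps with 128 samples). */
    void processing_task_fn(void *arg)
    {
        (void)arg;
        for (;;) {
            ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
            process_block(buffers[fill_index ^ 1u], BLOCK_SIZE);
        }
    }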
I have to use an RTOS.
No way. If it's a requirement from your boss or client, run away from the project fast. If that's not possible, communicate your concerns in writing now to save your posterior when the reasons for failure are discussed later. If it's your own idea, then reconsider now.
The maximum system clock speed of the STM32F107 is 36 MHz (72 MHz if there is an external HSE crystal), meaning that there are only 360 to 720 system clock cycles between ticks arriving at 100 kHz. The RTX5 warning is right: a significant amount of that time would be taken up by task-switching overhead.
It is possible to have a timer interrupt at 100 kHz and do some simple processing in the interrupt handler (don't even think about using HAL), but I'd recommend first investigating whether it's really necessary to run code every 10 µs, or whether some of that work can be offloaded to the DMA or timer hardware.
Since you only have a few hundred cycles (instructions) between inputs, the typical solution is to use an interrupt to be alerted that data is available, and then have the interrupt handler put the data somewhere so you can process it at your leisure. Of course, if the data comes in continuously at that rate, you may be in trouble with no time left for actual processing. Depending on how much data is coming in and how frequently, a simple ring buffer may be sufficient. If the amount of data is relatively large (how large is large? consider that a memory access takes more than one CPU cycle, and each incoming datum needs two memory accesses), then using DMA as @Elderbug suggested is a great solution, as that consumes the minimum number of CPU cycles.
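The ring buffer mentioned above is just a circular buffer with a single producer (the ISR) and a single consumer (the task); a minimal sketch, with a power-of-two size so the index wrap is a cheap mask:

    #include <stdint.h>

    #define RB_SIZE 256u                       /* must be a power of two */

    static volatile uint16_t rb_data[RB_SIZE];
    static volatile uint32_t rb_head;          /* written only by the ISR (producer)  */
    static volatile uint32_t rb_tail;          /* written only by the task (consumer) */

    /* Called from the interrupt handler: store one sample, report overrun if full. */
    static inline int rb_put(uint16_t sample)
    {
        uint32_t next = (rb_head + 1u) & (RB_SIZE - 1u);
        if (next == rb_tail)
            return 0;                          /* buffer full */
        rb_data[rb_head] = sample;
        rb_head = next;
        return 1;
    }

    /* Called from task context: fetch one sample if one is available. */
    static inline int rb_get(uint16_t *out)
    {
        if (rb_tail == rb_head)
            return 0;                          /* empty */
        *out = rb_data[rb_tail];
        rb_tail = (rb_tail + 1u) & (RB_SIZE - 1u);
        return 1;
    }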
There is no need to set the RTOS tick to match the data acquisition rate - the two are unrelated. And to do so would be a very poor and ill-advised solution.
The STM32 has DMA capability for most peripherals including SPI. You need to configure the DMA and SPI to transfer a sequence of samples directly to memory. The DMA controller has full and half transfer interrupts, and can cycle a provided buffer so that when it is full, it starts again from the beginning. That can be used to "double buffer" the sample blocks.
So for example, if you use a DMA buffer of say 256 samples and sample at 100 ksps, you will get a DMA interrupt every 1.28 ms, independent of the RTOS tick interrupt and scheduling. On the half-transfer interrupt the first 128 samples are ready for processing; on the full-transfer interrupt the second 128 samples can be processed; and in the 1.28 ms interval the processor is free to do useful work.
Rather than processing all the block data in the interrupt handler itself - which would not in any case be possible if the processing were non-deterministic or blocking, such as writing it to a file system - you might, for example, send the samples in blocks via a message queue to a task context that performs the less deterministic processing.
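A sketch of that double-buffered DMA scheme using the STM32 HAL SPI/DMA callbacks and a FreeRTOS queue; it assumes the DMA channel is configured in circular mode, and process_block(), the task and the queue names are illustrative:

    #include <stdint.h>
    #include "FreeRTOS.h"
    #include "queue.h"
    #include "stm32f1xx_hal.h"                  /* adjust to your STM32 family */

    #define NUM_SAMPLES 256u                    /* whole circular buffer: 2 x 128-sample blocks */

    static uint16_t samples[NUM_SAMPLES];
    static QueueHandle_t block_queue;           /* carries pointers to ready half-buffers */

    extern void process_block(uint16_t *buf, uint32_t n);   /* hypothetical consumer */

    /* Call once at start-up: create the queue and start the circular SPI+DMA reception. */
    void acquisition_start(SPI_HandleTypeDef *hspi)
    {
        block_queue = xQueueCreate(4, sizeof(uint16_t *));
        HAL_SPI_Receive_DMA(hspi, (uint8_t *)samples, NUM_SAMPLES);
    }

    /* Half-transfer: the first half is ready while DMA carries on filling the second half. */
    void HAL_SPI_RxHalfCpltCallback(SPI_HandleTypeDef *hspi)
    {
        BaseType_t woken = pdFALSE;
        uint16_t *block = &samples[0];
        (void)hspi;
        xQueueSendFromISR(block_queue, &block, &woken);
        portYIELD_FROM_ISR(woken);
    }

    /* Full transfer: the second half is ready; DMA wraps round and refills the first half. */
    void HAL_SPI_RxCpltCallback(SPI_HandleTypeDef *hspi)
    {
        BaseType_t woken = pdFALSE;
        uint16_t *block = &samples[NUM_SAMPLES / 2u];
        (void)hspi;
        xQueueSendFromISR(block_queue, &block, &woken);
        portYIELD_FROM_ISR(woken);
    }

    /* Task context: one 128-sample block arrives every 1.28 ms at 100 ksps. */
    void sample_task(void *arg)
    {
        uint16_t *block;
        (void)arg;
        for (;;) {
            if (xQueueReceive(block_queue, &block, portMAX_DELAY) == pdTRUE)
                process_block(block, NUM_SAMPLES / 2u);
        }
    }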
Note that none of this relies on the RTOS tick - the scheduler will run after any interrupt if that interrupt calls a scheduling function such as posting to a message queue. Synchronising actions to an RTOS clock running asynchronously to the triggering event (i.e. polling) is not a good way to achieve highly deterministic real-time response and is a particularly poor method for signal acquisition, which requires a jitter free sampling interval to avoid false artefacts in the signal from aperiodic sampling.
Your assumption that you need to solve this problem with an inappropriately high RTOS tick rate reflects a misunderstanding of how the RTOS operates, and it will probably only work if your processor is doing no other work beyond sampling data - in which case you might not need an RTOS at all, though it would not be a very efficient use of the processor.

STM32 SPI bandwidth evaluation procedure

I'm testing the SPI capabilities of the STM32H7. For this I'm using the SPI examples provided in STM32CubeH7 on two Nucleo-H743ZI boards. I will perhaps not keep this code in my own development; right now the goal is to understand how SPI works and what bandwidth I can get in the different modes (with DMA, with cache enabled or not, etc.).
I'd like to share the figures I've computed, as they don't seem very high. In the example, if I understood correctly, the CPU is at 400 MHz and the SPI bus frequency at 100 MHz.
For polling mode I've measured the number of cycles taken by the call to HAL_SPI_TransmitReceive.
For DMA I've measured between the call to HAL_SPI_TransmitReceive_DMA and the transfer-complete callback.
Cycle measurements were made with SysTick clocked from the internal clock. Since no low-power modes are used, it should be accurate.
I've just modified ST's examples to send a buffer of 1KB.
I get around 200,000 CPU cycles in polling mode, which works out to around 2 MB/s,
and around 3 MB/s in DMA mode.
Since the SPI clock runs at 100 MHz I would have expected much more, especially in DMA mode. What do you think? Is there something wrong with my test procedure?
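For reference, the kind of measurement described above can also be done with the DWT cycle counter rather than SysTick; a sketch using the CMSIS register names and the same 1 KB polled transfer (hspi, tx and rx are assumed to be set up elsewhere):

    #include "stm32h7xx_hal.h"     /* pulls in the CMSIS core headers */

    /* Enable the DWT cycle counter once at start-up (on some Cortex-M7 parts the
       DWT lock access register must be unlocked first - check the reference manual). */
    static void cycle_counter_init(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the trace block      */
        DWT->CYCCNT = 0u;
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting core cycles  */
    }

    /* Measure one 1 KB polled transfer; at 400 MHz, cycles / 400 gives microseconds. */
    static uint32_t measure_polling(SPI_HandleTypeDef *hspi, uint8_t *tx, uint8_t *rx)
    {
        uint32_t start = DWT->CYCCNT;
        HAL_SPI_TransmitReceive(hspi, tx, rx, 1024u, HAL_MAX_DELAY);
        return DWT->CYCCNT - start;
    }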

What is responsible for changing a core's load and frequency in a multicore processor?

Having looked for a description of multicore designs, I keep finding several diagrams that all look much the same.
I know from looking at i7z command output that different cores can run at different frequencies.
This would suggest that the decisions regarding which core is given a new process, and regarding changing the frequency of a core, are made either by the operating system or by the core's own control logic.
My question is: what controls the frequency of each individual core? And is the job of associating a READY process with a specific core placed upon the operating system, or is it done by something within the processor?
Scheduling processes/threads to cores is purely up to the OS. The hardware has no understanding of tasks waiting to run. Maintaining the OS's list of processes that are runnable vs. waiting for I/O is completely a software thing.
Migrating a thread from one core to another is done by kernel code on the original core storing the architectural state to memory, then OS code on the new core restoring that saved state and resuming user-space execution.
Traditionally, frequency and voltage scaling decisions are made by the OS. Take Linux as an example: the decision-making code is called a governor (the Arch Wiki page on CPU frequency scaling is a good overview as well). It looks at things like how often processes have used their entire time slice on the current core. If the governor decides the CPU should run at a different speed, it programs some control registers to implement the change. As I understand it, the hardware takes care of choosing the right voltage to support the requested frequency.
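To make that concrete on Linux, the cpufreq subsystem exposes the governor and current frequency per core through sysfs; a small sketch that reads them for CPU 0 using the standard cpufreq paths:

    #include <stdio.h>

    /* Print the active cpufreq governor and current frequency (in kHz) for CPU 0.
       Both files are provided by the Linux cpufreq subsystem under sysfs. */
    int main(void)
    {
        char buf[64];
        FILE *f;

        f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "r");
        if (f != NULL) {
            if (fgets(buf, sizeof buf, f) != NULL)
                printf("governor: %s", buf);        /* e.g. "schedutil" or "performance" */
            fclose(f);
        }

        f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
        if (f != NULL) {
            if (fgets(buf, sizeof buf, f) != NULL)
                printf("current frequency (kHz): %s", buf);
            fclose(f);
        }
        return 0;
    }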
As I understand it, the OS running on each core makes decisions independently. On hardware that allows each core to run at a different frequency, the decision-making code on different cores doesn't need to coordinate. If running a high frequency on one core requires a high voltage chip-wide, the hardware takes care of that. I think the modern implementation of DVFS (dynamic voltage and frequency scaling) is fairly high-level, with the OS just telling the hardware which of N choices it wants, and the onboard power microcontroller taking care of the details of programming oscillators / clock dividers and voltage regulators.
Intel's "Turbo" feature, which opportunistically boosts the frequency above the max sustainable frequency, does the decision making in hardware. Any time the OS requests the highest advertised frequency, the CPU uses turbo when power and cooling allow.
Intel's Skylake takes this a step further: The OS can hand full control over DVFS to the hardware, optionally with constraints. That lets it react from microsecond to microsecond, rather than on a timescale of milliseconds. This does actually allow better performance in bursty workloads, because more power budget is available for turbo when it's useful. A few benchmarks are bursty enough to observe this, like some browser / javascript ones IIRC.
There was a whole talk about Skylake's new power management at IDF2015; check out the slides and/or the archived webcast. The old method is described there in a lot of detail too, to illustrate the difference, so it's worth watching if you want more detail than my summary. (A list of other IDF talks was published as well; thanks to Agner Fog's blog for the link.)
The core frequency is controlled by a given voltage applied to a core's "oscillator".
This voltage can be changed by the Operating System but it can also be changed by the BIOS itself if a high temperature is detected in the CPU.

RTC vs PIT for scheduler

My professor said that it is recommended to use the PIT instead of the RTC to implement an epoch-based round-robin scheduler. He didn't really mention any concrete reasons, and I can't think of any either. Any thoughts?
I personally would use the PIT (if you can only choose between these two; modern OSes use the HPET, IIRC).
One, it can generate interrupts at a higher frequency (although I question whether preempting a process within milliseconds is beneficial).
Two, it has a higher priority on the PIC chip, which means it can't be interrupted by other IRQs.
Personally, I use the PIT for the scheduler and the RTC timer for wall-clock timekeeping.
The RTC can be changed (it is, after all, a normal "clock"), meaning its value can't be trusted from an OS perspective. It might also not have the resolution and/or precision needed for OS scheduler interrupts.
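For a concrete example of the scheduler side, here is a minimal sketch of programming PIT channel 0 as a periodic tick source; it assumes an outb() port-I/O helper, as is usual in hobby-OS kernels:

    #include <stdint.h>

    #define PIT_BASE_HZ   1193182u   /* the PIT input clock is ~1.193182 MHz */
    #define PIT_CHANNEL0  0x40
    #define PIT_COMMAND   0x43

    /* outb() writing a byte to an I/O port is assumed to exist elsewhere. */
    extern void outb(uint16_t port, uint8_t value);

    /* Program channel 0, mode 3 (square wave), lobyte/hibyte access,
       so IRQ0 fires 'hz' times per second as the scheduler tick. */
    void pit_set_frequency(uint32_t hz)
    {
        uint16_t divisor = (uint16_t)(PIT_BASE_HZ / hz);

        outb(PIT_COMMAND, 0x36);                        /* channel 0, lo/hi, mode 3 */
        outb(PIT_CHANNEL0, (uint8_t)(divisor & 0xFF));  /* low byte of the divisor  */
        outb(PIT_CHANNEL0, (uint8_t)(divisor >> 8));    /* high byte                */
    }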
While this doesn't answer the question directly, here are some further insights into choosing the preemption timer.
On modern systems (i586+; I am not sure whether the i486's external local APIC (LAPIC) had a timer) you should use neither, because you always get the local APIC timer, which is per-core. What's more, using either the PIT or the RTC for timer interrupts is already obsolete.
The LAPIC timer is usually used for preemption on modern systems, while the HPET is used for high-precision events. On systems that have an HPET, there is usually no physical PIT; also, the first two HPET comparators can replace the PIT and RTC interrupt sources, which is the simplest possible configuration for them and is preferred in most cases.
PITs are faster. RTCs typically increment no faster than 8 kHz and are most commonly configured to increment at 1 Hz (once a second).
The PIT has an interrupt function.
The PIT has higher resolution than the real-time clock.

Difference between latency and jitter in operating systems

When discussing criteria for operating systems, I keep hearing about interrupt latency and OS jitter. Now I ask myself: what is the difference between these two?
In my opinion, interrupt latency is the delay from the occurrence of an interrupt until the interrupt service routine (ISR) is entered.
Jitter, by contrast, is how much the moment of entering the ISR varies over time.
Is that how you understand it too?
Your understanding is basically correct.
Latency = Delay between an event happening in the real world and code responding to the event.
Jitter = Differences in Latencies between two or more events.
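A tiny worked example of the two definitions, using made-up timestamps (event time vs. ISR entry time, in microseconds): the latency is per event, and the jitter is how much those latencies spread.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical measurements: when each event occurred and when its handler ran (us). */
    static const uint32_t event_time[]   = { 1000u, 2000u, 3000u, 4000u };
    static const uint32_t handler_time[] = { 1012u, 2015u, 3011u, 4020u };

    int main(void)
    {
        uint32_t min_lat = UINT32_MAX, max_lat = 0u;

        for (int i = 0; i < 4; i++) {
            uint32_t latency = handler_time[i] - event_time[i];   /* per-event latency */
            if (latency < min_lat) min_lat = latency;
            if (latency > max_lat) max_lat = latency;
            printf("event %d: latency %lu us\n", i, (unsigned long)latency);
        }
        /* Jitter: the variation between the best-case and worst-case latency. */
        printf("jitter: %lu us\n", (unsigned long)(max_lat - min_lat));
        return 0;
    }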
In the realm of clustered computing, especially when dealing with massive scale-out solutions, there are cases where work distributed across many systems (and many, many processor cores) needs to complete in fairly predictable time frames. The operating system, and the software stack being used, can introduce some variability in the run-times of these chunks of work. This variability is often referred to as "OS jitter".
Interrupt latency, as you said, is the time between the interrupt signal and entry into the interrupt handler.
The two concepts are orthogonal to each other. In practice, however, more interrupts generally imply more OS jitter.