Can a sub-microsecond clock resolution be achieved with current hardware? - x86-64

I have a thread that needs to process a list of items every X nanoseconds, where X < 1 microsecond. I understand that with standard x86 hardware the clock resolution is at best 15 - 16 milliseconds. Is there hardware available that would enable a clock resolution < 1 microsecond? At present, the thread runs continuously as the resolution of nanosleep() is insufficient. The thread obtains the current time from a GPS reference.

You can get the current time with extremely high precision on x86 using the rdtsc instruction. On modern CPUs it counts on a fixed reference clock, not the actual (dynamically scaled) core clock, so you can use it as a time source once you find the coefficients that map it to real GPS time.
This is the clock source Linux uses internally, on new enough hardware. (Older CPUs had the TSC pause when the CPU was halted on idle, and/or change frequency with CPU frequency scaling.) It was originally intended for measuring CPU time, but it turned out that a very precise clock with very cheap reads (~30 clock cycles) was valuable, hence decoupling it from CPU core clock changes.
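For illustration, here is a minimal sketch of using the TSC from user space: read it with the __rdtsc() intrinsic and calibrate it against a reference clock. CLOCK_REALTIME stands in for your GPS source here, and the 100 ms calibration window and variable names are purely illustrative.

```c
/* Minimal sketch: read the TSC and map it to wall-clock time.
 * Assumes an invariant-TSC x86-64 CPU and GCC/Clang (for x86intrin.h).
 * CLOCK_REALTIME stands in for the GPS-disciplined reference here. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc() */

static double ns_per_tick;   /* slope: nanoseconds per TSC tick */
static uint64_t tsc0;        /* TSC value at calibration start */
static uint64_t ns0;         /* reference time at calibration start */

static uint64_t ref_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Measure how many TSC ticks elapse over a known reference interval. */
static void calibrate(void)
{
    tsc0 = __rdtsc();
    ns0  = ref_now_ns();
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
    nanosleep(&delay, NULL);                 /* ~100 ms calibration window */
    uint64_t tsc1 = __rdtsc();
    uint64_t ns1  = ref_now_ns();
    ns_per_tick = (double)(ns1 - ns0) / (double)(tsc1 - tsc0);
}

/* Cheap high-resolution "now" derived from the TSC. */
static uint64_t tsc_now_ns(void)
{
    return ns0 + (uint64_t)((__rdtsc() - tsc0) * ns_per_tick);
}

int main(void)
{
    calibrate();
    printf("estimated TSC period: %.3f ns/tick\n", ns_per_tick);
    printf("now: %llu ns\n", (unsigned long long)tsc_now_ns());
    return 0;
}
```

In practice you would refine the slope continuously against the GPS reference rather than from a single 100 ms window.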
It sounds like an accurate clock isn't your only problem, though: If you need to process a list every ~1 us, without ever missing a wakeup, you need a realtime OS, or at least realtime functionality on top of a regular OS (like Linux).
Knowing what time it is when you do eventually wake up doesn't help if you slept 10 ms too long because you touched a page of memory that the OS had decided to evict and had to fetch back from disk.
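If you stay on stock Linux, the usual mitigations are locking memory and putting the thread in a real-time scheduling class. A rough sketch (the priority value is arbitrary, and SCHED_FIFO needs root or CAP_SYS_NICE):

```c
/* Sketch: reduce (not eliminate) wakeup jitter on stock Linux. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int make_realtime(void)
{
    /* Lock all current and future pages so a page fault to disk
     * can never stall the processing thread. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return -1;
    }

    /* Put the calling thread into the FIFO real-time class. */
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = 80;   /* illustrative priority */
    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err != 0) {
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
        return -1;
    }
    return 0;
}
```

Even with this, a stock kernel can still delay you by more than a microsecond; a PREEMPT_RT kernel or a dedicated (isolated) core gets you closer.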

Related

STM32 ADC: leave it running at 'high' speed or switch it off as much as possible?

I am using a G0 with one ADC and 8 channels. Works fine. I use 4 channels. One is temperature, which is measured constantly, and I am interested in the value every 60 s. Another one is almost the opposite: it is measuring sound waves for a couple of minutes per day and I need those samples at 10 kHz.
I solved this by letting all 4 channels sample at 10kHz and have the four readings moved to memory by DMA (array of length 4 with 1 measurement each). Every 60s I take the temperature and when I need the audio, I retrieve the audio values.
If I had two ADCs, I would start the temperature ADC for one conversion every 60 s, non-stop, and I would only start the audio ADC for the couple of minutes a day that it is needed. But with the one-ADC solution, it seems simple to let all conversions run at this high speed continuously, and that raised my question: is there any true downside to having 40,000 conversions per second, 24 hours per day? If not, the code is simple; I just have the most recent values in memory all the time. But maybe I ruin the chip? I use too much energy, I know, but there is plenty of it in this case.
You aren't going to "wear it out" by running it when you don't need to.
The main problems are wasting power and RAM.
If you have enough of these, then the lesser problems are:
The wasted power will become heat, which may upset your temperature measurements (though the amount is very small).
Having the DMA running will increase your interrupt latency and may also slow down the processor slightly if it encounters bus contention (this only matters if you are close to capacity in those regards).
Having it running all the time may also have the advantage of more stable readings, since nothing is perturbed by being switched on and off.
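For what it's worth, the "just let it run" setup is only a few lines with the ST HAL, assuming hadc1 has been configured (e.g. by CubeMX) for 4 ranked channels in scan + continuous mode with a circular DMA channel; the rank-to-channel mapping below is only an assumption:

```c
/* Sketch of the continuous 4-channel ADC + circular DMA approach.
 * Assumes the ADC and DMA were configured elsewhere (e.g. CubeMX). */
#include "main.h"    /* project header declaring hadc1 (assumption) */

extern ADC_HandleTypeDef hadc1;

static uint16_t adc_samples[4];   /* DMA keeps these four values updated forever */

void adc_start_continuous(void)
{
    HAL_ADC_Start_DMA(&hadc1, (uint32_t *)adc_samples, 4);
}

/* Read whenever a value is needed, e.g. temperature once a minute. */
uint16_t adc_read_temperature(void)
{
    /* Volatile read because the DMA updates the array behind the compiler's back. */
    return *(volatile uint16_t *)&adc_samples[0];   /* rank 1 = temperature (assumption) */
}
```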

How to run a periodic thread at high frequency (>100 kHz) on a Cortex-M3 microcontroller in an RTOS?

I'm implementing a high-frequency (>100 kHz) data acquisition system with an STM32F107VC microcontroller. It uses the SPI peripheral to communicate with a high-frequency ADC chip. I have to use an RTOS. How can I do this?
I have tried FreeRTOS, but its maximum tick frequency is 1000 Hz, so I can't run a thread every 1 µs, for example. I also tried Keil RTX5, whose tick frequency can be set up to 1 MHz, but I read somewhere that it is not recommended to set the tick frequency that high because it increases the overall context-switching overhead. So what should I do?
Thanks.
You do not want to run a task at this frequency. As you mentioned, context switches will kill the performance. This is horribly inefficient.
Instead, you want to use buffering, interrupts and DMA. Since it's a high-frequency ADC chip, it probably has an internal buffer of its own; check the datasheet. If the chip has a 16-sample buffer, 100 kHz sampling only needs servicing at 6.25 kHz. Don't use a task to process the samples at 6.25 kHz either: do the receiving in an interrupt (from a timer or a data-ready signal), have the interrupt only fill a buffer, and wake up a task for processing when the buffer is full (switching to another buffer until the task has finished). With this you can have a task that runs only every 10 ms or so. An interrupt is not a context switch; on a Cortex-M3 it has a latency of around 12 cycles, which is low enough to be negligible at 6.25 kHz.
If your ADC chip doesn't have a buffer (which I doubt), you may be OK with a 100 kHz interrupt, but put as little code as possible inside it.
A better solution is to use DMA if your MCU supports it. For example, you can set up a DMA channel to receive from the SPI using a timer as the request generator. Depending on your case it may be impossible or tricky to configure, but a working DMA means that you can receive a large buffer of samples without any code running on your MCU.
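As a rough illustration of the buffer-and-wake pattern with FreeRTOS (the interrupt name, block size and read_adc_fifo() are placeholders, not taken from your hardware):

```c
/* Sketch: ISR fills a double buffer, a task is woken once per full block. */
#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"

#define BLOCK_LEN 256

static uint16_t blocks[2][BLOCK_LEN];   /* double buffer */
static volatile int fill_idx;           /* block the ISR is currently filling */
static volatile int fill_pos;
static TaskHandle_t processing_task;    /* created elsewhere with xTaskCreate */

extern uint16_t read_adc_fifo(void);    /* placeholder for the SPI/ADC read */

/* Timer or data-ready interrupt: store one sample, wake the task per block. */
void ADC_DataReady_IRQHandler(void)     /* illustrative handler name */
{
    blocks[fill_idx][fill_pos++] = read_adc_fifo();
    if (fill_pos == BLOCK_LEN) {
        fill_pos = 0;
        fill_idx ^= 1;                              /* switch buffers */
        BaseType_t woken = pdFALSE;
        vTaskNotifyGiveFromISR(processing_task, &woken);
        portYIELD_FROM_ISR(woken);
    }
}

/* Task runs once per block, i.e. a few hundred Hz, not 100 kHz. */
static void processing(void *arg)
{
    (void)arg;
    for (;;) {
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
        uint16_t *ready = blocks[fill_idx ^ 1];     /* block just completed */
        /* ... process BLOCK_LEN samples here ... */
        (void)ready;
    }
}
```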
I have to use an RTOS.
No way. If it's a requirement from your boss or client, run away from the project fast. If that's not possible, communicate your concerns in writing now to cover yourself when the reasons for failure are discussed. If it's your own idea, then reconsider now.
The maximum system clock speed of the STM32F107 is 36 MHz (72 MHz with an external HSE crystal), meaning that there are only 360 to 720 system clock cycles between ticks arriving at 100 kHz. The RTX5 warning is right: a significant share of that time would be eaten by task-switching overhead.
It is possible to have a timer interrupt at 100 kHz and do some simple processing in the interrupt handler (don't even think about using HAL there), but I'd recommend first investigating whether it's really necessary to run code every 10 μs, or whether some of that work could be offloaded to the DMA or timer hardware.
Since you only have a few hundred cycles (instructions) between inputs, the typical solution is to use an interrupt to be alerted that data is available, and have the interrupt handler put the data somewhere so you can process it at your leisure. Of course, if the data comes in continuously at that rate, you may be in trouble with no time left for actual processing. Depending on how much data is coming in and how frequently, a simple ring buffer may be sufficient. If the amount of data is relatively large (how large is large? consider that a memory access takes more than one CPU cycle, and each incoming datum needs at least two memory accesses), then using DMA as @Elderbug suggested is a great solution, as it consumes the minimum of CPU cycles.
There is no need to set the RTOS tick to match the data acquisition rate - the two are unrelated. And to do so would be a very poor and ill-advised solution.
The STM32 has DMA capability for most peripherals including SPI. You need to configure the DMA and SPI to transfer a sequence of samples directly to memory. The DMA controller has full and half transfer interrupts, and can cycle a provided buffer so that when it is full, it starts again from the beginning. That can be used to "double buffer" the sample blocks.
So for example if you use a DMA buffer of say 256 samples and sample at 100Ksps, you will get a DMA interrupt every 1.28ms independent of the RTOS tick interrupt and scheduling. On the half-transfer interrupt the first 128 samples are ready for processing, on the full-transfer, the second 128 samples can be processed, and in the 1.28ms interval, the processor is free to do useful work.
In the interrupt handler, rather than processing the whole block of data there (which would not in any case be possible if the processing were non-deterministic or blocking, such as writing it to a file system), you might for example send the samples in blocks via a message queue to a task context that performs the less deterministic processing.
Note that none of this relies on the RTOS tick - the scheduler will run after any interrupt if that interrupt calls a scheduling function such as posting to a message queue. Synchronising actions to an RTOS clock running asynchronously to the triggering event (i.e. polling) is not a good way to achieve highly deterministic real-time response and is a particularly poor method for signal acquisition, which requires a jitter free sampling interval to avoid false artefacts in the signal from aperiodic sampling.
Your assumption that you need to solve this problem with an inappropriately high RTOS tick rate is a misunderstanding of how the RTOS operates, and it will probably only work if your processor is doing no other work beyond sampling data - in which case you might not need an RTOS at all, though it would not be a very efficient use of the processor.
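To make the half/full-transfer scheme concrete, here is a hedged sketch using the ST HAL SPI DMA callbacks and a FreeRTOS queue; hspi1, the buffer size and the queue depth are illustrative assumptions, and the DMA channel is assumed to be configured in circular mode:

```c
/* Sketch: double-buffered SPI RX via circular DMA, blocks handed to a task. */
#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"
#include "stm32f1xx_hal.h"

#define SAMPLES 256
static uint16_t dma_buf[SAMPLES];    /* DMA cycles over this forever */
static QueueHandle_t block_queue;    /* carries pointers to ready halves */

extern SPI_HandleTypeDef hspi1;      /* configured elsewhere (assumption) */

void acquisition_start(void)
{
    block_queue = xQueueCreate(4, sizeof(uint16_t *));
    HAL_SPI_Receive_DMA(&hspi1, (uint8_t *)dma_buf, SAMPLES);
}

/* First half ready while DMA fills the second half. */
void HAL_SPI_RxHalfCpltCallback(SPI_HandleTypeDef *hspi)
{
    (void)hspi;
    uint16_t *half = &dma_buf[0];
    BaseType_t woken = pdFALSE;
    xQueueSendFromISR(block_queue, &half, &woken);
    portYIELD_FROM_ISR(woken);
}

/* Second half ready while DMA wraps around to the first half again. */
void HAL_SPI_RxCpltCallback(SPI_HandleTypeDef *hspi)
{
    (void)hspi;
    uint16_t *half = &dma_buf[SAMPLES / 2];
    BaseType_t woken = pdFALSE;
    xQueueSendFromISR(block_queue, &half, &woken);
    portYIELD_FROM_ISR(woken);
}

/* Task context: blocks until a 128-sample half is ready (~1.28 ms at 100 ksps). */
static void sample_task(void *arg)
{
    (void)arg;
    for (;;) {
        uint16_t *block;
        xQueueReceive(block_queue, &block, portMAX_DELAY);
        /* ... non-deterministic processing, file writes, etc. ... */
    }
}
```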

How does an OS with a 1 kHz tick rate make nanosecond measurements?

I've worked on low-level devices such as microcontrollers, and I understand how timers and interrupts work; but I was wondering how an operating system with a 1 kHz tick rate is able to give time measurements of events down to nanosecond precision. Are they using timers with input capture or something like that? Is there a limit to how many measurements you can take at once?
On modern x86 platforms, the TSC can be used for high-resolution time measurement, and even to generate high-resolution interrupts (that's a different interface of course, but it counts in TSC ticks). There is no inherent limit to the number of parallel measurements. Before the "Invariant TSC" feature was introduced, it was much harder to use the TSC that way, because the TSC frequency changed with the C-state (and so usually went out of sync across cores and was, in any case, hard to use as a stopwatch).
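From user space this simply means that clock_gettime() returns nanosecond-granular values regardless of the 1 kHz tick. A quick sketch to convince yourself, on Linux with the tsc clocksource selected (the measured delta is typically a few tens of nanoseconds):

```c
/* Sketch: nanosecond-resolution timestamps, independent of the scheduler tick.
 * With tsc as the clocksource this is a cheap vDSO call that scales the TSC. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    clock_gettime(CLOCK_MONOTONIC, &b);
    long delta_ns = (b.tv_sec - a.tv_sec) * 1000000000L
                  + (b.tv_nsec - a.tv_nsec);
    printf("back-to-back clock_gettime delta: %ld ns\n", delta_ns);
    return 0;
}
```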

STM32F103 Input Capture Too Slow

I have a high speed clock at 10 MHz going to the processor's TIM4 input capture pin (ch.3). I would like to verify that the clock is running at 10 MHz with the processor's input capture. I coded the processor with the input capture module, and it works fine for lower frequencies (around 1 kHz or so). Once I start to climb the frequency up to the MHz range, the processor starts to miss interrupts and thus gives me an incorrect frequency. I didn't see anywhere in the datasheet that states the maximum frequency that the input capture can read. I have an external clock of 8 MHz, and a core clock of 72 MHz, so I would imagine that I can read a 10 MHz signal. Any ideas?
Take a look at the TIM_ICInitStructure.TIM_ICPrescaler options. Usually you'll have it set to TIM_ICPSC_DIV1 so that interrupts are generated on every valid transition.
Prescaler values of 1, 2, 4 and 8 are available, which allow you to effectively reduce the rate of interrupt generation by that factor. For example, for a 10 MHz signal with a prescaler of 8 you'd expect to count a frequency of 10 MHz / 8 = 1.25 MHz.
This is still quite tight for a 72MHz HCLK so you'll still need to optimise your IRQ handler carefully.
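A sketch of that configuration with the Standard Peripheral Library (which your TIM_ICInitStructure suggests you are using); the channel, polarity and filter settings below are assumptions, and the RCC/GPIO setup is omitted:

```c
/* Sketch: TIM4 CH3 input capture with the input prescaler set so only
 * every 8th edge generates a capture/interrupt. Clock and pin setup omitted. */
#include "stm32f10x_tim.h"

void tim4_ic_init_div8(void)
{
    TIM_ICInitTypeDef ic;
    TIM_ICStructInit(&ic);
    ic.TIM_Channel     = TIM_Channel_3;
    ic.TIM_ICPolarity  = TIM_ICPolarity_Rising;
    ic.TIM_ICSelection = TIM_ICSelection_DirectTI;
    ic.TIM_ICPrescaler = TIM_ICPSC_DIV8;   /* 10 MHz in -> 1.25 MHz captures */
    ic.TIM_ICFilter    = 0;
    TIM_ICInit(TIM4, &ic);

    TIM_ITConfig(TIM4, TIM_IT_CC3, ENABLE);
    TIM_Cmd(TIM4, ENABLE);
}
```

Even at 1.25 MHz the capture interrupt leaves under 60 cycles of a 72 MHz core per event, so the handler must stay minimal.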
Looks like you're generating an interrupt request for every rising (or falling) edge of the clock.
If that is indeed the case, then think about this for a second: with a 10 MHz input signal, you're generating an interrupt about every 7 CPU cycles. In those 7 CPU cycles, you need to budget time to save registers to RAM, run the IRQ handler prologue, run the actual code you wrote for the interrupt handler, run the IRQ handler epilogue, and restore the registers.
Best case, if you set compiler flags to optimize for speed and you're not doing much processing in the interrupt handler, you're looking at tens of cycles to run all these tasks. Since you only have 7 cycles to run tens of cycles' worth of processing, it's no surprise that you're missing interrupts.
You can't use an interrupt routine at that frequency. You need to feed the 10MHz in as an external trigger to the timer. Then you can use the prescaler and the timer to divide down to a suitable lower interrupt frequency.

Idle state in RTOS, sleep state or lowest frequency?

In real-time systems using an RTOS, how would the RTOS handle an idle period? Would it run NOP instructions at the lowest frequency supported by a processor capable of Dynamic Voltage Scaling, or would it switch to a sleep state? Can anyone refer me to actual practical implementations? Thanks.
It will depend entirely on the target hardware and possibly the needs and design of the application. For example on ARM Cortex-M you would typically invoke the WFI instruction which shuts down the core until the occurrence of an interrupt.
In many microcontroller/SoC cases, reducing the PLL clock frequency would affect the on-chip peripherals from which hardware interrupts might occur, so that is less likely. It would affect baud rates and timer resolution, and is perhaps hard to manage easily. There is a paper here on a tickless idle power management method on FreeRTOS/Cortex-M3.
In most cases the idle loop source is provided as part of the board-support, so you can customise it to your needs.
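As a concrete example of such customisation, a common pattern on FreeRTOS/Cortex-M is to put WFI in the idle hook (this assumes configUSE_IDLE_HOOK is set to 1 in FreeRTOSConfig.h; the header that provides __WFI() varies by vendor pack):

```c
/* Sketch: sleep-on-idle for a Cortex-M target running FreeRTOS. */
#include "FreeRTOS.h"
#include "task.h"
#include "cmsis_compiler.h"   /* __WFI(); often pulled in via the device header instead */

void vApplicationIdleHook(void)
{
    /* Halt only the core until the next interrupt; peripheral clocks keep
     * running, so baud rates and timer resolution are unaffected. */
    __WFI();
}
```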