STM32 SPI bandwith evaluation procedure - stm32

I'm testing the SPI capabilities of STM32H7. For this I'm using the SPI examples provided in STM32CubeH7 on 2 Nucleo-H743ZI boards. I will perhaps not keep this code in my own development, rigth now the goal is to understand how SPI is working and what bandwith I can get in the different modes (with DMA, with cache enabled or not, etc...).
I'd like to share the figures I've computed, as it doesn't seem very high. In the example, if I understood correctly, the CPU is # 400Mhz and the SPI bus frequency # 100MHz.
For polling mode I've measured the number of cycles of the call to function HAL_SPI_TransmitReceive.
For DMA I've measured between call to HAL_SPI_TransmitReceive_DMA and call to the transfer complete callback.
Measurements of cycles where made with SysTick clocked on internal clock. Since there is no low power usage, it should be accurate.
I've just modified ST's examples to send a buffer of 1KB.
I get around 200.000 CPU cycles in polling mode, which means around 2MB/s
And around 3MB/s in DMA mode.
Since the SPI clock runs at 100Mhz I would have expected much more, especially in DMA mode, what do you think ? Is there something wrong in my test procedure ?

Related

STM32F4 Timer Triggered DMA SPI – NSS Problem

I have a STM32F417IG microcontroller an external 16bit-DAC (TI DAC81404) that is supposed to generate a Signal with a sampling rate of 32kHz. The communication via SPI should not involve any CPU resources. That is why I want to use a timer triggered DMA to shift the data with a rate of 32kHz to the SPI data register in order to send the data to the DAC.
Information about the DAC
Whenever the DAC receives a channel address and the new corresponding 16bit value the DAC will renew its output voltage to the new received value. This is achieved by:
Pulling the CS/NSS/SYNC – pin to low
Sending the 24bit/3 byte long message and
Pulling the CS back to a high state
The first 8bit of the message are containing among other information the information where the output voltage should be applied. The next and concurrently the last 16bit are containing the new value.
Information about STM32
Unfortunately the microcontroller of ST are having a hardware problem with the NSS-pin. Starting the communication via SPI the NSS-pin is pulled low. Now the pin is low as long as SPI is enabled (. (reference manual page 877). That is sadly not the right way for communicate with device that are in need of a rise of the NSS after each message. A “solution” would be to toggle the NSS-pin manually as suggested in the manual (When a master is communicating with SPI slaves which need to be de-selected between transmissions, the NSS pin must be configured as GPIO or another GPIO must be used and toggled by software.)
Problem
If DMA is used the ordinary way the CPU is only used when starting the process. By toggling the NSS twice every 1/32000 s this leads to corresponding CPU interactions.
My question is whether I missed something in order to achieve a communication without CPU.
If not my goal is now to reduce the CPU processing time to a minimum. My pIan is to trigger DMA with a timer. So every 1/32k seconds the data register of SPI is filled with the 24bit data for the DAC.
The NSS could be toggled by a timer interrupt.
I have problems achieving it because I do not know how to link the timer with the DMA of the SPI using HAL-functions. Can anyone help me?
This is a tricky one. It might be difficult to avoid having one interrupt per sample with this combination of DAC and microcontroller.
However, one approach I would look at is to have the CS signal created as a timer output-compare (like PWM). You can use multiple channels of the same timer or link multiple timers to create a delay between the CS output and the DMA trigger. You should allow some room for jitter, because depending on what else is happening the DMA might not respond instantly. This won't hurt your DAC output signal though, because it only outputs the value on the rising edge of chip select (called SYNC in the DAC datasheet) which will still be from your first timer.

Is there a trade-off for memory to memory DMA transfer when the data size is small?

I am learning about the STM32 F4 microcontroller. I'm trying to find out about limitations for using DMA.
Per my understanding and research, I know that if the data size is small (that is, the device uses DMA to generate or consume a small amount of data), the overhead is increased because DMA transfer requires the DMA controller to perform operations, thereby unnecessarily increasing system cost.
I did some reaserch and found the following:
Limitation of DMA
CPU puts all its lines at high impedance state so that the DMA controller can then transfer data directly between device and memory without CPU intervention. Clearly, it is more suitable for device with high data transfer rates like a disk.
Over a serial interface, data is transferred one bit at a time which makes it slow to use DMA.
Is that correct? What else do I need to know?
DMA -CPU puts all its lines at high impedance state
I do not know where did you take it from - but you should not use this source any more.
Frequency of the DMA transfers do not matter unless you reach the the BUS throughput. you can transfer one byte per week, month, year, decade ..... and it is absolutely OK.
In the STM32 microcontrollers it is a very important feature as we can transfer data from/to external devices even if the uC is in low power mode with the core (CPU) sleeping. DMA controller can even wake up the core when some conditions are met.
As #Vinci and #0___________ (f.k.a. #P__J__) already pointed out,
A DMA controller works autonomously and doesn't create overhead on the CPU it supplements (at least not by itself). But:
The CPU/software must perform some instructions to configure the DMA and to trigger it or have it triggered by some peripheral. For this, it needs CPU time and program memory space (usually ROM). Besides, it usually needs some additional RAM in variables to manage the software around the DMA.
Hence, you are right, using a DMA comes with some kinds of overhead.
And furthermore,
The DMA transfers make use of the memory bus(es) that connect the involved memories/registers/peripherals to the DMA controller. That is, while the DMA controller does its own work, it may cause the CPU which it tries to offload to stall in the meantime, at least for short moments when the data words are transferred (which in turn sum up for longer transfers...).
On the other hand, a DMA doesn't only help you to reduce the CPU load (regarding total CPU time to implement some feature). If used "in a smart way", it helps you to reduce software latencies to implement different functions because one part of the implementation can be "hidden" behind the DMA-driven data transfer of another part (unless, both rely on the same bus resources - see above...).
The information is right in that using a DMA requires some development work and some runtime to manage the DMA transfer itself (see also
a related question
here), which may not be worth the benefits of using DMA. That is, for small portions of data one doesn't gain as much performance (or latency) as during big transfers. On embedded systems, DMA controllers (and their channels) are limited resources so it is important to consider which part of the function benefits from such a resource most. Therefore, one would usually prefer using DMA for the data transfers to/from disks (if it is about "payload data" such as large files or video streams) over slow serial connections.
The information is wrong, however, in that DMA is not worth using on serial interfaces as those only transfer a single bit at a time. Please note that microcontrollers (as your
STM32F4)
have built-in peripheral components that convert the serial bit-by-bit stream into a byte-by-byte or word-by-word stream, which can easily be tranferred by DMA in a helpful way - especially if the size of the packets is known in advance and software doesn't have to analyse a non-formatted stream. Furthermore, not every serial connection is "slow" at all. If the project uses, e. g., an SPI flash chip, then the SPI serial connection is the one used for data transfer.

How to run a periodic thread in high frequency(> 100kHz) in a Cortex-M3 microcontroller in an RTOS?

I'm implementing a high frequency(>100kHz) Data acquisition system with an STM32F107VC microcontroller. It uses the spi peripheral to communicate with a high frequency ADC chip. I have to use an RTOS. How can I do this?
I have tried FreeRTOS but its maximum tick frequency is 1000Hz so I can't run a thread for example every 1us with FreeRTOS. I also tried Keil RTX5 and its tick frequency can be up to 1MHz but I studied somewhere that it is not recommended to set the tick frequency high because it increases the overall context switching time. So what should I do?
Thanks.
You do not want to run a task at this frequency. As you mentioned, context switches will kill the performance. This is horribly inefficient.
Instead, you want to use buffering, interrupts and DMA. Since it's a high frequency ADC chip, it probably has an internal buffer of its own. Check the datasheet for this. If the chip has a 16 samples buffer, a 100kHz sampling will only need processing at 6.25kHz. Now don't use a task to process the samples at 6.25kHz. Do the receiving in an interrupt (timer or some signal), and the interrupt should only fill a buffer, and wake up a task for processing when the buffer is full (and switch to another buffer until the task has finished). With this you can have a task that runs only every 10ms or so. An interrupt is not a context switch. On a Cortex-M3 it will have a latency of around 12 cycles, which is low enough to be negligible at 6.25kHz.
If your ADC chip doesn't have a buffer (but I doubt that), you may be ok with a 100kHz interrupt, but put as little code as possible inside.
A better solution is to use a DMA if your MCU supports that. For example, you can setup a DMA to receive from the SPI using a timer as a request generator. Depending on your case it may be impossible or tricky to configure, but a working DMA means that you can receive a large buffer of samples without any code running on your MCU.
I have to use an RTOS.
No way. If it's a requirement by your boss or client, run away from the project fast. If that's not possible, communicate your concerns in writing now to save your posterior when the reasons of failure will be discussed. If it's your idea, then reconsider now.
The maximum system clock speed of the STM32F107 is 36 MHz (72 if there is an external HSE quartz), meaning that there are only 360 to 720 system clock cycles between the ticks coming at 100 kHz. The RTX5 warning is right, a significant amount of this time would be required for task switching overhead.
It is possible to have a timer interrupt at 100 kHz, and do some simple processing in the interrupt handler (don't even think about using HAL), but I'd recommend investigating first whether it's really necessary to run code every 10 μs, or is it possible to offload something that it would do to the DMA or timer hardware.
Since you only have a few hundred cycles (instructions) between input, the typical solution is to use an interrupt to be alerted that data is available, and then the interrupt handler put the data somewhere so you can process them at your leisure. Of course if the data comes in continuously at that rate, you maybe in trouble with no time for actual processing. Depending on how much data is coming in and how frequent, a simple round buffer maybe sufficient. If the amount of data is relatively large (how large is large? Consider that it takes more than one CPU cycle to do a memory access, and it takes 2 memory accesses per each datum that comes in), then using DMA as #Elderbug suggested is a great solution as that consumes the minimal amount of CPU cycles.
There is no need to set the RTOS tick to match the data acquisition rate - the two are unrelated. And to do so would be a very poor and ill-advised solution.
The STM32 has DMA capability for most peripherals including SPI. You need to configure the DMA and SPI to transfer a sequence of samples directly to memory. The DMA controller has full and half transfer interrupts, and can cycle a provided buffer so that when it is full, it starts again from the beginning. That can be used to "double buffer" the sample blocks.
So for example if you use a DMA buffer of say 256 samples and sample at 100Ksps, you will get a DMA interrupt every 1.28ms independent of the RTOS tick interrupt and scheduling. On the half-transfer interrupt the first 128 samples are ready for processing, on the full-transfer, the second 128 samples can be processed, and in the 1.28ms interval, the processor is free to do useful work.
In the interrupt handler, rather then processing all the block data in the interrupt handler - which would not in any case be possible if the processing were non-deterministic or blocking, such as writing it to a file system - you might for example send the samples in blocks via a message queue to a task context that performs the less deterministic processing.
Note that none of this relies on the RTOS tick - the scheduler will run after any interrupt if that interrupt calls a scheduling function such as posting to a message queue. Synchronising actions to an RTOS clock running asynchronously to the triggering event (i.e. polling) is not a good way to achieve highly deterministic real-time response and is a particularly poor method for signal acquisition, which requires a jitter free sampling interval to avoid false artefacts in the signal from aperiodic sampling.
Your assumption that you need to solve this problem by an inappropriately high RTOS tick rate is to misunderstand the operation of the RTOS, and will probably only work if your processor is doing no other work beyond sampling data - in which case you might not need an RTOS at all, but it would not be a very efficient use of the processor.

Optimize power consumption with STM32L4 ADC

I'm working on a firmware development on a STM32L4. I need to sample an analog signal at around 200Hz. So basically one analog to digital conversion every 5ms.
Up to now, I was starting the ADC in continuous conversion mode, triggered by a timer. However this prevents to put the STM32 in Stop mode in between conversions, which would be very efficient in terms of power consumption since 99%+ ot the time the product has nothing to do.
So my idea is to use the single conversion mode: use a low power timer to wakeup the product from Stop mode every 5ms, launch a single conversion in the LPTIM interrupt handler (waiting for ADC end of conversion in polling), and go back to Stop mode.
Do you think it makes sense or do you see problems to proceed like this ? I'm not sure about polling for a single ADC conversion inside a handler, what do you think ? I think a single conversion on one channel should be pretty fast (I run at 80MHz, the datasheet mentions a maximum sampling time of 8us)
Do I have to disable/enable ADC (the bit ADEN) between each single conversion ?
Also, I have to know how long a single conversion lasts to assess whether the solution is interesting or not. I'm confused about the sampling time (bits SMP). The reference manual states: "This sampling time must be enough for the input voltage source to charge the embedded capacitor to the input voltage level." What is the way to find the right SMP value ?
There are no problems with the general idea, LPTIM1 can generate wakeup events through the EXTI controller even in Stop2 mode.
I'm not sure about polling for a single ADC conversion inside a handler, what do you think ?
You might want to put the MCU in Sleep mode in the timer interrupt, and have the ADC trigger an interrupt when the conversion is complete. So disable SLEEPDEEP in the timer interrupt, and enable it in the ADC interrupt.
What is the way to find the right SMP value ?
Empirical method: start with the longest sampling time, and start decreasing it. When the conversion result significantly changes, go one or two steps back.

Idle state in RTOS, sleep state or lowest frequency?

In real time systems using an RTOS, what how would the RTOS handle an idle period? Would it run nop instructions at the lowest frequency supported by a Dynamic Voltage Scaling capable processor? or would it turn to a sleep state? Can anyone refer me to actual practical implementations. Thanks
It will depend entirely on the target hardware and possibly the needs and design of the application. For example on ARM Cortex-M you would typically invoke the WFI instruction which shuts down the core until the occurrence of an interrupt.
In many microcontroller/SoC cases, reducing the PLL clock frequency would affect the on-chip peripherals from which hardware interrupts might occur, so that is less likely. It would affect baud rates and timer resolution, and is perhaps hard to manage easily. There is a paper here on a tickless idle power management method on FreeRTOS/Cortex-M3.
In most cases the idle loop source is provided as part of the board-support, so you can customise it to your needs.