Optimal size for a ring buffer with single producer and single consumer - real-time

I have a single producer, single consumer problem which (I believe) can be solved using a circular/ring buffer.
I have a micro-controller running an RTOS, with an ISR (Interrupt Service Routine) handling UART (serial port) interrupts. When the UART raises an interrupt, the ISR posts the received characters into the circular buffer. In another RTOS task (called packet_handler), I read from this circular buffer and run a state machine to decode the packet. A valid packet is 64 bytes including all the framing bytes.
The UART operates at 115200 baud, and a packet arrives every 10 ms. The packet_handler runs every 10 ms. However, this handler may sometimes get delayed by the scheduler because a higher priority task is executing.
If I use an arbitrarily large circular buffer, there are no packet drops. But how do I determine the optimal circular buffer size in this case, at least theoretically? How is this decision about buffer size made in practice?
Currently, I am detecting overrun of the buffer through some instrumentation functions and then increasing the buffer size to reduce packet loss.

You won't be safe, ever, as you are dealing with a stochastic process (according to your explanation).
Answering your question: you would need an infinite buffer to cover the case where the consumer task sits in the ready state for an unbounded time. So you will have to change something in your initial approach:
Increase the priority of the consumer, in order to guarantee the 10 ms execution (the smallest-buffer approach, but it may not be possible).
Try to get a better characterization of your model, in order to predict the maximum period during which the consumer task won't execute (make your system as predictable as possible).
Accept losing packets with an arbitrarily chosen buffer size (it may not be safe).

I would calculate it this way:
64 bytes just received
64 bytes still in the ring buffer
+100% margin to be safe
===================
256-byte buffer
But this is just a guess. You have to do some worst-case tests with your buffer and then add a 100% margin to be safe.

While all of the above answers are correct and throw light on the issue, this page summarizes all the factors to be considered when choosing the size of a ring buffer.
Some queuing models can be used to analyze the problem theoretically and find a suitable ring buffer size.
A more pragmatic approach is to start with a large buffer, then find the maximum buffer usage in real test cases (this process is called watermarking) and use that figure in the final code.
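
For illustration, a minimal watermarking sketch, assuming a power-of-two ring buffer with head/tail indices; the ring_t, ring_count and ring_update_watermark names are made up for this example:

#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 256u                     /* must be a power of two */

typedef struct {
    uint8_t data[RING_SIZE];
    volatile size_t head;                  /* written by the ISR (producer) */
    volatile size_t tail;                  /* written by the task (consumer) */
    size_t high_water;                     /* maximum observed fill level */
} ring_t;

static size_t ring_count(const ring_t *r)
{
    return (r->head - r->tail) & (RING_SIZE - 1u);
}

/* Called from the UART ISR right after it stores a byte at r->head. */
static void ring_update_watermark(ring_t *r)
{
    size_t used = ring_count(r);
    if (used > r->high_water)
        r->high_water = used;              /* read this out after a long soak test */
}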

It is simply a matter of determining the maximum possible delay - the sum of the execution times of all higher priority tasks that can run - and dividing that by the packet arrival interval.
This may not be straightforward, but can be simplified by making only the most deterministic tasks higher priority and moving less deterministic, longer-running tasks to lower priorities, according to rate-monotonic principles. Tasks that frequently run for a short time but sporadically run for longer are candidates for being split into two tasks (with further queues) to offload the longer execution to a lower priority.
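
As a hedged worked example (the 30 ms worst-case delay below is an assumed figure that has to come from your own schedulability analysis or measurements, not from the question):

/* Sizing sketch: packets that can pile up during the worst-case delay,
 * plus the one arriving while the handler finally runs, rounded up. */
#define PACKET_SIZE_BYTES      64u
#define PACKET_INTERVAL_MS     10u
#define WORST_CASE_DELAY_MS    30u   /* assumed: sum of higher-priority task execution times */

#define PACKETS_BUFFERED  ((WORST_CASE_DELAY_MS + PACKET_INTERVAL_MS - 1u) / PACKET_INTERVAL_MS + 1u)

#define RING_BUFFER_SIZE  (PACKETS_BUFFERED * PACKET_SIZE_BYTES)   /* = 4 * 64 = 256 bytes */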


How to run a periodic thread at high frequency (>100 kHz) on a Cortex-M3 microcontroller in an RTOS?

I'm implementing a high-frequency (>100 kHz) data acquisition system with an STM32F107VC microcontroller. It uses the SPI peripheral to communicate with a high-frequency ADC chip. I have to use an RTOS. How can I do this?
I have tried FreeRTOS, but its maximum tick frequency is 1000 Hz, so I can't run a thread every 1 µs, for example. I also tried Keil RTX5; its tick frequency can be up to 1 MHz, but I have read that it is not recommended to set the tick frequency that high because it increases the overall context-switching overhead. So what should I do?
Thanks.
You do not want to run a task at this frequency. As you mentioned, context switches will kill the performance. This is horribly inefficient.
Instead, you want to use buffering, interrupts and DMA. Since it's a high-frequency ADC chip, it probably has an internal buffer of its own; check the datasheet for this. If the chip has a 16-sample buffer, sampling at 100 kHz only needs servicing at 6.25 kHz. Don't use a task to collect the samples at 6.25 kHz either: do the receiving in an interrupt (from a timer or some signal), have the interrupt only fill a buffer, and wake up a task for processing when the buffer is full (switching to another buffer until the task has finished). With this you can have a task that runs only every 10 ms or so. An interrupt is not a context switch; on a Cortex-M3 it will have a latency of around 12 cycles, which is low enough to be negligible at 6.25 kHz.
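
A minimal sketch of that double-buffering scheme, assuming FreeRTOS; the block size, the names and the read_adc_fifo() helper are made up for illustration:

#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"

#define BLOCK_SAMPLES 1024u                    /* ~10 ms of data at 100 ksps */

static uint16_t blocks[2][BLOCK_SAMPLES];      /* two sample buffers, swapped by the ISR */
static volatile uint8_t active = 0;            /* index of the buffer currently being filled */
static TaskHandle_t processing_task;           /* created elsewhere */

/* Hypothetical helper: drains the ADC's internal FIFO over SPI, returns samples read. */
extern uint32_t read_adc_fifo(uint16_t *dst, uint32_t max);

/* Runs at 6.25 kHz (e.g. from a timer interrupt), NOT as an RTOS task. */
void adc_block_isr(void)
{
    static uint32_t fill = 0;

    fill += read_adc_fifo(&blocks[active][fill], BLOCK_SAMPLES - fill);
    if (fill == BLOCK_SAMPLES) {               /* block full: hand it over and swap */
        active ^= 1u;
        fill = 0;
        BaseType_t woken = pdFALSE;
        vTaskNotifyGiveFromISR(processing_task, &woken);
        portYIELD_FROM_ISR(woken);
    }
}

/* Task context: woken roughly every 10 ms with a complete block to process. */
void processing_task_fn(void *arg)
{
    (void)arg;
    for (;;) {
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
        const uint16_t *ready = blocks[active ^ 1u];  /* the block the ISR just finished */
        /* ... process BLOCK_SAMPLES samples before the other block fills up ... */
        (void)ready;
    }
}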
If your ADC chip doesn't have a buffer (but I doubt that), you may be ok with a 100kHz interrupt, but put as little code as possible inside.
A better solution is to use DMA if your MCU supports it. For example, you can set up a DMA transfer that receives from the SPI, using a timer as the request generator. Depending on your case it may be impossible or tricky to configure, but a working DMA setup means you can receive a large buffer of samples without any code running on your MCU.
I have to use an RTOS.
No way. If it's a requirement from your boss or client, run away from the project fast. If that's not possible, communicate your concerns in writing now to save your posterior when the reasons for the failure are discussed later. If it's your own idea, then reconsider now.
The maximum system clock speed of the STM32F107 is 36 MHz (72 if there is an external HSE quartz), meaning that there are only 360 to 720 system clock cycles between the ticks coming at 100 kHz. The RTX5 warning is right, a significant amount of this time would be required for task switching overhead.
It is possible to have a timer interrupt at 100 kHz and do some simple processing in the interrupt handler (don't even think about using HAL), but I'd recommend first investigating whether it's really necessary to run code every 10 μs, or whether some of that work can be offloaded to the DMA or timer hardware.
Since you only have a few hundred cycles (instructions) between inputs, the typical solution is to use an interrupt to be alerted that data is available, and have the interrupt handler put the data somewhere so you can process it at your leisure. Of course, if the data comes in continuously at that rate, you may be in trouble, with no time left for actual processing. Depending on how much data is coming in and how frequently, a simple ring buffer may be sufficient. If the amount of data is relatively large (how large is large? Consider that it takes more than one CPU cycle to do a memory access, and at least 2 memory accesses per datum that comes in), then using DMA as @Elderbug suggested is a great solution, as it consumes the minimum number of CPU cycles.
There is no need to set the RTOS tick to match the data acquisition rate - the two are unrelated. And to do so would be a very poor and ill-advised solution.
The STM32 has DMA capability for most peripherals including SPI. You need to configure the DMA and SPI to transfer a sequence of samples directly to memory. The DMA controller has full and half transfer interrupts, and can cycle a provided buffer so that when it is full, it starts again from the beginning. That can be used to "double buffer" the sample blocks.
So for example if you use a DMA buffer of say 256 samples and sample at 100Ksps, you will get a DMA interrupt every 1.28ms independent of the RTOS tick interrupt and scheduling. On the half-transfer interrupt the first 128 samples are ready for processing, on the full-transfer, the second 128 samples can be processed, and in the 1.28ms interval, the processor is free to do useful work.
In the interrupt handler, rather than processing all the block data there - which would not in any case be possible if the processing were non-deterministic or blocking, such as writing it to a file system - you might for example send the samples in blocks via a message queue to a task context that performs the less deterministic processing.
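
A hedged sketch of that pattern using the ST HAL's SPI DMA callbacks together with a FreeRTOS queue (the buffer size and the adc_samples / sample_queue / hspi1 names are illustrative; a register-level DMA setup would follow the same half/full-transfer structure):

#include <stdint.h>
#include "stm32f1xx_hal.h"
#include "FreeRTOS.h"
#include "queue.h"

#define DMA_SAMPLES 256u                       /* 2.56 ms of data at 100 ksps */

static uint16_t adc_samples[DMA_SAMPLES];      /* circular DMA buffer */
static QueueHandle_t sample_queue;             /* holds pointers to 128-sample half-blocks */
extern SPI_HandleTypeDef hspi1;                /* configured elsewhere with a circular DMA channel */

void acquisition_start(void)
{
    sample_queue = xQueueCreate(8, sizeof(uint16_t *));
    /* Start a circular SPI+DMA reception; the DMA hardware refills the
     * buffer endlessly and raises half- and full-transfer interrupts. */
    HAL_SPI_Receive_DMA(&hspi1, (uint8_t *)adc_samples, DMA_SAMPLES);
}

/* Half transfer: the first half is stable while DMA fills the second half. */
void HAL_SPI_RxHalfCpltCallback(SPI_HandleTypeDef *hspi)
{
    (void)hspi;
    uint16_t *block = &adc_samples[0];
    BaseType_t woken = pdFALSE;
    xQueueSendFromISR(sample_queue, &block, &woken);
    portYIELD_FROM_ISR(woken);
}

/* Full transfer: the second half is stable while DMA wraps to the first half. */
void HAL_SPI_RxCpltCallback(SPI_HandleTypeDef *hspi)
{
    (void)hspi;
    uint16_t *block = &adc_samples[DMA_SAMPLES / 2u];
    BaseType_t woken = pdFALSE;
    xQueueSendFromISR(sample_queue, &block, &woken);
    portYIELD_FROM_ISR(woken);
}

/* Task context: blocks on the queue, wakes every 1.28 ms with a fresh half-block. */
void sample_task(void *arg)
{
    (void)arg;
    for (;;) {
        uint16_t *block;
        if (xQueueReceive(sample_queue, &block, portMAX_DELAY) == pdTRUE) {
            /* ... process DMA_SAMPLES/2 samples from 'block' ... */
        }
    }
}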
Note that none of this relies on the RTOS tick - the scheduler will run after any interrupt if that interrupt calls a scheduling function such as posting to a message queue. Synchronising actions to an RTOS clock running asynchronously to the triggering event (i.e. polling) is not a good way to achieve highly deterministic real-time response and is a particularly poor method for signal acquisition, which requires a jitter free sampling interval to avoid false artefacts in the signal from aperiodic sampling.
Your assumption that you need to solve this problem with an inappropriately high RTOS tick rate reflects a misunderstanding of how the RTOS operates, and it would probably only work if your processor were doing no other work beyond sampling data - in which case you might not need an RTOS at all, and it would not be a very efficient use of the processor.

Why do we need to specify the number of flash wait cycles?

Especially when working with "faster" devices like the STM32F4xx/F7xx, we need to specify the number of flash wait cycles, based on the supply voltage and the system clock frequency.
When the CPU fetches instructions or constants, this is done through the FLITF (flash interface). Am I right in assuming that the FLITF stalls a CPU request until it can provide the requested data, making it impossible for other bus masters to access the flash in the meantime?
If that is true, why would the interface need to know the number of flash wait cycles at all? The cache prefetches instructions anyway, regardless of whether it knows how long it has to wait, doesn't it?
Because the flash interface isn't magic.
It has to meet the necessary setup and hold times for addressing and reading out the flash cells, which will vary somewhat depending on voltage. Taking the STM32F411 as an example (because I have that TRM handy), doing some maths with the voltage/frequency/wait-state table implies that a read from flash on one of those takes in the order of ~30ns above 2.7V, down to ~60ns below 2.1V.
Since the flash interface doesn't have its own asynchronous nanosecond-precision timekeeping ability (because that would be needlessly complicated, power-hungry, and silly), that translates to asserting its signals for n clock cycles, after which it can assume the data signals from the cells are stable enough to read back*. How does it know what the clock frequency is, and therefore what n should be? Simple: you, as the programmer who set the clock, tell it. Some hardware things are just infinitely easier to let software deal with.
* and then going through the further shenanigans of extracting the relevant 8, 16 or 32 bits out of the 128-bit line it's read, to finally spit that out the other side onto the AHB bus to the waiting CPU, obviously.
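
For illustration, a hedged sketch of how that number is handed to the flash interface on an STM32F4, using register and bit names from the CMSIS device header; the 3-wait-state value assumes an STM32F411 running at its 100 MHz maximum from a 2.7-3.6 V supply and must be looked up in the table for your own part, clock and voltage:

#include "stm32f4xx.h"

/* Tell the flash interface how many wait states to insert, and enable the
 * prefetch and caches that hide most of that latency, BEFORE raising SYSCLK. */
void flash_set_waitstates(void)
{
    FLASH->ACR = FLASH_ACR_LATENCY_3WS      /* assumed: STM32F411 at 100 MHz, 2.7-3.6 V */
               | FLASH_ACR_PRFTEN           /* prefetch the next 128-bit flash line */
               | FLASH_ACR_ICEN             /* instruction cache */
               | FLASH_ACR_DCEN;            /* data cache */

    /* The reference manual recommends reading the register back to confirm
     * the new latency has taken effect before increasing the clock. */
    while ((FLASH->ACR & FLASH_ACR_LATENCY) != FLASH_ACR_LATENCY_3WS) { }
}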

ALSA passthrough latency

I am working on an embedded Linux application with audio passthrough using ALSA. It has very stringent latency requirements.
The output buffer is as small as possible, which results in an occasional (perhaps once an hour) underrun on the output. This is acceptable. However, when it occurs, it causes a "backup" in the capture buffer, and the result is a creeping increase in latency.
There doesn't seem to be a reliable way to know how much output data was lost in order to discard the same amount of input. I can experiment, but even though it's an embedded application it needs to be device independent, so we need a reliable solution.
Does anyone know a way to determine how much data was lost, or if it is always one buffer, or have other suggestions?
If you do not want the PCM devices to stop on an underrun/overrun, configure them not to stop by setting the stop threshold to the boundary value. Then they will just continue to run, and the number of available frames will continue to increase (for capture) or decrease (for playback). (Not all of those frames will be usable; the ring buffer just wraps around.)
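
A minimal sketch of that configuration with alsa-lib, assuming pcm is an already-opened capture or playback handle and error checking is omitted:

#include <alsa/asoundlib.h>

/* Keep the PCM running through xruns by raising the stop threshold to the
 * boundary value, so the available-frames counter keeps running instead of
 * the device entering the XRUN state. */
static void disable_xrun_stop(snd_pcm_t *pcm)
{
    snd_pcm_sw_params_t *sw;
    snd_pcm_uframes_t boundary;

    snd_pcm_sw_params_alloca(&sw);
    snd_pcm_sw_params_current(pcm, sw);            /* start from the current settings */
    snd_pcm_sw_params_get_boundary(sw, &boundary); /* the "never stop" magic value */
    snd_pcm_sw_params_set_stop_threshold(pcm, sw, boundary);
    snd_pcm_sw_params(pcm, sw);                    /* apply */
}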

How to run a multi-queue code using OpenCL?

For example, I'm doing some image processing work on every frame of a video.
Processing each frame takes 200 ms, including writing, processing and reading.
The video is 25 fps, so consecutive frames are 40 ms apart; the processing is too slow to show a continuous result.
So here is my idea: I use multiple queues for this work.
On the CPU side:
while (video is not over)
{
    1. read frame0;
       process frame0 using queue0;
       wait 40 ms;
    2. read frame1;
       process frame1 using queue1;
       wait 40 ms;
    3. 4. 5. ... (same for frame2..frame4 using queue2..queue4;
       after 5 frames, roughly the 200 ms processing time has passed)
    6. download frame0's result;
    7. read frame5;
       process frame5 using queue0;
       wait 40 ms;
    ...
}
The code means that I use a different queue for reading and processing each frame of the video.
My experiment does run faster with this, but only about 2 times faster, not as fast as I expected.
Can anyone tell me how to deal with this? Thanks!
Assuming you have one Device, here are some thoughts on this point:
The main reason to have multiple command queues (CQs) per single OpenCL device is the ability to execute kernels and do IO operations simultaneously.
Usually, one CQ is enough to load a single device at ~100%. Still, your multi-CQ idea is good (in my opinion), as you're constantly feeding the GPU with work.
Look at the kernel execution time. It may already be long enough that your device is constantly executing kernels and simply can't go any faster.
I don't think you need to wait 40 ms. A good solution is to process frames in the order in which they are put into the queue, to eliminate the difference between bitstream and display order (a sketch of such an event-driven pipeline follows below).
If you have too many CQs, your OpenCL driver thread will be busy maintaining them, and performance may decrease.
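
A hedged sketch of such a pipeline using the OpenCL C API, with two queues and event dependencies instead of fixed 40 ms waits (buffer/kernel setup, error checking and the frame I/O steps are omitted or assumed):

#include <CL/cl.h>

#define QUEUES 2

/* One in-flight slot per queue: upload -> kernel -> download, chained by events. */
void process_video(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                   cl_mem in_buf[QUEUES], cl_mem out_buf[QUEUES],
                   void *host_in[QUEUES], void *host_out[QUEUES],
                   size_t frame_bytes, size_t gws[2], int num_frames)
{
    cl_command_queue q[QUEUES];
    cl_event done[QUEUES] = { 0 };

    for (int i = 0; i < QUEUES; ++i)
        q[i] = clCreateCommandQueue(ctx, dev, 0, NULL);

    for (int f = 0; f < num_frames; ++f) {
        int s = f % QUEUES;                       /* pipeline slot for this frame */

        if (done[s]) {                            /* slot busy: wait for its previous frame */
            clWaitForEvents(1, &done[s]);
            /* ... consume host_out[s] (result of frame f - QUEUES) ... */
            clReleaseEvent(done[s]);
        }

        /* ... decode frame f into host_in[s] ... */

        cl_event wrote, ran;
        clEnqueueWriteBuffer(q[s], in_buf[s], CL_FALSE, 0, frame_bytes,
                             host_in[s], 0, NULL, &wrote);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf[s]);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf[s]);
        clEnqueueNDRangeKernel(q[s], kernel, 2, NULL, gws, NULL, 1, &wrote, &ran);
        clEnqueueReadBuffer(q[s], out_buf[s], CL_FALSE, 0, frame_bytes,
                            host_out[s], 1, &ran, &done[s]);
        clFlush(q[s]);                            /* start the work, but don't block */

        clReleaseEvent(wrote);
        clReleaseEvent(ran);
    }

    for (int i = 0; i < QUEUES; ++i) {            /* drain the pipeline */
        clFinish(q[i]);
        clReleaseCommandQueue(q[i]);
    }
}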

x264 threading latency

I wonder why sliceless threading (http://akuvian.org/src/x264/sliceless_threads.txt) in x264 leads to latency. If I have, for example, 2 threads, the first encodes one frame and the second encodes another. The second has to wait for the first in some cases, but they can still be encoded in parallel.
So two threads should be faster than only one, right?
Frame threading adds latency measured in frames, not in seconds, because you need to feed the encoder more input frames before you start getting output frames (to fill the pipeline). Encoding one frame takes about the same processor time as with one thread, but threading allows pipelined processing by encoding different frames in parallel. Sliced threading, on the other hand, decreases latency because all threads encode one frame in parallel, so it is finished faster than it would be with one thread (and sliced threading doesn't need a latency in frames for pipelining).
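
For illustration, a hedged sketch of how the two threading modes are selected through the libx264 API (the preset and field values are examples, not tuned settings):

#include <stdint.h>
#include <x264.h>

/* Throughput-oriented: frame (sliceless) threading. Adds latency in frames
 * because the pipeline of in-flight frames must fill before output starts. */
void configure_frame_threads(x264_param_t *p)
{
    x264_param_default_preset(p, "veryfast", NULL);
    p->i_threads = 4;              /* up to 4 frames in flight */
    p->b_sliced_threads = 0;       /* default: frame threading */
}

/* Latency-oriented: sliced threading. All threads work on the same frame,
 * so no extra frames of delay; the "zerolatency" tune also disables lookahead. */
void configure_sliced_threads(x264_param_t *p)
{
    x264_param_default_preset(p, "veryfast", "zerolatency");
    p->i_threads = 4;
    p->b_sliced_threads = 1;       /* split each frame into slices instead */
}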
It took me quite a while to reason through it, but the answer is Queuing Theory.
Each frame can be started when half of the previous frame has been encoded. But if parallelization is going to provide any benefit, most (preferably all) threads should have a frame to work on: 5 threads means 5 frames. That is the pipeline. Any time the pipeline is not completely full, parallelization gives you less of a benefit. If the pipeline contains only one frame, only one thread is working, and you get no benefit from parallelization at all. But if your pipeline is usually full, what is it full of? Unencoded frames. Unencoded frames are frames that must already have been captured, and therefore they represent that many frames' worth of latency. The latency might be slightly less, by a small constant fraction of a frame, because some of the frames in the pipeline are partially encoded, but in general each item in the pipeline contributes to the latency.
One reason for added latency with more threads is that consecutive frames use each other for motion prediction and compensation. That means that in order to compress a frame you need information from the motion estimation of previous frames, so the frames are dependent on each other and sometimes have to wait for at least some data from other threads. This is in contrast with slice threading, where the threads slice up the frame and each works on one slice of the same frame, and they already have all the needed information from previous frames (or the next frame, in the case of B-frames).