UART DMA receive interrupt stops receiving data after several minutes - mutex

I have a project that uses an STM32F746G Discovery board. It receives fixed-size data from the UART sequentially, and a DMA callback (the HAL_UART_RxCpltCallback function) informs the application each time a reception completes. It works fine at the beginning, but after several minutes of running the DMA callback stops being called, and as a result the specified parameter value doesn't get updated. Because the parameter is also used in another thread (actually an RTOS-defined timer), I believe this problem is caused by a lack of thread safety. My problem is that mutexes and semaphores are not supported in ISRs, and I need to protect my variable in the DMA callback, which is an interrupt routine. I am using Keil RTX to handle multithreading, and the timer I use is an osTimer defined in RTX. How can I handle the issue?

Generally, only one thread should communicate with the ISR. If multiple threads are accessing a variable shared with an ISR, your design is wrong and needs to be fixed. In case of DMA, only one thread should access the buffer.
You'll need to protect the variables shared between that thread and the ISR - not necessarily with a mutex/semaphore, but perhaps with something simpler such as guaranteeing atomic access (the best solution if possible), or by relying on the non-interruptible property that many ISRs have, which is an option for simple, single-threaded MCU applications. Alternatively, just temporarily disable interrupts during access, but that may not be possible, depending on real-time requirements.
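As a minimal sketch of the "atomic access" option, assuming a toolchain that provides C11 `<stdatomic.h>`: the ISR only increments an atomic counter, and the single consumer thread compares it against a private count of events it has already handled. The names `rx_isr_notify` and `rx_thread_poll` are hypothetical and not part of the HAL or RTX; in the question's setup, `rx_isr_notify` would be called from HAL_UART_RxCpltCallback and exactly one thread would call `rx_thread_poll`.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of completed receptions; modified only from the ISR side. */
static atomic_uint_fast32_t rx_done_count;

/* Number of completions already handled; private to the consumer thread. */
static uint_fast32_t rx_seen_count;

/* Hypothetical hook: call from the DMA-complete callback (ISR context). */
void rx_isr_notify(void)
{
    atomic_fetch_add_explicit(&rx_done_count, 1, memory_order_release);
}

/* Hypothetical hook: call from the single consumer thread.
 * Returns true once for every completion reported by the ISR. */
bool rx_thread_poll(void)
{
    uint_fast32_t done =
        atomic_load_explicit(&rx_done_count, memory_order_acquire);

    if (done == rx_seen_count) {
        return false;          /* nothing new since the last poll */
    }
    rx_seen_count++;           /* consume exactly one event */
    return true;
}
```

Because the thread never writes the shared counter and the ISR never reads the thread's private count, no lock is needed and the ISR is never blocked.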

STM32MP157F dmaengine: dmaengine_prep_dma_memcpy works, dmaengine_prep_dma_cyclic does not

We have implemented a custom driver that uses DMA to copy a large amount of data from the FMC interface (an FPGA is mapped to it) to RAM, using the STM32 MDMA engine with 32 DMA channels. The FPGA contains a small FIFO that we want to copy the data from.
For very fast data acquisition, the setup time for new DMA transactions becomes critical!
The first implementation used a workqueue to create the next DMA transaction. This could not be done directly from the atomic "dma_completed" context because of some necessary I/O that has to wait. This led to pauses of up to 5 ms between DMA transactions and to buffer overflows in the FPGA's FIFO.
As I am copying from a memory-mapped region to RAM, I am using dmaengine_prep_dma_memcpy.
I implemented a number of improvements that reduced the pause between DMAs:
I am fusing DMA-mapped pages so that fewer DMA transaction entries have to be created and less DMA engine programming is necessary.
I am preparing the next DMA pages up front, so the next DMA transaction can be started directly from the "dma_completed" routine.
I am using a second DMA channel and toggle between the two when dma_completed is called. This allows a second DMA to be set up while the first one is still running. Although the Linux DMA API allows this with one channel, the MDMA engine does not and ignores the added transactions.
Usually the pause is now below 1 ms, but there are spikes where the FIFO nearly overflows.
Finally, I tried to use dmaengine_prep_dma_cyclic. This would be perfect: a continuously running DMA with no need for setup time between interrupts.
But this does not work. Or rather: I cannot get it to work...
The transaction created with dmaengine_prep_dma_cyclic does not start!
I am getting a new dma_cookie, and any status request to the channel returns "DMA_IN_PROGRESS". It never completes, and the completion callback is never called either.
dmaengine_prep_dma_memcpy, however, works fine...
I think this is because of the difference between software-triggered and hardware-triggered DMA transactions.
Looking into stm32-mdma.c, I see that dmaengine_prep_dma_memcpy has its own setup routine, whereas dmaengine_prep_dma_cyclic uses stm32_mdma_set_xfer_param(), which always configures a HW request.
My very big questions:
Is there a way to use dmaengine_prep_dma_cyclic for a MEMORY to MEMORY DMA transaction (software triggered)? This would be the perfect solution to my performance problem...
Are we missing some signals to connect the FPGA to the SoC? My FPGA programming colleague suspects a missing TSEL (trigger selection) setting and expects dmaengine_prep_dma_cyclic to work once that is in place.
If a minimal driver module source code example would help in getting better answers, I can provide one at short notice. Please note that this is highly hardware specific. SoCs other than the STM32MP157F may behave differently.
Thanks for any feedback!
Bye Gunther
References:
https://wiki.st.com/stm32mpu/wiki/Dmaengine_overview
https://github.com/STMicroelectronics/linux/blob/v5.15-stm32mp/drivers/dma/stm32-mdma.c

Why do we have multiple Interrupt Handlers in a system rather than just one?

In the Kernel of an operating system, we have an Interrupt table that contains many interrupt handlers that handle interrupts from I/O devices and processes. But why can't we just have one interrupt handler? Are interrupt handlers any different from each other?
Another issue is that, with one interrupt-handler, it gets very messy to prioritize interrupts.
Usually, interrupts are disabled in hardware once an interrupt is acknowledged by the CPU that handles it, which prevents multiple, reentrant invocations of the same interrupt and the data/buffer overwrites that would likely ensue. It's also common for an interrupt handler to promptly re-enable interrupts of a higher priority, improving the response to those interrupts (they can then interrupt the lower-priority handler).
Using only one interrupt handler would make prioritizing interrupts extremely messy, if possible at all :(
Getting interrupt handlers and drivers to cooperate harmoniously is difficult enough as it is!
Are interrupt handlers any different from each other?
Well, yes. They may all be forced to conform to a set of rules/constraints by the OS design, but yes, they are generally different. A handler that manages an interrupt from a disk DMA controller will surely have different code than a keyboard input handler. They manage different hardware, to start with:)
If you have one interrupt handler, the decision for how the interrupt should be processed is made in code instead of in hardware.
And there are a LOT of things that can trigger an interrupt - so the code would almost certainly reduce overall performance.
In principle there is no reason why you could not have a single interrupt handler that gets called for all interrupts. Such a handler would have to check every single interrupt source. Since most of the time only a small fraction of the possible interrupt sources are active, many cycles would get wasted checking to see which interrupt was triggered. Since ISRs are generally very frequently called code, you would take a large (probably unacceptable) performance hit.
The way a specific interrupt controller handles interrupts can vary quite a bit. To get a very solid understanding you'd have to read the manuals for the interrupt controller implementations of a variety of different architectures.
However, some interrupt controllers do end up sharing a common ISR. The common ISR reads registers in the interrupt controller to determine which vector (basically the source of the interrupt) was triggered. The common ISR then calls another function (often referred to as an interrupt service routine as well) that handles that specific interrupt source, selected by the vector value. Then, depending on the implementation of the interrupt controller, when the vector-specific routine returns control to the common ISR, the common ISR deasserts the interrupt on the interrupt controller and returns execution to the interrupted place in the code. Thus, by reading the vector from a register in the interrupt controller, cycles are saved because the common ISR knows what caused the interrupt rather than having to examine every possible interrupt source.
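The common-ISR arrangement can be sketched as a table of function pointers indexed by the vector number. This is only an illustration under stated assumptions: `active_vector_reg` is a plain variable standing in for a controller register, and `irq_register`, `common_isr`, `uart_handler` and the table size are hypothetical names, not any particular controller's or OS's API.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS 8u

typedef void (*irq_handler_t)(void);

/* Stand-in for the controller's "active vector" register (hypothetical). */
static volatile uint32_t active_vector_reg;

/* One slot per interrupt source; NULL means no handler is registered. */
static irq_handler_t handlers[NUM_VECTORS];

/* Example source-specific handler, here just counting its invocations. */
static unsigned uart_hits;
static void uart_handler(void) { uart_hits++; }

/* Install a handler for one vector; returns -1 for an invalid vector. */
int irq_register(uint32_t vector, irq_handler_t fn)
{
    if (vector >= NUM_VECTORS) {
        return -1;
    }
    handlers[vector] = fn;
    return 0;
}

/* Common ISR: a single register read selects the handler, so no
 * interrupt source has to be polled. Returns -1 for a spurious or
 * unhandled vector. */
int common_isr(void)
{
    uint32_t vector = active_vector_reg;

    if (vector >= NUM_VECTORS || handlers[vector] == NULL) {
        return -1;
    }
    handlers[vector]();
    /* A real controller would be acknowledged (end of interrupt) here. */
    return 0;
}
```

The lookup cost is constant regardless of how many sources exist, which is the saving the answer describes compared with checking every source in turn.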

Should I use IOCPs or overlapped WSASend/Receive?

I am investigating the options for asynchronous socket I/O on Windows. There is obviously more than one option: I can use WSASend... with an overlapped structure providing either a completion callback or an event, or I could use IOCPs and the (new) thread pool. From what I usually read, the latter option is the recommended one.
However, it is not clear to me, why I should use IOCPs if the completion routine suffices for my goal: tell the socket to send this block of data and inform me if it is done.
I understand that the IOCP stuff in combination with CreateThreadpoolIo etc. uses the OS thread pool. However, the "normal" overlapped I/O must also use separate threads? So what is the difference/disadvantage? Is my callback called by an I/O thread and blocks other stuff?
Thanks in advance,
Christoph
You can use either but, for servers, IOCP with the 'completion queue' will have better performance, in general, because it can use multiple client<>server threads, either with CreateThreadpoolIo or some user-space thread pool. Obviously, in this case, dedicated handler threads are usual.
Overlapped completion-routine I/O is more useful for clients, IMHO. The completion-routine is fired by an Asynchronous Procedure Call that is queued to the thread that initiated the I/O request, (WSASend, WSARecv). This implies that that thread must be in a position to process the APC and typically this means a while(true) loop around some 'blahEx()' call. This can be useful because it's fairly easy to wait on a blocking queue, or other inter-thread signal, that allows the thread to be supplied with data to send and the completion routine is always handled by that thread. This I/O mechanism leaves the 'hEvent' OVL parameter free to use - ideal for passing a comms buffer object pointer into the completion routine.
Overlapped I/O using an actual synchro event/Semaphore/whatever for the overlapped hEvent parameter should be avoided.
Windows IOCP documentation recommends no more than one thread per available core per completion port. Hyperthreading doubles the number of logical cores. Since the use of IOCPs results in what is, for all practical purposes, an event-driven application, the use of thread pools adds unnecessary processing to the scheduler.
If you think about it you'll understand why: an event should be serviced in its entirety (or placed in some queue after initial processing) as quickly as possible. Suppose five events are queued to an IOCP on a 4-core computer. If there are eight threads associated with the IOCP, you run the risk of the scheduler interrupting one event to begin servicing another on a different thread, which is inefficient. It can be dangerous too if the interrupted thread was inside a critical section. With four threads you can process four events simultaneously, and as soon as one event has been completed you can start on the last remaining event in the IOCP queue.
Of course, you may have thread pools for non-IOCP related processing.
EDIT:
The socket (file handles work fine too) is associated with an IOCP. The completion routine waits on the IOCP. As soon as a requested read from or write to the socket completes, the OS - via the IOCP - releases the completion routine waiting on the IOCP and returns the additional information you provided when you called the read or write (I usually pass a pointer to a control block). So the completion routine immediately "knows" where to find the information pertinent to the completion.
If you passed information referring to a control block (or similar), then that control block (probably) needs to keep track of what operation has completed so it knows what to do next. The IOCP itself neither knows nor cares.
If you're writing a server attached to the internet, the server would issue a read to wait for client input. That input may arrive a millisecond or a week later, and when it does the IOCP will release the completion routine, which analyzes the input. Typically it responds with a write containing the data requested in the input and then waits on the IOCP. When the write completes, the IOCP again releases the completion routine, which sees that the write has completed, (typically) issues a new read, and a new cycle starts.
So an IOCP-based application typically consumes very little (or no) CPU until the moment a completion occurs at which time the completion routine goes full tilt until it has finished processing, sends a new I/O request and again waits on the completion port. Apart from the IOCP timeout (which can be used to signal house-keeping or such) all I/O-related stuff occurs in the OS.
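The read/write cycle tracked by the control block can be sketched as a small state machine. To stay self-contained, this deliberately calls no Windows API: `conn_ctx_t`, `conn_start` and `on_completion` are hypothetical names, and in a real program `on_completion` would be invoked by a thread that has just been released from waiting on the completion port (for example via GetQueuedCompletionStatus) with the control block pointer it received.

```c
#include <stddef.h>

typedef enum { OP_READ, OP_WRITE } io_op_t;

/* Per-connection control block; a pointer to it travels with each request. */
typedef struct {
    io_op_t  pending;   /* operation currently outstanding */
    size_t   len;       /* bytes of request data waiting to be answered */
    unsigned cycles;    /* completed read-then-write cycles */
} conn_ctx_t;

/* Start of the first cycle: the server issues a read and waits. */
void conn_start(conn_ctx_t *c)
{
    c->pending = OP_READ;
    c->len = 0;
    c->cycles = 0;
}

/* Called when the pending operation completes with nbytes transferred.
 * Updates the control block and returns the operation to issue next. */
io_op_t on_completion(conn_ctx_t *c, size_t nbytes)
{
    if (c->pending == OP_READ) {
        c->len = nbytes;        /* input arrived: respond with a write */
        c->pending = OP_WRITE;
    } else {
        c->cycles++;            /* response sent: wait for the next input */
        c->len = 0;
        c->pending = OP_READ;
    }
    return c->pending;
}
```

The completion port only hands back the pointer; all knowledge of "what comes next" lives in the control block, as described above.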
To further complicate (or simplify) things it is not necessary that sockets be serviced using the WSA routines, the Win32 functions ReadFile and WriteFile work just fine.

What are the differences between Clock and I/O interrupts?

What are the differences between clock and I/O interrupts?
As I understand it a clock interrupt uses the system clock for interrupting the CPU and an I/O interrupt is sent to the CPU based off of program input or output completion. This was helpful in understanding interrupts in general, but I'm trying to compare these two kinds.
edit:
In a multiprogramming context, using a uniprocessor (to make things simple)
Timer/clock interrupts are often used for scheduling. These interrupts invoke the scheduler and it may switch the currently executing thread/process to another by saving the current context and loading another one.
Other than the purpose, an interrupt is an interrupt.
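To illustrate the scheduling use, here is a rough simulation of what a clock-interrupt handler might do on a uniprocessor: save the interrupted task's context, pick the next ready task round-robin, and return the context to resume. The task table, the `on_clock_tick` name and the use of a single saved program counter as the whole "context" are simplifying assumptions, not how any specific kernel is written.

```c
#define MAX_TASKS 4

typedef enum { TASK_UNUSED, TASK_READY, TASK_BLOCKED } task_state_t;

typedef struct {
    task_state_t  state;
    unsigned long saved_pc;   /* stand-in for the full saved context */
} task_t;

static task_t tasks[MAX_TASKS];
static int    current_task;

/* Called on every clock interrupt with the interrupted program counter.
 * Saves the running task's context, selects the next READY task in
 * round-robin order, and returns the program counter to resume at.
 * If no other task is ready, the current task simply continues. */
unsigned long on_clock_tick(unsigned long interrupted_pc)
{
    tasks[current_task].saved_pc = interrupted_pc;

    for (int step = 1; step <= MAX_TASKS; step++) {
        int next = (current_task + step) % MAX_TASKS;
        if (tasks[next].state == TASK_READY) {
            current_task = next;
            break;
        }
    }
    return tasks[current_task].saved_pc;
}
```

An I/O interrupt handler would instead typically mark a blocked task as ready again, leaving the switch itself to the same scheduling code.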
The main purpose of the clock interrupt is to support what we call multitasking. It gives the impression that work is happening in parallel internally (that is, that many applications are running at the same time), but on a single processor it is not. After a specified fraction of a second, which depends on the system, the clock sends an interrupt to the processor; the processor then suspends its current thread, saves its address and data to the stack, and puts the interrupted application on hold.
I hope this helps.

Relationship between a kernel and a user thread

Is there a relationship between a kernel and a user thread?
Some operating system textbooks say that the system "maps one (many) user thread to one (many) kernel thread". What does "map" mean here?
When they say map, they mean that each kernel thread is assigned to a certain number of user mode threads.
Kernel threads are used to provide privileged services to applications (such as system calls). They are also used by the kernel to keep track of everything that is running on the system, how much of which resources are allocated to which process, and to schedule them.
If your applications make heavy use of system calls, then the more user threads there are per kernel thread, the slower your applications will run. This is because the kernel thread will become a bottleneck, since all system calls will pass through it.
On the flip side though, if your programs rarely use system calls (or other kernel services), you can assign a large number of user threads to a kernel thread without much performance penalty, other than overhead.
You can increase the number of kernel threads, but this adds overhead to the kernel in general, so while individual threads will be more responsive with respect to system calls, the system as a whole will become slower.
That is why it is important to find a good balance between the number of kernel threads and the number of user threads per kernel thread.
http://www.informit.com/articles/printerfriendly.aspx?p=25075
Implementing Threads in User Space
There are two main ways to implement a threads package: in user space and in the kernel. The choice is moderately controversial, and a hybrid implementation is also possible. We will now describe these methods, along with their advantages and disadvantages.
The first method is to put the threads package entirely in user space. The kernel knows nothing about them. As far as the kernel is concerned, it is managing ordinary, single-threaded processes. The first, and most obvious, advantage is that a user-level threads package can be implemented on an operating system that does not support threads. All operating systems used to fall into this category, and even now some still do.
All of these implementations have the same general structure, which is illustrated in Fig. 2-8(a). The threads run on top of a run-time system, which is a collection of procedures that manage threads. We have seen four of these already: thread_create, thread_exit, thread_wait, and thread_yield, but usually there are more.
When threads are managed in user space, each process needs its own private thread table to keep track of the threads in that process. This table is analogous to the kernel's process table, except that it keeps track only of the per-thread properties such as each thread's program counter, stack pointer, registers, state, etc. The thread table is managed by the run-time system. When a thread is moved to ready state or blocked state, the information needed to restart it is stored in the thread table, exactly the same way as the kernel stores information about processes in the process table.
When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers (i.e., its own) in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine has an instruction to store all the registers and another one to load them all, the entire thread switch can be done in a handful of instructions. Doing thread switching like this is at least an order of magnitude faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
However, there is one key difference with processes. When a thread is finished running for the moment, for example, when it calls thread_yield, the code of thread_yield can save the thread's information in the thread table itself. Furthermore, it can then call the thread scheduler to pick another thread to run. The procedure that saves the thread's state and the scheduler are just local procedures, so invoking them is much more efficient than making a kernel call. Among other issues, no trap is needed, no context switch is needed, the memory cache need not be flushed, and so on. This makes thread scheduling very fast.
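The thread table and the yield path described above can be sketched with the POSIX `<ucontext.h>` functions (`getcontext`, `makecontext`, `swapcontext`), assuming a platform such as glibc on Linux that still provides them. `thread_create` and `thread_yield` borrow the names used in the text, but their signatures, `run_all`, the fixed-size table and the two demo threads are assumptions made for this illustration.

```c
#include <stddef.h>
#include <ucontext.h>

#define MAX_THREADS 4
#define STACK_SIZE  (64 * 1024)

typedef enum { T_UNUSED, T_READY, T_FINISHED } thread_state_t;

/* One thread-table entry: saved context, state, entry point, own stack. */
typedef struct {
    ucontext_t     ctx;
    thread_state_t state;
    void         (*entry)(void);
    char           stack[STACK_SIZE];
} thread_slot_t;

static thread_slot_t thread_table[MAX_THREADS];
static ucontext_t    scheduler_ctx;   /* context of the run-time system */
static int           running = -1;    /* index of the running thread */

/* Runs the thread body, then marks the slot finished. Returning from
 * here resumes scheduler_ctx through uc_link. */
static void trampoline(void)
{
    thread_table[running].entry();
    thread_table[running].state = T_FINISHED;
}

/* Creates a thread in the first free slot; returns its index or -1. */
int thread_create(void (*entry)(void))
{
    for (int i = 0; i < MAX_THREADS; i++) {
        thread_slot_t *t = &thread_table[i];
        if (t->state != T_UNUSED) continue;
        if (getcontext(&t->ctx) != 0) return -1;
        t->ctx.uc_stack.ss_sp = t->stack;
        t->ctx.uc_stack.ss_size = sizeof t->stack;
        t->ctx.uc_link = &scheduler_ctx;
        t->entry = entry;
        t->state = T_READY;
        makecontext(&t->ctx, trampoline, 0);
        return i;
    }
    return -1;
}

/* Saves the caller's registers in its table entry and enters the scheduler. */
void thread_yield(void)
{
    swapcontext(&thread_table[running].ctx, &scheduler_ctx);
}

/* Round-robin scheduler: runs ready threads until all have finished. */
void run_all(void)
{
    int ran_one;
    do {
        ran_one = 0;
        for (int i = 0; i < MAX_THREADS; i++) {
            if (thread_table[i].state != T_READY) continue;
            ran_one = 1;
            running = i;
            swapcontext(&scheduler_ctx, &thread_table[i].ctx);
        }
    } while (ran_one);
    running = -1;
}

/* Demo threads: record a letter, yield, then record a second letter. */
static char   trace[8];
static size_t trace_len;
static void demo_a(void) { trace[trace_len++] = 'A'; thread_yield(); trace[trace_len++] = 'a'; }
static void demo_b(void) { trace[trace_len++] = 'B'; thread_yield(); trace[trace_len++] = 'b'; }
```

Creating `demo_a` and `demo_b` and calling `run_all()` interleaves them purely through local procedure calls. One caveat: glibc's `swapcontext` typically also saves and restores the signal mask, which involves a system call, so this sketch shows the structure of a user-level switch rather than its speed advantage.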
User-level threads also have other advantages. They allow each process to have its own customized scheduling algorithm. For some applications, for example, those with a garbage collector thread, not having to worry about a thread being stopped at an inconvenient moment is a plus. They also scale better, since kernel threads invariably require some table space and stack space in the kernel, which can be a problem if there are a very large number of threads.
Despite their better performance, user-level threads packages have some major problems. First among these is the problem of how blocking system calls are implemented. Suppose that a thread reads from the keyboard before any keys have been hit. Letting the thread actually make the system call is unacceptable, since this will stop all the threads. One of the main goals of having threads in the first place was to allow each one to use blocking calls, but to prevent one blocked thread from affecting the others. With blocking system calls, it is hard to see how this goal can be achieved readily.
The system calls could all be changed to be nonblocking (e.g., a read on the keyboard would just return 0 bytes if no characters were already buffered), but requiring changes to the operating system is unattractive. Besides, one of the arguments for user-level threads was precisely that they could run with existing operating systems. In addition, changing the semantics of read will require changes to many user programs.
Another alternative is possible in the event that it is possible to tell in advance if a call will block. In some versions of UNIX, a system call, select, exists, which allows the caller to tell whether a prospective read will block. When this call is present, the library procedure read can be replaced with a new one that first does a select call and then only does the read call if it is safe (i.e., will not block). If the read call will block, the call is not made. Instead, another thread is run. The next time the run-time system gets control, it can check again to see if the read is now safe. This approach requires rewriting parts of the system call library, is inefficient and inelegant, but there is little choice. The code placed around the system call to do the checking is called a jacket or wrapper.
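A jacket of this kind might look like the following sketch, which uses the POSIX `select` call with a zero timeout so that the check itself never blocks. The names `read_would_block`, `jacket_read` and the `JACKET_WOULD_BLOCK` return code are hypothetical; a real threads package would switch to another thread at the point where this sketch just returns that code.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define JACKET_WOULD_BLOCK (-2)

/* Returns 1 if a read on fd would block right now, 0 if it would not,
 * and -1 if select itself failed. fd must be below FD_SETSIZE. */
int read_would_block(int fd)
{
    fd_set readable;
    struct timeval no_wait = { 0, 0 };   /* poll only, never sleep */

    FD_ZERO(&readable);
    FD_SET(fd, &readable);

    int n = select(fd + 1, &readable, NULL, NULL, &no_wait);
    if (n < 0) {
        return -1;
    }
    return FD_ISSET(fd, &readable) ? 0 : 1;
}

/* Jacket around read: performs the read only when it is safe.
 * Returns JACKET_WOULD_BLOCK when the run-time system should run
 * another thread and retry later. */
ssize_t jacket_read(int fd, void *buf, size_t count)
{
    int blocked = read_would_block(fd);

    if (blocked < 0) {
        return -1;
    }
    if (blocked) {
        return JACKET_WOULD_BLOCK;
    }
    return read(fd, buf, count);
}
```

Every read now costs an extra system call for the check, which is the inefficiency the text points out.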
Somewhat analogous to the problem of blocking system calls is the problem of page faults. We will study these in Chap. 4. For the moment, it is sufficient to say that computers can be set up in such a way that not all of the program is in main memory at once. If the program calls or jumps to an instruction that is not in memory, a page fault occurs and the operating system will go and get the missing instruction (and its neighbors) from disk. This is called a page fault. The process is blocked while the necessary instruction is being located and read in. If a thread causes a page fault, the kernel, not even knowing about the existence of threads, naturally blocks the entire process until the disk I/O is complete, even though other threads might be runnable.
Another problem with user-level thread packages is that if a thread starts running, no other thread in that process will ever run unless the first thread voluntarily gives up the CPU. Within a single process, there are no clock interrupts, making it impossible to schedule processes round-robin fashion (taking turns). Unless a thread enters the run-time system of its own free will, the scheduler will never get a chance.
One possible solution to the problem of threads running forever is to have the run-time system request a clock signal (interrupt) once a second to give it control, but this, too, is crude and messy to program. Periodic clock interrupts at a higher frequency are not always possible, and even if they are, the total overhead may be substantial. Furthermore, a thread might also need a clock interrupt, interfering with the run-time system's use of the clock.
Another, and probably the most devastating argument against user-level threads, is that programmers generally want threads precisely in applications where the threads block often, as, for example, in a multithreaded Web server. These threads are constantly making system calls. Once a trap has occurred to the kernel to carry out the system call, it is hardly any more work for the kernel to switch threads if the old one has blocked, and having the kernel do this eliminates the need for constantly making select system calls that check to see if read system calls are safe. For applications that are essentially entirely CPU bound and rarely block, what is the point of having threads at all? No one would seriously propose computing the first n prime numbers or playing chess using threads because there is nothing to be gained by doing it that way.
User threads are managed in userspace - that means scheduling, switching, etc. are not from the kernel.
Since, ultimately, the OS kernel is responsible for context switching between "execution units" - your user threads must be associated (ie., "map") to a kernel schedulable object - a kernel thread†1.
So, given N user threads - you could use N kernel threads (a 1:1 map). That allows you to take advantage of the kernel's hardware multi-processing (running on multiple CPUs) and be a pretty simplistic library - basically just deferring most of the work to the kernel. It does, however, make your app portable between OS's as you're not directly calling the kernel thread functions. I believe that POSIX Threads (PThreads) is the preferred *nix implementation, and that it follows the 1:1 map (making it virtually equivalent to a kernel thread). That, however, is not guaranteed as it'd be implementation dependent (a main reason for using PThreads would be portability between kernels).
Or, you could use only 1 kernel thread. That'd allow you to run on non multitasking OS's, or be completely in charge of scheduling. Windows' User Mode Scheduling is an example of this N:1 map.
Or, you could map to an arbitrary number of kernel threads - a N:M map. Windows has Fibers, which would allow you to map N fibers to M kernel threads and cooperatively schedule them. A threadpool could also be an example of this - N workitems for M threads.
†1: A process has at least 1 kernel thread, which is the actual execution unit. Also, a kernel thread must be contained in a process. OS's must schedule the thread to run - not the process.
This is a question about thread library implementation.
In Linux, a thread (or task) can be running in user space or in kernel space. A process enters kernel space when it asks the kernel to do something through a system call (read, write or ioctl, for example).
There is also a so-called kernel-thread that runs always in kernel space and does not represent any user process.
According to Wikipedia and Oracle, user-level threads are actually in a layer mounted on the kernel threads; not that kernel threads execute alongside user-level threads but that, generally speaking, the only entities that are actually executed by the processor/OS are kernel threads.
For example, assume that we have a program with 2 user-level threads, both mapped to (i.e. assigned) the same kernel thread. Sometimes, the kernel thread runs the first user-level thread (and it is said that currently this kernel thread is mapped to the first user-level thread) and some other times the kernel thread runs the second user-level thread. So we say that we have two user-level threads mapped to the same kernel thread.
As a clarification:
The core of an OS is called its kernel, so the threads at the kernel level (i.e. the threads that the kernel knows of and manages) are called kernel threads, the calls to the OS core for services can be called kernel calls, and so on. The only definite relation between kernel things is that they are strongly related to the OS core, nothing more.