FreeRTOS blocking on multiple events/objects - queue

In the UDP/IP Stack solution example here, there is a proposed solution for blocking on a single event queue. What would be the go-to solution for protecting the data that the pointer points to until it has been handled by the task waiting on the queue?
Say, for example, that the queue is filled from an ISR. The ISR should not write to *pvData if it has not yet been processed by the appropriate task. But since there can be several event sources, the queue should probably be longer than one item. Should the struct be made:
typedef struct IP_TASK_COMMANDS
{
    eIPEvent_t eEventType;
    SemaphoreHandle_t data_read;
    void *pvData;
} xIPStackEvent_t;
With the semaphore taken in the ISR and given back by the task that processes the data when it is done with it?

If you take the UDP example - normally you would have a pool of buffers (or dynamically allocate a buffer) from which a buffer would be obtained and given to the DMA. When the DMA fills the buffer with received data a pointer to the buffer goes into the UDP stack - at which point only the UDP stack knows where the buffer is and is responsible for it. At some point the data in the buffer may get passed from the UDP stack to the application where it can be consumed. The application then returns the buffer to the pool (or frees the allocated buffer) so it is available to the DMA again. The reverse is also true - the application may allocate a buffer that is filled with data to be sent, via the UDP stack, to the Tx function where it is actually placed onto the wire - in which case it is the Tx end interrupt that returns the buffer to the pool.
So, in short, there is only one thing that has a reference to the buffer at a time, so there is no problem.
[Note: where it says above that the application allocates or frees a buffer, that would be done inside the UDP/IP stack API called by the application rather than by the application directly - this is, at least in part, how our own TCP/IP stack is implemented.]

You don't want your ISR to block and wait for the data buffer to become available. If it's appropriate for your ISR to just skip the update and move on when the buffer is not available then perhaps a semaphore makes sense. But the ISR should not block on the semaphore.
Here's an alternative to consider. Make a memory pool containing multiple appropriately sized data buffers. The ISR allocates the next available buffer from the pool, writes the data to it and puts the pointer to it on the queue. The task reads the pointer from the queue, uses the data, and then frees the buffer back to the pool. If the ISR runs again before the task uses the data, the ISR will be allocating a fresh buffer so it won't be overwriting the previous data.
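A rough sketch of that pattern in FreeRTOS C follows. It assumes the event struct from the question minus the semaphore member, an assumed eNetworkRxEvent event type, and hypothetical prvBufferAlloc()/prvBufferFree()/prvCopyReceivedDataTo()/prvProcessData() helpers; only the queue calls are the real FreeRTOS API.

#include "FreeRTOS.h"
#include "queue.h"

/* Assumed event type enum - only the Rx event matters for this sketch. */
typedef enum { eNetworkRxEvent } eIPEvent_t;

/* Event passed through the queue - the struct from the question, minus the semaphore. */
typedef struct IP_TASK_COMMANDS
{
    eIPEvent_t eEventType;
    void *pvData;                              /* Owned by whoever currently holds the pointer. */
} xIPStackEvent_t;

/* Hypothetical ISR-safe pool helpers - any fixed-block allocator would do. */
void *prvBufferAlloc( void );
void prvBufferFree( void *pvBuffer );
void prvCopyReceivedDataTo( void *pvBuffer );  /* Hypothetical: drain the peripheral. */
void prvProcessData( void *pvBuffer );         /* Hypothetical: application processing. */

static QueueHandle_t xEventQueue;              /* Created elsewhere with xQueueCreate(). */

void vRxInterruptHandler( void )               /* The ISR. */
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    xIPStackEvent_t xEvent;

    xEvent.pvData = prvBufferAlloc();          /* A fresh buffer - never the one still in flight. */

    if( xEvent.pvData != NULL )
    {
        prvCopyReceivedDataTo( xEvent.pvData );
        xEvent.eEventType = eNetworkRxEvent;

        if( xQueueSendFromISR( xEventQueue, &xEvent, &xHigherPriorityTaskWoken ) != pdPASS )
        {
            prvBufferFree( xEvent.pvData );    /* Queue full - give the buffer back. */
        }
    }

    portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}

void vProcessingTask( void *pvParameters )
{
    xIPStackEvent_t xEvent;

    for( ;; )
    {
        if( xQueueReceive( xEventQueue, &xEvent, portMAX_DELAY ) == pdPASS )
        {
            prvProcessData( xEvent.pvData );   /* Only this task references the buffer now. */
            prvBufferFree( xEvent.pvData );    /* Done - return it to the pool. */
        }
    }
}

Because the ISR always writes into a buffer it has just allocated, and the task frees a buffer only once it has finished with it, exactly one owner references each buffer at any time, so no per-buffer semaphore is needed.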
Here's another consideration. The FreeRTOS Queue passes items by copy. If the data buffer is relatively small then perhaps it makes sense to just pass the data structure rather than a pointer to the data structure. If you pass the data structure then the queue service will make a copy to provide to the task and the ISR is free to update its original buffer.
Now that I think about it, using the Queue service's copy feature may be simpler than creating your own separate memory pool. When you create the Queue and specify the queue length and item size, I believe the queue service creates a memory pool to be used by the queue.
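For a small payload, the pass-by-copy version might look something like this (a sketch; xSmallEvent_t and prvReadHardwareFifo() are assumed names, and eIPEvent_t/eNetworkRxEvent are as in the previous sketch):

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"

/* Small, fixed-size event passed by value - the queue stores a copy of it. */
typedef struct
{
    eIPEvent_t eEventType;
    uint8_t ucPayload[ 32 ];                   /* Assumed maximum payload size - keep it small. */
    size_t xLength;
} xSmallEvent_t;

size_t prvReadHardwareFifo( uint8_t *pucDest, size_t xMax );   /* Hypothetical driver helper. */

/* Created with xQueueCreate( 10, sizeof( xSmallEvent_t ) ) - that call reserves
   storage for ten complete copies inside the queue itself. */
static QueueHandle_t xEventQueue;

void vRxInterruptHandler( void )
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    xSmallEvent_t xEvent;

    xEvent.eEventType = eNetworkRxEvent;
    xEvent.xLength = prvReadHardwareFifo( xEvent.ucPayload, sizeof( xEvent.ucPayload ) );

    /* xQueueSendFromISR() copies xEvent into the queue's own storage, so this
       local (or a reused driver buffer) can be overwritten straight away. */
    xQueueSendFromISR( xEventQueue, &xEvent, &xHigherPriorityTaskWoken );
    portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}

The trade-off is that every send and receive copies the whole struct, so this only makes sense while the item stays small.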

Related

Can a mutex be used instead of a critical section when using stream buffers in FreeRTOS?

I am looking into using a stream buffer in FreeRTOS to transfer CAN frames from multiple tasks to an ISR, which puts them into a CAN transmit buffer as soon as it is ready. The manual here explains that a stream buffer should only be written to by one task/ISR and read by one task/ISR, and that if not, a critical section is required.
Can a mutex be used in place of a critical section for this scenario? Would it make more sense to use one?
First, if you are sending short discrete frames, you may want to consider a message buffer instead of a stream buffer.
Yes you could use a mutex.
If sending from multiple tasks, the main thing to consider is what happens when the stream buffer becomes full. If you were using a different FreeRTOS object (other than a message buffer, message buffers being built on top of stream buffers), then multiple tasks attempting to write to the same instance of an object that was full would all block on their attempt to write to the object, and would be automatically unblocked when space in the object became available - the highest-priority waiting task would be the first to be unblocked, no matter the order in which the tasks entered the Blocked state. However, with stream/message buffers you can only have one task blocked attempting to write to a full buffer - and if the buffer were protected by a mutex, all other tasks would instead block on the mutex. That could mean a low-priority task was blocked on the stream/message buffer while a higher-priority task was blocked on the mutex - a kind of priority inversion.
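For illustration, a mutex-guarded send might look roughly like this (a sketch only; xCanTxStreamBuffer, xCanTxMutex, CanFrame_t and xQueueCanFrame() are assumed names, while the stream buffer and semaphore calls are the standard FreeRTOS API):

#include <stdint.h>
#include "FreeRTOS.h"
#include "semphr.h"
#include "stream_buffer.h"

/* Assumed CAN frame representation - whatever fixed-size struct the driver uses. */
typedef struct
{
    uint32_t ulId;
    uint8_t ucDlc;
    uint8_t ucData[ 8 ];
} CanFrame_t;

static StreamBufferHandle_t xCanTxStreamBuffer;   /* Created with xStreamBufferCreate(). */
static SemaphoreHandle_t xCanTxMutex;             /* Created with xSemaphoreCreateMutex(). */

/* Called by any task that wants to queue a frame for transmission. */
BaseType_t xQueueCanFrame( const CanFrame_t *pxFrame )
{
    size_t xBytesSent = 0;

    if( xSemaphoreTake( xCanTxMutex, pdMS_TO_TICKS( 10 ) ) == pdTRUE )
    {
        /* Send with a block time of 0: don't sleep inside the mutex if the
           buffer is full, otherwise every other writer ends up blocked on the
           mutex instead - the inversion described above. */
        xBytesSent = xStreamBufferSend( xCanTxStreamBuffer, pxFrame, sizeof( CanFrame_t ), 0 );
        xSemaphoreGive( xCanTxMutex );
    }

    return ( xBytesSent == sizeof( CanFrame_t ) ) ? pdPASS : pdFAIL;
}

Keeping the send's block time at zero while the mutex is held avoids sleeping inside the lock; whether to retry, drop, or block elsewhere when the buffer is full is an application decision.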

Should I use IOCPs or overlapped WSASend/Receive?

I am investigating the options for asynchronous socket I/O on Windows. There is obviously more than one option: I can use WSASend... with an overlapped structure providing either a completion callback or an event, or I could use IOCPs and the (new) thread pool. From what I usually read, the latter option is the recommended one.
However, it is not clear to me, why I should use IOCPs if the completion routine suffices for my goal: tell the socket to send this block of data and inform me if it is done.
I understand that the IOCP stuff in combination with CreateThreadpoolIo etc. uses the OS thread pool. However, doesn't the "normal" overlapped I/O also have to use separate threads? So what is the difference/disadvantage? Is my callback called by an I/O thread, and does it block other stuff?
Thanks in advance,
Christoph
You can use either but, for servers, IOCP with the 'completion queue' will have better performance, in general, because it can use multiple client<>server threads, either with CreateThreadpoolIo or some user-space thread pool. Obviously, in this case, dedicated handler threads are usual.
Overlapped completion-routine I/O is more useful for clients, IMHO. The completion routine is fired by an Asynchronous Procedure Call that is queued to the thread that initiated the I/O request (WSASend, WSARecv). This implies that that thread must be in a position to process the APC, and typically this means a while(true) loop around some alertable 'blahEx()' call. This can be useful because it's fairly easy to wait on a blocking queue, or other inter-thread signal, that allows the thread to be supplied with data to send, and the completion routine is always handled by that thread. This I/O mechanism leaves the 'hEvent' OVL parameter free to use - ideal for passing a comms buffer object pointer into the completion routine.
Overlapped I/O using an actual synchro event/Semaphore/whatever for the overlapped hEvent parameter should be avoided.
Windows IOCP documentation recommends no more than one thread per available core per completion port. Hyperthreading doubles the number of logical cores. Since use of IOCPs results in an application that is, for all practical purposes, event-driven, the use of thread pools adds unnecessary processing to the scheduler.
If you think about it you'll understand why: an event should be serviced in its entirety (or placed in some queue after initial processing) as quickly as possible. Suppose five events are queued to an IOCP on a 4-core computer. If there are eight threads associated with the IOCP, you run the risk of the scheduler interrupting one event to begin servicing another on another thread, which is inefficient. It can be dangerous too if the interrupted thread was inside a critical section. With four threads you can process four events simultaneously, and as soon as one event has been completed you can start on the last remaining event in the IOCP queue.
Of course, you may have thread pools for non-IOCP related processing.
EDIT:
The socket (file handles work fine too) is associated with an IOCP. The completion routine waits on the IOCP. As soon as a requested read from or write to the socket completes, the OS - via the IOCP - releases the completion routine waiting on the IOCP and returns the additional information you provided when you called the read or write (I usually pass a pointer to a control block). So the completion routine immediately "knows" where to find the information pertinent to the completion.
If you passed information referring to a control block (or similar) then that control block (probably) needs to keep track of which operation has completed so it knows what to do next. The IOCP itself neither knows nor cares.
If you're writing a server attached to the internet, the server would issue a read to wait for client input. That input may arrive a millisecond or a week later, and when it does the IOCP will release the completion routine, which analyzes the input. Typically it responds with a write containing the data requested in the input and then waits on the IOCP. When the write completes, the IOCP again releases the completion routine, which sees that the write has completed, (typically) issues a new read, and a new cycle starts.
So an IOCP-based application typically consumes very little (or no) CPU until the moment a completion occurs, at which time the completion routine goes full tilt until it has finished processing, sends a new I/O request, and again waits on the completion port. Apart from the IOCP timeout (which can be used to signal housekeeping or such), all I/O-related waiting occurs in the OS.
To further complicate (or simplify) things, it is not necessary that sockets be serviced using the WSA routines; the Win32 functions ReadFile and WriteFile work just fine.
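To make the control-block idea concrete, here is a rough sketch of an IOCP worker loop (PER_IO_CONTEXT, its fields and HandleCompletion() are assumptions; GetQueuedCompletionStatus() and the other calls are the real Win32/Winsock API):

#include <winsock2.h>
#include <windows.h>
#include <stdlib.h>

/* Per-operation control block. OVERLAPPED is the first member so the pointer
   handed back by GetQueuedCompletionStatus() can be cast straight back to it. */
typedef struct PER_IO_CONTEXT
{
    OVERLAPPED ov;
    SOCKET s;
    WSABUF wsaBuf;
    char buffer[ 4096 ];
    int op;                                  /* Assumed bookkeeping: OP_READ / OP_WRITE. */
} PER_IO_CONTEXT;

/* Hypothetical application logic: parse ctx->buffer, then issue the next
   WSARecv()/WSASend() reusing the same (re-initialised) context. */
void HandleCompletion( PER_IO_CONTEXT *ctx, DWORD bytes );

DWORD WINAPI IocpWorker( LPVOID lpParam )
{
    HANDLE hIocp = (HANDLE)lpParam;          /* Port created with CreateIoCompletionPort(). */
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED *pOv;

    for( ;; )
    {
        /* Sleeps here, consuming no CPU, until some read or write completes. */
        BOOL ok = GetQueuedCompletionStatus( hIocp, &bytes, &key, &pOv, INFINITE );

        if( pOv == NULL )
        {
            break;                           /* Port closed or the wait itself failed. */
        }

        PER_IO_CONTEXT *ctx = (PER_IO_CONTEXT *)pOv;

        if( !ok || bytes == 0 )
        {
            closesocket( ctx->s );           /* I/O error or the peer closed the connection. */
            free( ctx );
            continue;
        }

        HandleCompletion( ctx, bytes );
    }

    return 0;
}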

Why is epoll faster than select?

I have seen a lot of comparisons which say that select has to walk through the fd list, and that this is slow. But why doesn't epoll have to do this?
There's a lot of misinformation about this, but the real reason is this:
A typical server might be dealing with, say, 200 connections. It will service every connection that needs to have data written or read and then it will need to wait until there's more work to do. While it's waiting, it needs to be interrupted if data is received on any of those 200 connections.
With select, the kernel has to add the process to 200 wait lists, one for each connection. To do this, it needs a "thunk" to attach the process to the wait list. When the process finally does wake up, it needs to be removed from all 200 wait lists and all those thunks need to be freed.
By contrast, with epoll, the epoll socket itself has a wait list. The process needs to be put on only that one wait list using only one thunk. When the process wakes up, it needs to be removed from only one wait list and only one thunk needs to be freed.
To be clear, with epoll, the epoll socket itself has to be attached to each of those 200 connections. But this is done once, for each connection, when it is accepted in the first place. And this is torn down once, for each connection, when it is removed. By contrast, each call to select that blocks must add the process to every wait queue for every socket being monitored.
Ironically, with select, the largest cost comes from checking if sockets that have had no activity have had any activity. With epoll, there is no need to check sockets that have had no activity because if they did have activity, they would have informed the epoll socket when that activity happened. In a sense, select polls each socket each time you call select to see if there's any activity while epoll rigs it so that the socket activity itself notifies the process.
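To make the per-call cost concrete, a select()-based loop looks roughly like this (a sketch; conns, nconns and handle_input() are assumed):

#include <sys/select.h>

void handle_input(int fd);   /* Hypothetical per-connection handler. */

/* conns[] holds the server's currently open sockets. */
void serve_with_select(const int *conns, int nconns)
{
    for (;;) {
        fd_set readfds;
        int maxfd = -1;

        /* The interest set is rebuilt and handed to the kernel on every call... */
        FD_ZERO(&readfds);
        for (int i = 0; i < nconns; i++) {
            FD_SET(conns[i], &readfds);
            if (conns[i] > maxfd)
                maxfd = conns[i];
        }

        /* ...and inside select() the process is added to, then removed from,
           the wait queue of every one of those sockets. */
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            break;

        /* Even discovering which sockets are ready is a scan over all of them. */
        for (int i = 0; i < nconns; i++)
            if (FD_ISSET(conns[i], &readfds))
                handle_input(conns[i]);
    }
}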
The main difference between epoll and select is that in select() the list of file descriptors to wait on only exists for the duration of a single select() call, and the calling task only stays on the sockets' wait queues for the duration of a single call. In epoll, on the other hand, you create a single file descriptor that aggregates events from multiple other file descriptors you want to wait on, and so the list of monitored fd's is long-lasting, and tasks stay on socket wait queues across multiple system calls. Furthermore, since an epoll fd can be shared across multiple tasks, it is no longer a single task on the wait queue, but a structure that itself contains another wait queue, containing all processes currently waiting on the epoll fd. (In terms of implementation, this is abstracted over by the sockets' wait queues holding a function pointer and a void* data pointer to pass to that function).
So, to explain the mechanics a little more:
An epoll file descriptor has a private struct eventpoll that keeps track of which fd's are attached to this fd. struct eventpoll also has a wait queue that keeps track of all processes that are currently epoll_waiting on this fd. struct eventpoll also has a list of all file descriptors that are currently available for reading or writing.
When you add a file descriptor to an epoll fd using epoll_ctl(), epoll adds a callback entry (pointing back to the struct eventpoll) to that fd's wait queue. It also checks whether the fd is currently ready for processing and, if so, adds it to the ready list.
When you wait on an epoll fd using epoll_wait, the kernel first checks the ready list and returns immediately if any file descriptors are already ready. If not, it adds the calling task to the single wait queue inside struct eventpoll and puts it to sleep.
When an event occurs on a socket that is being epoll()ed, it calls the epoll callback, which adds the file descriptor to the ready list, and also wakes up any waiters that are currently waiting on that struct eventpoll.
Obviously, a lot of careful locking is needed on struct eventpoll and the various lists and wait queues, but that's an implementation detail.
The important thing to note is that at no point above did I describe a step that loops over all file descriptors of interest. By being entirely event-based and by using a long-lasting set of fd's and a ready list, epoll can avoid ever taking O(n) time for an operation, where n is the number of file descriptors being monitored.
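For comparison, the epoll usage pattern described above looks roughly like this (a sketch; handle_input() and the serve/watch function names are assumed):

#include <sys/epoll.h>

void handle_input(int fd);   /* Hypothetical per-connection handler. */

/* Called once per connection, when it is accepted: joins the fd's wait queue once. */
void watch_connection(int epfd, int connfd)
{
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = connfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
}

/* epfd is created once with epoll_create1(0). */
void serve_with_epoll(int epfd)
{
    struct epoll_event events[64];

    for (;;) {
        /* Sleeps on the single eventpoll wait queue; the kernel returns only
           the ready fds, so nothing here scales with the number of monitored
           connections. */
        int n = epoll_wait(epfd, events, 64, -1);
        if (n < 0)
            break;

        for (int i = 0; i < n; i++)
            handle_input(events[i].data.fd);
    }
}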

Manipulating shared data using mutex and semaphores

I wanted someone to resolve my confusion on this topic. It may sound simple, but I am really confused.
In the producer/consumer problem, I used the 4-semaphore solution. I used a different lock for each of the critical sections.
Say,
Pseudo code of producer:
wait(slot) // counting sem
wait(mutex1) // binary sem
rear <-- rear + 1
buffer[rear] <-- item
signal (mutex1)
signal(items)
Where I use "mutex2" as a second mutex for my consumer, just as I use "mutex1" in the producer.
Now, my question is: if my producer and consumer are not using a buffer (rear and front) but a stack, where they only manipulate [top], do I need one mutex or two different locks, as in my 4-semaphore solution, in order to ensure mutual exclusion?
Pseudo code of consumer with stack:
wait (message)
wait (mutex)
getspace <-- stack[top]
top <-- top - 1
signal (mutex)
signal (slot)
Personally, I think I need one lock for both procedures, so that the producer and consumer never access the top concurrently. But I am not sure about that.
Thank you.
I'm not 100% sure that I follow your pseudo-code but I'll do my best to explain how to use semaphores to manage a stack from within the Producer-consumer process.
When you have a stack that is being accessed across multiple threads, you will need to lock it when the data is being accessed or, more specifically, when it is being pushed and popped. (This is always an underlying assumption of the Producer-consumer problem.)
We start off by defining a mutex that we will use to lock the stack.
Global Declaration of Process Semaphores
stackAccessMutex = semaphore(1) # The "(1)" is the count
# initializer for the semaphore.
Next, we will need to lock it when we are adding or removing data from it in our Consumer and Producer threads.
Producer thread
dataPushBuff #Buffer containing data to be pushed to the stack.
…dataPushBuff is assigned…
stackAccessMutex.wait()
stack.push(dataPushBuff)
stackAccessMutex.signal()
Consumer thread
dataRecvBuff = nil # Defining a variable to store the pushed
# content, accessible from only within
# the Consumer thread.
stackAccessMutex.wait()
dataRecvBuff = stack.pop()
stackAccessMutex.signal()
…Consume dataRecvBuff as needed since it's removed from the stack…
So far, everything is pretty straightforward. The Producer will lock the stack only when it needs to. The same is true for the Consumer. We shouldn't need another semaphore then, should we? Wrong!
The above scenario makes one fatal assumption - that the stack will always contain data before it is popped. If the consumer thread executes before the producer thread gets a chance to push any data, your consumer thread will generate an error because stack.pop() will not return anything! To fix this, we need to signal the consumer when data is available in the stack.
First, we need to define a semaphore that can be used to signal whether data in the stack exists or not.
Global Declaration of Process Semaphores, Version #2
stackAccessMutex = semaphore(1)
itemsInStack = semaphore(0)
We initialize our itemsInStack to the number of items in our stack, which is 0 (see note 1 below).
Next, we need to implement our new semaphore into our Producer and Consumer threads. First, we need to have the Producer signal that an item has been added. Let's update the Producer now.
Producer thread, Version #2
dataPushBuff
…dataPushBuff is assigned…
stackAccessMutex.wait()
stack.push(dataPushBuff)
stackAccessMutex.signal()
itemsInStack.signal() #Signal the Consumer, we have data in the stack!
#Note, this call can be placed within the
#stackAccessMutex locking block, but it doesn't
#have to be there. As a matter of convention, any
#code that can be executed outside of a lock,
#should be executed outside of the lock.
Now that we can check to see if there is data in the stack via a semaphore, let's re-write our Consumer thread.
Consumer thread, Version #2
dataRecvBuff = nil # Defining a variable to store the pushed
# content, accessible from only within
# the Consumer thread.
itemsInStack.wait()
stackAccessMutex.wait()
dataRecvBuff = stack.pop()
stackAccessMutex.signal()
…Consume dataRecvBuff as needed since it's removed from the stack…
… and that's it. As you can see, there are two semaphores, and both are mandatory (see note 2 below): we need to lock our stack when it is accessed, and we need to signal our consumer when data is available and block it when there is nothing in the stack.
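For concreteness, here is roughly how the same two-semaphore scheme looks in C with POSIX semaphores (a sketch; the fixed-size array stack and the function names are just for illustration):

#include <semaphore.h>

#define STACK_MAX 128

static int stack[STACK_MAX];
static int top = -1;                 /* -1 means the stack is empty. */

static sem_t stackAccessMutex;       /* sem_init(&stackAccessMutex, 0, 1) - matches semaphore(1). */
static sem_t itemsInStack;           /* sem_init(&itemsInStack, 0, 0)     - matches semaphore(0). */

/* Producer: push one item.  (A bounded array would also need a "slots free"
   semaphore to stop the producer overflowing it - omitted here to mirror the
   two-semaphore discussion above.) */
void produce(int item)
{
    sem_wait(&stackAccessMutex);
    stack[++top] = item;             /* Critical section: top and stack[] are touched. */
    sem_post(&stackAccessMutex);

    sem_post(&itemsInStack);         /* Signal the Consumer: one more item is available. */
}

/* Consumer: pop one item, sleeping until one exists. */
int consume(void)
{
    int item;

    sem_wait(&itemsInStack);         /* Blocks here while the stack is empty. */

    sem_wait(&stackAccessMutex);
    item = stack[top--];             /* Critical section. */
    sem_post(&stackAccessMutex);

    return item;
}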
Hope that answered your question. I'll update my response if you have any specific questions.
Note 1: Theoretically, when the process starts, you could pre-initialize your stack with data. In that case, you should initialize your itemsInStack semaphore with a value equal to the stack count. In this example, however, we are assuming that there is no data in the stack, nor any to initialize.
Note 2: It is worth mentioning that under one specific circumstance you could theoretically get away with using just the stackAccessMutex. Consider the case where the stack always contains data. If the stack were infinite, we would not need to signal our Consumer that data has been added because there would always be data. However, in reality an "infinite stack" doesn't exist. Even if that happens to be the case within your current context, there's no overhead in adding the safety net of the itemsInStack semaphore.
Also, it may be tempting to throw out the itemsInStack counting semaphore if, under your current circumstances, a call to stack.pop() would not cause any error were it to return no data from an empty stack.
This is plausible, but not recommended. Assuming the Consumer thread executes the consumption code in a loop, the loop will spin continuously while there is no data to consume. By using the itemsInStack semaphore, you are pausing the thread until data arrives, which should save a few CPU cycles.

C# socket receive buffer size cost

I am receiving some data over a socket (with some start and end character). I can use a byte-receiving mechanism that receives one byte at a time, adds it to some queue-like structure, and receives the next byte until the ending character is found. Or I can make a chunk receiver and find an ending character to terminate my message...
My question is: what is the cost of increasing or decreasing the buffer size? In my perception, decreasing the buffer size should increase memory I/O, but does increasing the buffer guarantee that I'll be increasing I/O performance as well?
Never re-size a buffer in a socket application. It might not matter for a socket application where there aren't that many simultaneous operations. But it's a bad habit that's easy to get used to.
Handling a buffer larger than the actual data isn't that hard to work with. Just check all the Stream methods: they have offset and count parameters which tell you where you should start processing and how many bytes you can process. Same thing here.
And to answer your question: the cost is that .NET needs to allocate a new memory "slot" and that the memory gets more fragmented with each request.
Simply allocate a 15kb buffer directly when the socket is connected. Create a buffer pool if you can handle multiple (asynchronous) receives per connection.