An IOCP documentation interpretation question - buffer ownership ambiguity - winsock

Since I'm not a native English speaker I might be missing something, so maybe someone here knows better than me.
Taken from WSASend's documentation at MSDN:
lpBuffers [in]
A pointer to an array of WSABUF structures. Each WSABUF structure contains a pointer to a buffer and the length, in bytes, of the buffer. For a Winsock application, once the WSASend function is called, the system owns these buffers and the application may not access them. This array must remain valid for the duration of the send operation.
OK, see the sentence "once the WSASend function is called, the system owns these buffers and the application may not access them" (bold in the original)? That's the unclear spot!
I can think of two interpretations of this line (it might be something else entirely; you name it):
Interpretation 1 - "buffers" refers to the OVERLAPPED structure that I pass to this function when calling it. I may reuse that object only after getting a completion notification for it.
Interpretation 2 - "buffers" refers to the actual buffers, the ones holding the data I'm sending. If the WSABUF object points to one buffer, then I cannot touch that buffer until the operation is complete.
Can anyone tell me which is the right interpretation of that line?
And..... if the answer is the second one - how would you resolve it?
Because to me it implies that for each and every buffer I'm sending I must retain a copy at the sender side - thus having MANY "pending" buffers (of different sizes) in a high-traffic application, which is really going to hurt scalability.
Statement 1:
In addition to the above paragraph (the "And...."), I thought that IOCP copies the data to be sent into its own buffer and sends from there, unless you set SO_SNDBUF to zero.
Statement 2:
I use stack-allocated buffers (you know, something like char cBuff[1024]; in the function body). If the answer to the main question is the second option (i.e. the buffers must stay as they are until the send is complete), then... that really screws things up big-time! Can you think of a way to resolve it? (I know, I asked it in other words above.)

The answer is that the overlapped structure and the data buffer itself cannot be reused or released until the completion for the operation occurs.
This is because the operation completes asynchronously, so even though the data is eventually copied into operating-system-owned buffers in the TCP/IP stack, that may not occur until some time in the future, and the write completion is what notifies you when it has. Note that write completions may be delayed for a surprising amount of time if you're sending without explicit flow control and relying on the TCP stack to do flow control for you (see here: some OVERLAPS using WSASend not returning in a timely manner using GetQueuedCompletionStatus?) ...
You can't use stack allocated buffers unless you place an event in the overlapped structure and block on it until the async operation completes; there's not a lot of point in doing that as you add complexity over a normal blocking call and you don't gain a great deal by issuing the call async and then waiting on it.
In my IOCP server framework (which you can get for free from here) I use dynamically allocated buffers which include the OVERLAPPED structure and which are reference counted. This means that the cleanup (in my case they're returned to a pool for reuse) happens when the completion occurs and the reference is released. It also means that you can choose to continue to use the buffer after the operation and the cleanup is still simple.
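For illustration, here's a minimal sketch of that pattern (not the author's framework; all names here are hypothetical): the OVERLAPPED, the WSABUF and the data live in one reference-counted heap allocation that must stay alive until the completion is dequeued.

// Extended-OVERLAPPED sketch: buffer + OVERLAPPED in one allocation.
#include <winsock2.h>
#include <atomic>
#include <cstring>

struct IoBuffer {
    OVERLAPPED overlapped;        // first member, so we can cast back from LPOVERLAPPED
    WSABUF wsabuf;
    std::atomic<long> refCount;
    char data[4096];
};

IoBuffer* AllocBuffer() {
    IoBuffer* buf = new IoBuffer{};
    buf->refCount = 1;
    buf->wsabuf.buf = buf->data;
    return buf;
}

void Release(IoBuffer* buf) {
    if (--buf->refCount == 0)
        delete buf;               // or return it to a pool for reuse
}

// Issue an async send; the IoBuffer must not be freed until the completion arrives.
// Assumes len <= sizeof buf->data.
bool SendAsync(SOCKET s, IoBuffer* buf, const char* payload, ULONG len) {
    memcpy(buf->data, payload, len);
    buf->wsabuf.len = len;
    DWORD sent = 0;
    int rc = WSASend(s, &buf->wsabuf, 1, &sent, 0, &buf->overlapped, nullptr);
    return rc == 0 || WSAGetLastError() == WSA_IO_PENDING;
}

// In the completion loop, GetQueuedCompletionStatus hands back the LPOVERLAPPED,
// which is the address of the IoBuffer; only now is it safe to release it.
void OnCompletion(LPOVERLAPPED ov) {
    Release(reinterpret_cast<IoBuffer*>(ov));
}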
See also here: I/O Completion Port, How to free Per Socket Context and Per I/O Context?

Why would one need to use the `MSG_WAITALL` flag instead of the `0` flag? Why use it with UDP?

At some point when coding sockets one will face the receive family of functions (recv, recvfrom, recvmsg).
These functions accept a flags argument, and I see that MSG_WAITALL is used in many examples on the web, such as this example on UDP.
Here is the definition of the MSG_WAITALL flag:
MSG_WAITALL (since Linux 2.2)
This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned. This flag has no effect for datagram sockets.
Hence, my two questions:
Why would one need to use the MSG_WAITALL flag instead of the 0 flag? (Could someone explain a scenario where its use would be the solution to a problem?)
Why use it with UDP?
As the quoted man page mentions, MSG_WAITALL has no effect on UDP sockets, so there's no reason to use it there. Examples that do use it are probably confused and/or the result of several generations of cargo-cult/copy-and-paste programming. :)
For TCP, OTOH, the default behavior of recv() is to block until at least one byte of data can be copied into the user's buffer from the socket's incoming-data buffer. The TCP stack will try to provide as many bytes of data as it can, of course, but in a case where the socket's incoming-data buffer contains fewer bytes of data than the user has passed in to recv(), the TCP stack will copy as many bytes as it can, and return a byte count indicating how many bytes it actually provided.
However, some people would prefer to have their recv() call keep blocking until all of the bytes in their passed-in array have been filled in, regardless of how long that might take. For those people, the MSG_WAITALL flag provides a simple way to obtain that behavior. (The flag is not strictly necessary, since the programmer could always emulate it by writing a while() loop that calls recv() multiple times as necessary, until all the bytes in the buffer have been populated... but it's provided as a convenience nonetheless.)
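To make that concrete, here's a minimal sketch of such a loop (names are made up; error handling kept deliberately simple):

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

// Keep calling recv() until 'len' bytes have arrived, the peer closes,
// or a real error occurs - i.e. a hand-rolled MSG_WAITALL.
ssize_t recv_all(int fd, char *buf, size_t len) {
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0);
        if (n == 0) return got;            // peer closed the connection
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            return -1;                     // real error
        }
        got += n;
    }
    return got;
}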

What's the read logic when I call recvfrom() function in C/C++

I wrote a C++ program to create a socket and bind to it to receive ICMP/UDP packets. The code I wrote is as follows:
while (true) {
    int recv_size = recvfrom(sockId, rePack, sizeof(rePack), 0,
                             (struct sockaddr *)&raddr, (socklen_t *)&len);
    processPacket(recv_size);
}
So I used an endless while loop to receive messages continually, but I'm worried about the following two questions:
1. How long will a message be kept in the receive queue (or, say, the NIC queue)?
I'm worried that if it takes too long to process the first message, I might miss the second one. So how quickly must one read follow another?
2. How do I prevent reading duplicated messages?
I.e., does the receive queue know about me? When my thread is done reading the first message, will the queue automatically give me the second one? Or, say, when I read the first message, is it then deleted from the queue so that no one can receive it again?
Additionally, I think the while(true) pattern is not good; could anyone give me a good suggestion? (I've heard of something like a polling model.)
First, you should always check the return value from recvfrom. It's unlikely the recvfrom will fail, but if it does (for example, if you later implement signal handling, it might fail with EINTR) you will be processing undefined data. Also, of course, the return value tells you the size of the packet you received.
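As a minimal sketch (reusing the question's names, which are assumptions here), the loop with the return value checked might look like this:

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

extern void processPacket(ssize_t size);   // the question's handler

void receive_loop(int sockId, char *rePack, size_t packLen,
                  struct sockaddr_in *raddr, socklen_t *len) {
    for (;;) {
        ssize_t recv_size = recvfrom(sockId, rePack, packLen, 0,
                                     (struct sockaddr *)raddr, len);
        if (recv_size < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            perror("recvfrom");            // real error: report and stop
            break;
        }
        processPacket(recv_size);          // recv_size = bytes in this datagram
    }
}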
For question 1, the actual answer is operating system-dependent. However, most operating systems will buffer some number of packets for you. The OS interrupt handler that handles the incoming packet will never be copying it directly into your application level buffer, so it will always go into an OS buffer first. The OS has previously noted your interest in it (by virtue of creating the socket and binding it you expressed interest), so it will then place a pointer to the buffer onto a queue associated with your socket.
A different part of the OS code will then (after the interrupt handler has completed) copy the data from the OS buffer into your application memory, free the OS buffer, and return to your program from the recvfrom system call. If additional packets come in, either before or after you have started processing the first one, they'll be placed on the queue too.
That queue is not infinite, of course. It's likely that you can configure how many packets (or how much buffer space) can be reserved, either at a system-wide level (think sysctl-type settings in Linux), or at the individual socket level (setsockopt / ioctl).
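For example, at the socket level a larger receive buffer can be requested like this (a sketch; the kernel may round or cap the value):

#include <sys/socket.h>

// Ask the OS to reserve 'bytes' of receive-buffer space for this socket.
int set_rcvbuf(int sockId, int bytes) {
    return setsockopt(sockId, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes);
}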
If, when you call recvfrom, there are already queued packets on the socket, the system call handler will not block your process, instead it will simply copy from the OS buffer of the next queued packet into your buffer, release the OS buffer, and return immediately. As long as you can process incoming packets roughly as fast as they arrive or faster, you should not lose any. (However, note that if another system is generating packets at a very high rate, it's likely that the OS memory reserved will be exhausted at some point, after which the OS will simply discard packets that exceed its resource reservation.)
For question 2, you will receive no duplicate messages (unless something upstream of your machine is actually duplicating them). Once a queued message is copied into your buffer, it's released before returning to you. That message is gone forever.
(Note that it's possible that some other process has also created a socket expressing interest in the same packets. That process would also get a copy of the packet data, which is typically handled internal to the operating system by reference counting rather than by actually duplicating the OS buffers, although that detail is invisible to applications. In any case, once all interested processes have received the packet, it will be discarded.)
There's really nothing at all wrong with a while (true) loop; it's a very common control structure for long-running server-type programs. If your program has nothing else it needs to be doing in the meantime, letting while (true) block in recvfrom is the simplest and hence clearest way to implement it.
(You could use a select(2) or poll(2) call to wait. This allows you to handle waiting for any one of multiple file descriptors at the same time, or to periodically "time out" and go do something else, say, but again if you have nothing else you might need to be doing in the meantime, that is introducing needless complication.)
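If you do want the periodic-timeout variant, a minimal poll(2) sketch looks like this:

#include <poll.h>

// Returns 1 if sockId becomes readable within timeout_ms milliseconds, else 0;
// on timeout the caller can do housekeeping and then call this again.
int wait_readable(int sockId, int timeout_ms) {
    struct pollfd pfd;
    pfd.fd = sockId;
    pfd.events = POLLIN;
    pfd.revents = 0;
    int rc = poll(&pfd, 1, timeout_ms);
    return rc > 0 && (pfd.revents & POLLIN);
}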

Can we edit callback function HAL_UART_TxCpltCallback for our convenience?

I am a newbie to both FreeRTOS and STM32. I want to know exactly how the callback function HAL_UART_TxCpltCallback for HAL_UART_Transmit_IT works.
Can we edit that callback function for our convenience?
Thanks in advance.
You call HAL_UART_Transmit_IT to transmit your data in "interrupt" (non-blocking) mode. This call returns immediately, likely well before your data gets fully transmitted.
The sequence of events is as follows:
HAL_UART_Transmit_IT stores a pointer to, and the length of, the data buffer you provide. It doesn't perform a copy, so the buffer you passed needs to remain valid until the callback gets called. For example, it cannot be a buffer you'll delete [] / free before the callbacks happen, or a buffer that's local to a function you're going to return from before a callback call.
It then enables the TXE interrupt for this UART, which fires every time the DR (or TDR, depending on the STM32 in use) register is empty and can have new data written to it.
At this point the interrupt happens immediately. In the IRQ handler (HAL_UART_IRQHandler) a new byte is put into the DR (TDR) register, which then gets transmitted - this happens in UART_Transmit_IT.
Once this byte gets transmitted, the TXE interrupt gets triggered again, and this process repeats until the end of the buffer you've provided is reached.
If any error happens, HAL_UART_ErrorCallback will get called from the IRQ handler.
If no errors happened and the end of the buffer has been reached, HAL_UART_TxCpltCallback is called (from HAL_UART_IRQHandler -> UART_EndTransmit_IT).
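Put together, a minimal sketch of the whole flow (assuming a CubeMX-generated handle named huart2; the buffer is static precisely so it stays valid until the callback fires) looks like this:

#include "stm32f4xx_hal.h"   // pick the header matching your STM32 family

extern UART_HandleTypeDef huart2;

static uint8_t msg[] = "hello\r\n";
static volatile uint8_t tx_done = 0;

void start_send(void) {
    tx_done = 0;
    HAL_UART_Transmit_IT(&huart2, msg, sizeof msg - 1); // returns immediately
}

// HAL declares this callback __weak; defining it here overrides the empty
// default, so there's no need to edit the library itself.
void HAL_UART_TxCpltCallback(UART_HandleTypeDef *huart) {
    if (huart == &huart2)
        tx_done = 1;   // whole buffer shifted out; msg may be reused now
}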
On to your second question, whether you can edit this callback "for convenience" - I'd say you can do whatever you want, but you'll have to live with the consequences of modifying what is essentially library code:
Upgrading HAL to newer versions is going to be a nightmare. You'll have to manually re-apply all the changes you've made to that code and test them again. To some extent this can be automated with some form of version control (git / svn) or even patch files, but if the code you've modified gets changed by ST, those patches will likely no longer apply and you'll have to do it all by hand again. This may require re-discovering how the implementation changed and redoing all your work from scratch.
Nobody is going to be able to help you as your library code no longer matches code that everyone else has. If you introduced new bugs by modifying library code, no one will be able to reproduce them. Even if you provided your modifications, I honestly doubt many here will bother to apply your changes and test them in practice.
If I were to express my personal opinion, it'd be this: if you think there are bugs in the HAL code, fix them locally and report them to ST. Once they're fixed in a future update, fully overwrite your HAL modifications with the updated official release. If you think the HAL code lacks functionality or flexibility for your needs, you have two options:
Suggest your changes to ST. You have to keep in mind that HAL aims to serve "general purpose" needs.
Just don't use HAL for this specific peripheral. This "mixed" approach is exactly what I do personally. In some cases the functionality provided by HAL for a given peripheral is "good enough" to serve my needs (in my case one example is SPI, where I fully rely on HAL), while in some other cases - such as UART - I use HAL only for initialization, while handling transmission myself. Even when you decide not to use the HAL functions, HAL can still provide some value - you can, for example, copy its IRQ handler into your code and call your own functions instead. That way you at least skip some parts of the development.
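As a minimal sketch of that mixed approach (CMSIS register names, assuming USART2 on a family with ISR/TDR registers such as L4/F7, and assuming HAL has already done the clock, pin and NVIC initialization):

#include "stm32l4xx.h"   // pick the device header for your part

static const uint8_t *tx_buf;
static volatile uint16_t tx_len;

// Start a transmission: remember the buffer and enable the TXE interrupt,
// which fires immediately because the transmit register starts out empty.
void uart_send(const uint8_t *data, uint16_t len) {
    tx_buf = data;
    tx_len = len;
    USART2->CR1 |= USART_CR1_TXEIE;
}

// Hand-written replacement for the HAL IRQ handler.
void USART2_IRQHandler(void) {
    if ((USART2->ISR & USART_ISR_TXE) && tx_len) {
        USART2->TDR = *tx_buf++;             // feed the next byte
        if (--tx_len == 0)
            USART2->CR1 &= ~USART_CR1_TXEIE; // done: stop TXE interrupts
    }
}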

Objective-C: Calling and copying the same block from multiple threads

I'm dealing with neural networks here, but it's safe to ignore that, as the real question is about blocks in Objective-C. Here is my issue: I found a way to convert a neural network into a big block that can be executed all at once. However, it runs really, really slowly relative to activating the network directly. This seems a bit counterintuitive.
If I gave you a group of nested functions like
CGFloat answer = sin(cos(gaussian(1.5*x + 2.5*y)) + (.3*d + bias));
//or in block notation
^(CGFloat x, CGFloat y, CGFloat d, CGFloat bias) {
    return sin(cos(gaussian(1.5*x + 2.5*y)) + (.3*d + bias));
};
In theory, running that function multiple times should be easier/quicker than looping through a bunch of connections, setting nodes active/inactive, etc., all of which essentially calculates this same function in the end.
However, when I create a block (see thread: how to create function at runtime) and run this code, it is slow as all hell for any moderately sized network.
Now, what I don't quite understand is:
When you copy a block, what exactly are you copying?
Let's say, I copy a block twice, copy1 and copy2. If I call copy1 and copy2 on the same thread, is the same function called? I don't understand exactly what the docs mean for block copies: Apple Block Docs
Now if I make that copy again, copy1 and copy2, but instead, I call the copies on separate threads, now how do the functions behave? Will this cause some sort of slowdown, as each thread attempts to access the same block?
When you copy a block, what exactly are you copying?
You are copying any state the block has captured. If that block captures no state -- which that block appears not to -- then the copy should be "free" in that the block will be a constant (similar to how @"" works).
Let's say, I copy a block twice, copy1 and copy2. If I call copy1 and copy2 on the same thread, is the same function called? I don't understand exactly what the docs mean for block copies: Apple Block Docs
When a block is copied, the code of the block is never copied. Only the captured state. So, yes, you'll be executing the exact same set of instructions.
Now if I make that copy again, copy1 and copy2, but instead, I call the copies on separate threads, now how do the functions behave? Will this cause some sort of slowdown, as each thread attempts to access the same block?
The data captured within a block is not protected from multi-threaded access in any way so, no, there would be no slowdown (but there will be all the concurrency synchronization fun you might imagine).
Have you tried sampling the app to see what is consuming the CPU cycles? Also, given where you are going with this, you might want to become acquainted with your friendly local disassembler (otool -TtVv binary/or/.o/file) as it can be quite helpful in determining how costly a block copy really is.
If you are sampling and seeing lots of time in the block itself, then that is just your computation consuming lots of CPU time. If the block were to consume CPU during the copy, you would see the consumption in a copy helper.
Try creating a source file that contains a bunch of different kinds of blocks: with parameters, without, with captured state, without, with captured blocks with/without captured state, etc., and a function that calls Block_copy() on each.
Disassemble that and you'll gain a deep understanding of what happens when blocks are copied. Personally, I find x86_64 assembly easier to read than ARM. (This all sounds like good blog fodder -- I should write it up.)
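A minimal sketch of such a file (plain C with clang's blocks extension, the same blocks runtime Objective-C uses; compile with -fblocks and, off Apple platforms, link against a blocks runtime):

#include <Block.h>
#include <stdio.h>

int main(void) {
    // No captured state: copying this is essentially free (a constant block).
    int (^stateless)(int) = ^(int x) { return x * 2; };

    int captured = 40;
    // Captures 'captured' by value: Block_copy moves the state to the heap.
    int (^stateful)(int) = ^(int x) { return x + captured; };

    int (^copy1)(int) = Block_copy(stateful);
    int (^copy2)(int) = Block_copy(stateful);

    // copy1 and copy2 share the same code; only the captured state was copied.
    printf("%d %d %d\n", stateless(21), copy1(2), copy2(2)); // 42 42 42

    Block_release(copy1);
    Block_release(copy2);
    return 0;
}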

How can I get a callback when there is some data to read on a boost.asio stream without reading it into a buffer?

It seems that since boost 1.40.0 there has been a change to the way the async_read_some() call works.
Previously, you could pass in null_buffers and you would get a callback when there was data to read, but without the framework reading the data into any buffer (because there wasn't one!). This basically allowed you to write code that acted like a select() call, where you would be told when your socket had some data on it.
In the new code the behaviour has been changed to work in the following way:
If the total size of all buffers in the sequence mb is 0, the asynchronous read operation shall complete immediately and pass 0 as the argument to the handler that specifies the number of bytes read.
This means that my old way of detecting data on the socket (and, incidentally, the method shown in this official example) no longer works. The problem for me is that I need a way of detecting this because I've layered my own streaming classes on top of the asio socket streams, and as such I cannot just read data off the sockets that my streams will expect to be there. The only workaround I can think of right now is to read a single byte, store it, and when my stream classes then request some bytes, return that byte if one is set: not pretty.
Does anyone know of a better way to implement this kind of behaviour under the latest boost.asio code?
My quick test with an official example on boost 1.41 works... so I think it should still work (if you use null_buffers).
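For reference, a minimal sketch of the reactor-style wait (null_buffers is the API under discussion; the address, port and handler are just placeholders, and a C++11 lambda is used for brevity):

#include <boost/asio.hpp>
#include <iostream>

int main() {
    boost::asio::io_service io;
    boost::asio::ip::tcp::socket sock(io);
    sock.connect(boost::asio::ip::tcp::endpoint(
        boost::asio::ip::address::from_string("127.0.0.1"), 5000));

    // Passing null_buffers means: tell me when data is readable, but don't
    // consume anything from the socket.
    sock.async_read_some(boost::asio::null_buffers(),
        [&](const boost::system::error_code& ec, std::size_t /*bytes*/) {
            if (!ec)
                std::cout << sock.available()
                          << " bytes ready; read them however you like\n";
        });

    io.run();
    return 0;
}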