Is there an equivalent to TCP_CORK in Winsock? - sockets

In many UNIX TCP implementations, a socket option TCP_CORK is provided which allows the caller to bypass Nagle's algorithm and explicitly specify when to send a physical packet. Is there an equivalent feature in Windows (Winsock)?
TCP_CORK (since Linux 2.2)
If set, don't send out partial frames. All queued partial frames are sent when the option is cleared again. This is useful for prepending headers before calling sendfile(2), or for throughput optimization. As currently implemented, there is a 200 millisecond ceiling on the time for which output is corked by TCP_CORK. If this ceiling is reached, then queued data is automatically transmitted. This option can be combined with TCP_NODELAY only since Linux 2.5.71. This option should not be used in code intended to be portable.
(I'm aware of TCP_NODELAY, but this isn't what I need; I still want multiple writes to be accumulated in the send buffer, and then trigger the TCP stack when I'm ready for it to send a physical packet.)
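For reference, this is roughly how the Linux-only TCP_CORK option described in the quoted man page is used (a minimal sketch; `sock` is assumed to be a connected TCP socket, error handling omitted):

```c
/* Minimal sketch of Linux TCP_CORK usage; `sock` is assumed to be a
 * connected TCP socket, error handling omitted. */
#include <stddef.h>
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_CORK    */
#include <sys/socket.h>   /* setsockopt, send */

static void cork(int sock, int on)
{
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

static void send_header_and_body(int sock, const char *hdr, size_t hlen,
                                 const char *body, size_t blen)
{
    cork(sock, 1);             /* hold partial frames in the kernel */
    send(sock, hdr, hlen, 0);
    send(sock, body, blen, 0);
    cork(sock, 0);             /* uncork: flush the queued data     */
}
```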

FWIW I successfully use TCP_NODELAY to get TCP_CORK-style behavior. I do it like this:
1. Unset the TCP_NODELAY flag on the socket.
2. Call send() zero or more times to add your outgoing data to the Nagle queue.
3. Set the TCP_NODELAY flag on the socket.
4. Call send() with the number-of-bytes argument set to zero, to force an immediate send of the Nagle-queued data.
That works fine for me under Windows, MacOS/X, and Linux. (Note that under Linux the final zero-byte send() isn't necessary.)
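A minimal sketch of those four steps, assuming `sock` is an already-connected TCP socket (on Windows it would be a SOCKET handle obtained after WSAStartup(); the (const char *) cast keeps the setsockopt() call valid on both Winsock and POSIX):

```c
/* Sketch of the four steps above; error handling omitted. */
#ifdef _WIN32
#include <winsock2.h>
#else
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#endif
#include <stddef.h>

static void set_nodelay(int sock, int on)
{
    /* The (const char *) cast works for both Winsock and POSIX. */
    setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, (const char *)&on, sizeof(on));
}

static void corked_send(int sock, const char *hdr, size_t hlen,
                        const char *body, size_t blen)
{
    set_nodelay(sock, 0);          /* 1. Nagle on: queue small writes */
    send(sock, hdr, hlen, 0);      /* 2. add data to the Nagle queue  */
    send(sock, body, blen, 0);
    set_nodelay(sock, 1);          /* 3. Nagle off                    */
    send(sock, "", 0, 0);          /* 4. zero-byte send flushes the   */
}                                  /*    queue (not needed on Linux)  */
```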

There is no equivalent. The best you can do is gather your data pieces into your own buffer first, and then send the completed buffer to the socket when ready, and let Nagle handle the packets normally.
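A minimal sketch of that gather-then-send approach, assuming a connected TCP socket `sock` and that the pieces fit in one application buffer (the function name and buffer size are illustrative only):

```c
/* Sketch of gathering pieces into one application buffer, then issuing
 * a single send(); names and sizes are illustrative only. */
#include <string.h>
#include <sys/socket.h>  /* on Windows: winsock2.h */

static int send_gathered(int sock, const void *hdr, size_t hlen,
                         const void *body, size_t blen)
{
    char buf[1500];                   /* example size only        */
    if (hlen + blen > sizeof(buf))
        return -1;                    /* caller must split/grow   */
    memcpy(buf, hdr, hlen);
    memcpy(buf + hlen, body, blen);
    return (int)send(sock, buf, hlen + blen, 0);  /* one buffer, one send */
}
```

On POSIX you could also use writev(), or WSASend() with multiple WSABUF entries on Winsock, to gather the pieces without copying them into a staging buffer first.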

Related

What is the difference among these methods of checking when a NIC receives a packet?

I am trying to benchmark the performance of a SolarFlare NIC, especially when using Onload. To do this, I am looking for ways to record the time at which a packet arrives, and I have found several methods that can be used in a UNIX environment:
1. After receiving a packet through the socket, call ioctl() on that socket with the option SIOCGSTAMP to get the time at which the last packet arrived on the socket.
2. Using setsockopt(), set the option SO_TIMESTAMPNS on the socket. By calling recvmsg(), get the cmsg and check the timestamp written in it.
3. Same as 2, but use the option SO_TIMESTAMPING with the flags SOF_TIMESTAMPING_RX_SOFTWARE and SOF_TIMESTAMPING_SOFTWARE.
4. Same as 3, but use the flags SOF_TIMESTAMPING_RX_SOFTWARE and SOF_TIMESTAMPING_RAW_HARDWARE.
But I cannot figure out the difference between these 4 methods and what's really going on with each option. I guess 1/2/3 use the kernel clock and 4 uses the NIC's own clock, but I'm not sure...
Can you explain precisely the differences between the options above, and possibly other methods for checking the time at which a packet is received?
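For concreteness, here is a hedged sketch of method 2 above (Linux-specific; the timestamp is taken in software by the kernel). It assumes `sock` is a bound UDP socket, and most error handling is omitted:

```c
/* Sketch of SO_TIMESTAMPNS: enable kernel receive timestamps, then read
 * the timestamp out of the control message returned by recvmsg(). */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <sys/uio.h>

static void recv_with_rx_timestamp(int sock)
{
    int on = 1;
    setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof(on));

    char data[2048];
    char ctrl[CMSG_SPACE(sizeof(struct timespec))];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };

    if (recvmsg(sock, &msg, 0) < 0)
        return;

    /* Walk the control messages looking for the kernel timestamp. */
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPNS) {
            struct timespec ts;
            memcpy(&ts, CMSG_DATA(c), sizeof(ts));
            printf("rx at %lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
        }
    }
}
```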

When is a file descriptor not considered available for writing? [duplicate]

When, exactly, does the BSD socket send() function return to the caller?
In non-blocking mode, it should return immediately, correct?
As for blocking mode, the man page says:
When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in non-blocking I/O mode.
Questions:
Does this mean that the send() call will always return immediately if there is room in the kernel send buffer?
Is the behavior and performance of the send() call identical for TCP and UDP? If not, why not?
Does this mean that the send() call will always return immediately if there is room in the kernel send buffer?
Yes, as long as "immediately" means after the memory you provided has been copied to the kernel's buffer. Which, in some edge cases, may not be so immediate: for instance, if the pointer you pass in triggers a page fault that needs to pull the buffer in from a memory-mapped file or from swap, that would add significant delay to the call returning.
Is the behavior and performance of the send() call identical for TCP and UDP? If not, why not?
Not quite. Possible performance differences depend on the OS's implementation of the TCP/IP stack. In theory, a UDP socket could be slightly cheaper, since the OS needs to do fewer things with it.
EDIT: On the other hand, since you can send much more data per system call with TCP, the cost per byte is typically a lot lower with TCP. For UDP this can be mitigated with sendmmsg() in recent Linux kernels.
As for the behavior, it's nearly identical.
For blocking sockets, both TCP and UDP will block until there's space in the kernel buffer. The distinction, however, is that the UDP socket will wait until your entire buffer can be stored in the kernel buffer, whereas the TCP socket may decide to copy only a single byte into the kernel buffer (typically it's more than one byte, though).
If you try to send packets that are larger than 64 KiB, a UDP socket will likely consistently fail with EMSGSIZE. This is because UDP, being a datagram socket, guarantees to send your entire buffer as a single IP packet (or train of IP packet fragments) or not send it at all.
Non-blocking sockets behave identically to the blocking versions, with the single exception that instead of blocking (in case there's not enough space in the kernel buffer), the calls fail with EAGAIN (or EWOULDBLOCK). When this happens, it's time to put the socket back into epoll/kqueue/select (or whatever you're using) to wait for it to become writable again.
As usual when working on POSIX, keep in mind that your call may fail with EINTR (if the call was interrupted by a signal). In this case you most likely want to call send() again.
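A sketch of the send loop the two paragraphs above imply, for a non-blocking socket: retry on EINTR, treat EAGAIN/EWOULDBLOCK as "wait for writability", and account for partial sends (the function name is illustrative):

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Returns bytes queued so far; -1 on hard error. Stops early if the
 * socket would block, leaving the remainder for the next writable event. */
static ssize_t send_some(int sock, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(sock, buf + sent, len - sent, 0);
        if (n >= 0) {
            sent += (size_t)n;               /* possibly a partial send */
        } else if (errno == EINTR) {
            continue;                        /* interrupted: just retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                           /* kernel buffer full: wait
                                                for writability, call again */
        } else {
            return -1;                       /* real error              */
        }
    }
    return (ssize_t)sent;
}
```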
If there is room in the kernel buffer, then send() copies as many bytes as it can into the buffer and exits immediately, returning how many bytes were actually copied (which can be fewer than how many you requested). If there is no room in the kernel buffer, then send() blocks until either room becomes available or a timeout occurs (if one is configured).
send() will return as soon as the data has been accepted by the kernel.
In the case of a blocking socket: send() will block if the kernel buffer does not have enough free space to take the data provided to the send() call.
Non-blocking sockets: send() will not block; it will either fail and return -1, or return the number of bytes that were partially copied (depending on the buffer space available), setting errno to EWOULDBLOCK or EAGAIN. This means that at the time of the send(), the buffer could not take all the bytes, so you should use select() to determine when the socket is writable and then send() the remaining data. Alternatively, you could loop with a sleep() and call send() again, but you have to keep track of the number of bytes actually sent and the number remaining to be sent.
Does this mean that the send() call will always return immediately if there is room in the kernel send buffer?
Shouldn't it? The moment after which the data "is sent" can be defined in different ways. I think it is the moment when the OS has accepted your data for delivery on its stack; otherwise it's quite difficult to define. Is it the moment when the data is transmitted to the network card's buffer? Or the moment when the data is pushed out of the network card's buffer?
Is there a problem for which you need to know this for sure, or are you just curious?
Your presumption is correct. If there is room in the kernel send buffer, the kernel will copy the data into the send buffer and send() will return.

Why would one need to use the `MSG_WAITALL` flag instead of the `0` flag? Why use it with UDP?

At some point when coding sockets one will face the receive-family of functions (recv, recvfrom, recvmsg).
These functions accept a flags argument, and I see the MSG_WAITALL flag used in many examples on the web, such as this example on UDP.
Here is the definition of the MSG_WAITALL flag:
MSG_WAITALL (since Linux 2.2)
This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned. This flag has no effect for datagram sockets.
Hence, my two questions:
Why would one need to use the MSG_WAITALL flag instead of the 0 flag? (Could someone describe a scenario where using it solves a problem?)
Why use it with UDP?
As the quoted man page mentions, MSG_WAITALL has no effect on UDP sockets, so there's no reason to use it there. Examples that do use it are probably confused and/or the result of several generations of cargo-cult/copy-and-paste programming. :)
For TCP, OTOH, the default behavior of recv() is to block until at least one byte of data can be copied into the user's buffer from the socket's incoming-data-buffer. The TCP stack will try to provide as many bytes of data as it can, of course, but in a case where the socket's incoming-data-buffer contains fewer bytes of data than the user has passed in to recv(), the TCP stack will copy as many bytes as it can, and return the byte-count indicating how many bytes it actually provided.
However, some people would prefer to have their recv() call keep blocking until all of the bytes in their passed-in array have been filled in, regardless of how long that might take. For those people, the MSG_WAITALL flag provides a simple way to obtain that behavior. (The flag is not strictly necessary, since the programmer could always emulate that behavior by writing a while() loop that calls recv() multiple times as necessary, until all the bytes in the buffer have been populated... but it's provided as a convenience nonetheless.)
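A sketch of that while() loop, i.e. roughly what `recv(sock, buf, len, MSG_WAITALL)` gives you on a blocking TCP socket (the function name is illustrative):

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Keep calling recv() until `len` bytes have arrived, the peer closes
 * the connection, or a real error occurs. */
static ssize_t recv_all(int sock, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(sock, buf + got, len - got, 0);
        if (n == 0)
            break;                 /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted: retry          */
            return -1;             /* real error                  */
        }
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```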

What guarantees does UDP give?

UDP packets obviously can arrive multiple times, not at all and out of order.
But if packets arrive, is it guaranteed, that any call to recvfrom and similar functions will return exactly one complete packet the sender sent via sendto (or similar)? In other words, is it possible to receive incomplete packets or multiple packets at once? Is it dependent on the OS, or does the standard mandate a certain behavior?
As I mentioned in a comment, the UDP specification (RFC 768) does not specify the behavior of the "interface" between an application program and the OS infrastructure that handles UDP messages.
However, POSIX specification does address this. The key section of the recvfrom spec says this:
The recvfrom() function shall return the length of the message written to the buffer pointed to by the buffer argument. For message-based sockets, such as SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET, the entire message shall be read in a single operation. If a message is too long to fit in the supplied buffer, and MSG_PEEK is not set in the flags argument, the excess bytes shall be discarded.
Note the use of the word "shall". Any OS <-> application API that claims to conform to the POSIX spec would be bound by that language.
In simple terms, any POSIX compliant recvfrom will return one complete UDP message in the buffer provided that the buffer space provided is large enough. If it is not large enough, "excess" bytes will be discarded.
(Some recvfrom implementations support a non-standard MSG_TRUNC flag that allows the application to find out the actual message length. Check the OS-specific manual page for details.)
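For example, on Linux the non-standard flag mentioned above can be used like this (a sketch; don't rely on MSG_TRUNC in portable code):

```c
/* Linux-specific: with MSG_TRUNC, recv() returns the datagram's real
 * length even if it didn't fit, so truncation can be detected. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

static void recv_detect_truncation(int sock)
{
    char buf[512];
    ssize_t n = recv(sock, buf, sizeof(buf), MSG_TRUNC);
    if (n > (ssize_t)sizeof(buf))
        fprintf(stderr, "datagram was %zd bytes; %zd were discarded\n",
                n, n - (ssize_t)sizeof(buf));
}
```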
The recv family of system calls don't behave like that. They do not return frames or packets; they transfer layer-3 payload bytes stored in the host's internal receive buffers to the user application's buffer. In other words, what ultimately determines how many bytes are passed up is the size of the user's buffer. The behaviour is to try to fill this buffer; if that's not possible, to return whatever data has arrived; and if that's not possible either, to block or to return no data, depending on how the recv is configured.
From the recv man page (my emphasis):
If a message is too long to fit in the supplied buffer, excess bytes *may be discarded* depending on the type of socket the message is received from.
If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking (see fcntl(2)), in which case the value -1 is returned and the external variable errno is set to EAGAIN or EWOULDBLOCK. The receive calls normally return any data available, up to the requested amount, rather than waiting for receipt of the full amount requested.
The one other factor that needs to be taken into account is the internal receive buffer size. This is a fixed size, and attempting to add more data to an already-full buffer can result in loss of data. The value can be set with the SO_RCVBUF socket option; from the socket man page:
SO_RCVBUF: Sets or gets the maximum socket receive buffer in bytes. The kernel doubles this value (to allow space for bookkeeping overhead) when it is set using setsockopt(2), and this doubled value is returned by getsockopt(2). The default value is set by the /proc/sys/net/core/rmem_default file, and the maximum allowed value is set by the /proc/sys/net/core/rmem_max file. The minimum (doubled) value for this option is 256.
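A small sketch of setting SO_RCVBUF and reading back the doubled value the man page describes (Linux; the requested size here is arbitrary):

```c
#include <stdio.h>
#include <sys/socket.h>

static void grow_receive_buffer(int sock)
{
    int want = 1 << 20;                                   /* ask for 1 MiB */
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));

    int got = 0;
    socklen_t len = sizeof(got);
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("effective receive buffer: %d bytes\n", got);  /* ~2 * want,
        capped by /proc/sys/net/core/rmem_max */
}
```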

Socket read with pcap

I have a socket bound to a NIC that I am using to capture packets in a pcap_loop.
I have a separate process running that eventually does a "read" on that same device, but only after a unix local pipe is ready to be read. Is it correct to say that the read() on the device from the 2nd process will read everything that's ready, not just one packet at a time, even though my other process is set up to use pcap_loop to read a packet at a time?
I have a socket bound to a NIC that I am using to capture packets in a pcap_loop.
You say "socket", so I'm guessing that this is Linux (it could also be IRIX, but that's a lot less likely, and the answer is the same in either case; other OSes don't use sockets in libpcap, the native capture mechanism on those OSes uses mechanisms other than sockets).
I have a separate process running that eventually does a "read" on that same device, but only after a unix local pipe is ready to be read. Is it correct to say that the read() on the device from the 2nd process will read everything that's ready, not just one packet at a time,
No. A PF_PACKET socket returns one packet at a time from a read().
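For illustration, a minimal Linux-only sketch of that behavior (PF_PACKET sockets require root or CAP_NET_RAW; each read() yields exactly one link-layer frame):

```c
#include <arpa/inet.h>        /* htons     */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Raw packet socket capturing all protocols on all interfaces. */
    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0) { perror("socket"); return 1; }

    char frame[65536];
    for (int i = 0; i < 5; i++) {
        ssize_t n = read(sock, frame, sizeof(frame));  /* one frame per read */
        printf("frame %d: %zd bytes\n", i, n);
    }
    close(sock);
    return 0;
}
```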
There is, by the way, no guarantee that reading from the socket with a read and handling the same socket in libpcap at the same time will work. Libpcap might be using the memory-mapped mechanism to get the packets; unless you've seen documentation on how the memory-mapped mechanism works with read()s done elsewhere, or have read the Linux kernel code enough to figure out how it works, you might not want to assume it'll work the way you want.
If, however, this is FreeBSD, as suggested (but not stated) by the tag, then what libpcap is using is a BPF device, *NOT* a socket. A read() will give you an entire bufferful of packets, and the read()s done by libpcap will give libpcap an entire bufferful of packets, even if it happens to call your callback once per packet. The same issues of read() vs. memory-mapped access could occur, but the memory-mapped BPF in later versions of FreeBSD isn't, by default, used by libpcap.