TCP buffering on read - sockets

I want to reduce the latency of a TCP server. So I read about, and used, TCP_NODELAY. Great, overall latency went down a bit! Now I'm thinking that I can probably also reduce latency when reading, but I don't understand the behavior of the TCP stack very well. For example, what happens on the receiver side in the following code if the sender sends a packet of just 25 bytes?
#define BUFFER_SIZE 4096
char buffer[BUFFER_SIZE];
received = read(common_socket, buffer, BUFFER_SIZE);
My particular question is, if the socket is blocking, when the call to read will return? Are there any cases when TCP will wait a little bit for more data to arrive before a return from the read call?

if the socket is blocking, when the call to read will return?
If there is data in the socket receive buffer or a pending end-of-stream or error it will return immediately, otherwise it will block, once, until one of those conditions occurs.
Are there any cases when TCP will wait a little bit for more data to arrive before a return from the read call?
No.

read is a blocking call, which means it will block at the read line until some data arrives.
If you receive less than your buffer size, execution moves on to the next statement and your "received" variable will hold the number of bytes that were actually read.
reference:
http://man7.org/linux/man-pages/man2/read.2.html
On success, the number of bytes read is returned (zero indicates end
of file), and the file position is advanced by this number. It is
not an error if this number is smaller than the number of bytes
requested; this may happen for example because fewer bytes are
actually available right now (maybe because we were close to end-of-
file, or because we are reading from a pipe, or from a terminal), or
because read() was interrupted by a signal. On error, -1 is
returned, and errno is set appropriately. In this case, it is left
unspecified whether the file position (if any) changes.


When, exactly, does the BSD socket send() function return to the caller?
In non-blocking mode, it should return immediately, correct?
As for blocking mode, the man page says:
When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in non-blocking I/O mode.
Questions:
Does this mean that the send() call will always return immediately if there is room in the kernel send buffer?
Is the behavior and performance of the send() call identical for TCP and UDP? If not, why not?
Does this mean that the send() call will always return immediately if there is room in the kernel send buffer?
Yes, as long as "immediately" means after the memory you provided has been copied into the kernel's buffer. Which, in some edge cases, may not be so immediate: for instance, if the pointer you pass in triggers a page fault that has to pull the buffer in from a memory-mapped file or from swap, that adds significant delay before the call returns.
Is the behavior and performance of the send() call identical for TCP and UDP? If not, why not?
Not quite. Possible performance differences depend on the OS's implementation of the TCP/IP stack. In theory a UDP socket could be slightly cheaper, since the OS needs to do less work for it.
EDIT: On the other hand, since you can send much more data per system call with TCP, the cost per byte is typically much lower with TCP. For UDP this can be mitigated with sendmmsg() on recent Linux kernels.
As for the behavior, it's nearly identical.
For blocking sockets, both TCP and UDP will block until there's space in the kernel buffer. The distinction however is that the UDP socket will wait until your entire buffer can be stored in the kernel buffer, whereas the TCP socket may decide to only copy a single byte into the kernel buffer (typically it's more than one byte though).
If you try to send packets that are larger than 64kiB, a UDP socket will likely consistently fail with EMSGSIZE. This is because UDP, being a datagram socket, guarantees to send your entire buffer as a single IP packet (or train of IP packet fragments) or not send it at all.
Non-blocking sockets behave identically to the blocking versions, with the single exception that instead of blocking (when there's not enough space in the kernel buffer), the calls fail with EAGAIN (or EWOULDBLOCK). When that happens, it's time to put the socket back into epoll/kqueue/select (or whatever you're using) to wait for it to become writable again.
As usual when working on POSIX, keep in mind that your call may fail with EINTR (if the call was interrupted by a signal). In this case you most likely want to call send() again.
If there is room in the kernel buffer, then send() copies as many bytes as it can into the buffer and exits immediately, returning how many bytes were actually copied (which can be fewer than how many you requested). If there is no room in the kernel buffer, then send() blocks until either room becomes available or a timeout occurs (if one is configured).
The send() will return as soon as the data has been accepted by the kernel.
In the case of a blocking socket: send() will block if the kernel buffer does not have enough free space to take the data passed to the send() call.
Non-blocking sockets: send() will not block; it either fails, returning -1 with errno set to EWOULDBLOCK or EAGAIN, or it returns the number of bytes it managed to copy (which may be only part of your buffer, depending on the space available). In that case the kernel buffer could not take all the bytes, and you should wait, e.g. with select(), before calling send() again. Alternatively you could loop with a sleep() between send() calls, but either way you have to track how many bytes were actually sent and how many remain.
Does this mean that the send() call will always return immediately if there is room in the kernel send buffer?
Shouldn't it? The moment after which the data "is sent" can be defined in different ways. I think it is the moment when the OS accepts your data for delivery onto its stack; otherwise it's quite difficult to define. Is it the moment the data is transferred into the network card's buffer? Or the moment it is pushed out of the network card's buffer?
Is there any problem you need to know this for sure or you are just curious?
Your presumption is correct. If there is room in the kernel send buffer, the kernel will copy the data into the send buffer and send() will return.

What guarantees does UDP give?

UDP packets obviously can arrive multiple times, not at all and out of order.
But if packets arrive, is it guaranteed, that any call to recvfrom and similar functions will return exactly one complete packet the sender sent via sendto (or similar)? In other words, is it possible to receive incomplete packets or multiple packets at once? Is it dependent on the OS, or does the standard mandate a certain behavior?
As I mentioned in a comment, the UDP specification (RFC 768) does not specify the behavior of the "interface" between an application program and the OS infrastructure that handles UDP messages.
However, POSIX specification does address this. The key section of the recvfrom spec says this:
The recvfrom() function shall return the length of the message written to the buffer pointed to by the buffer argument. For message-based sockets, such as SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET, the entire message shall be read in a single operation. If a message is too long to fit in the supplied buffer, and MSG_PEEK is not set in the flags argument, the excess bytes shall be discarded.
Note the use of the word "shall". Any OS <-> application API that claims to conform to the POSIX spec would be bound by that language.
In simple terms, any POSIX compliant recvfrom will return one complete UDP message in the buffer provided that the buffer space provided is large enough. If it is not large enough, "excess" bytes will be discarded.
(Some recvfrom implementations support a non-standard MSG_TRUNC flag that allows the application to find out the actual message length. Check the OS-specific manual page for details.)
The recv family of system calls don't behave like that. They do not return frames or packets; they transfer payload bytes held in the kernel's receive buffers into the user application's buffer. In other words, what ultimately determines how many bytes are passed up is the size of the user's buffer. The behaviour is to try to fill this buffer; if that's not possible, to return whatever data has arrived; and if no data has arrived, to block or to return nothing, depending on how the recv is configured.
From the recv man page (my emphasis)
If a message is too long to fit in the supplied buffer,
excess bytes may be discarded depending on the type of socket the
message is received from.
If no messages are available at the socket, the receive calls wait
for a message to arrive, unless the socket is nonblocking (see
fcntl(2)), in which case the value -1 is returned and the external
variable errno is set to EAGAIN or EWOULDBLOCK. The receive calls
normally return any data available, up to the requested amount,
rather than waiting for receipt of the full amount requested.
The one other factor that needs to be taken into account is the internal receive buffer size. This is a fixed size, and attempting to add more data to an already-full buffer can result in loss of data. The value can be set with the SO_RCVBUF option - from the socket man page:
SO_RCVBUF Sets or gets the maximum socket receive buffer in bytes. The
kernel doubles this value (to allow space for bookkeeping
overhead) when it is set using setsockopt(2), and this doubled
value is returned by getsockopt(2). The default value is set
by the /proc/sys/net/core/rmem_default file, and the maximum
allowed value is set by the /proc/sys/net/core/rmem_max file.
The minimum (doubled) value for this option is 256.

what happens when recv() called with null buffer

I'm curious to know what happens when the call below is made:
recv(<socket no>, NULL, <length>, 0);
And one more question: after calling recv, does the data on the socket get flushed, or what actually happens?
I don't know why you would even ask this question. Don't call recv() with a null buffer, that is my advice. The result is at best undefined, at worst a SIGSEGV.
And one more question: after calling recv, does the data on the socket get flushed, or what actually happens?
It depends. In TCP the unread data remains to be read next time. In UDP the unread part of the datagram is discarded.
If you provide a non-zero length, you'd better have a non-NULL pointer to put the data in otherwise you'll be in all sorts of trouble.
If the specified length is zero, you should get a zero return value with no data written to the buffer.
Whether or not data is discarded if the buffer provided is too small depends on the type of connection.
The return value of recv would be -1 if the buffer is NULL and the size is nonzero. I tested it and errno was set to ECONNABORTED, but that sounds like a strange error to me.
Using cygwin and gcc with posix sockets.
If you receive data and your buffer is smaller than what was sent, you can call recv repeatedly to get the rest of the data. If the socket is non-blocking, recv returns -1 when it has no more data, with errno set to EAGAIN or EWOULDBLOCK. If it is blocking, it will block until data arrives or the call is interrupted.

UDP non-blocking socket on a real-time OS: sendto() and recvfrom() can return with partial message?

This is my first message here.
I'm working with a non-blocking UDP socket on a real-time OS (OnTime and VxWorks).
I have read the documentation and some forums but I have some doubts about 'atomicity' of sendto() and recvfrom() functions:
sendto() returns the number of bytes enqueued, or an error. Is it possible that it's less than the input buffer length? Maybe the output buffer doesn't have enough free space and only a few bytes get enqueued...
recvfrom() returns the number of bytes received, or an error. Is it possible that it's less than the size of the message the source sent? I mean a partial message read...
I hope the reading and writing functions are atomic (full message or no message read/written).
I asked OnTime support and they told me that it is possible for sendto() to enqueue a partial message if the output buffer does not have enough free space. I don't know whether recvfrom() could also return a partial message in some cases. I suppose there is no standard behavior among socket implementations on different OSes.
sendto() returns the number of bytes enqueued or error. Is it possible that it's less than the input buffer length?
No. It's sent wholly, or not at all for UDP.
recvfrom() returns the number of bytes received or error. Is it possible that it's less than the size of the message the source has sent? I mean partial message reading...
If the buffers of the OS stack can't hold an entire UDP packet, it is dropped. If your application buffers can't hold the entire packet, you get the initial content of the packet.
i.e. you can read a partial message in just one case: when the data cannot fit in the buffer you pass to recvfrom(). In that case the rest of the datagram is discarded. With recvmsg() you can detect that the packet was truncated, but this is normally handled either by using a maximum-sized buffer (a UDP datagram must fit in an IP packet, whose maximum size is 2^16-1 bytes), or by designing the protocol you run inside UDP with your own reasonable maximum message size.
I'm not really familiar with these systems, but I would be very surprised if they break normal UDP socket semantics, which is always to enqueue a full datagram on "send" and to dequeue a full single datagram on a "receive".

MSG_WAITALL is to recv() as ?? is to send()

From the recv(2) man page:
MSG_WAITALL
This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned.
It doesn't look like there's an equivalent flag for send(2), which strikes me as strange. Maybe send()s are guaranteed to always accept the whole buffer, but I don't see that written anywhere (and anyway, that seems unlikely to me).
Is there a way to tell send() to wait until it's sent the whole buffer before returning, equivalently to recv()'s MSG_WAITALL?
Edit: I understand that send() just copies data into a buffer in the operating system and that I can't force send() to put data onto the wire. My question is: Can I force send() to block until all the data I gave it has been copied into the OS's buffer?
You can't. send just offloads the buffer to the kernel, then returns.
To quote from the Unix standard:
The send() function shall initiate transmission of a message from the specified socket to its peer (...)
Successful completion of a call to send() does not guarantee delivery of the message.
Note the word "initiate". It doesn't actually send anything, rather tells the OS to send the message when it's ready for it (when its buffers are full or after some time has passed).
send(2) for TCP does not actually "send" anything on the wire, but places your bytes into the socket send buffer. It tells you how many bytes it was able to copy there in the return value.
Make the send buffer bigger (see setsockopt(2) and tcp(7)), pay attention to the syscall return value. In any case, TCP is a stream. You need to manage application-level protocol yourself.