What is the difference between "Interrupt coalescing" and the "Nagle algorithm"? - sockets

Is the main difference the following?
Interrupt coalescing (ethtool -C eth1 rx-usecs 0) - coalesces received packets from different connections, i.e. it increases bandwidth but increases receive latency
Nagle's algorithm (disabled with the TCP_NODELAY socket option) - coalesces sent packets from the same connection, i.e. it increases bandwidth but increases send latency

Interrupt coalescing concerns the network driver: the idea is to avoid invoking the interrupt handler anew every time a network packet shows up. Instead, after receiving a packet, the NIC waits until M packets are received or until N microseconds have passed before generating an interrupt. Then the driver can process many packets at once. (Otherwise, with modern gigabit and 10-gigabit adapters, the processor would need to field hundreds of thousands or millions of interrupts per second, which can prevent the system from being able to accomplish much else.) As your link points out, there is (or at least may be) a cost of additional latency since the OS doesn't start processing a received packet at the earliest possible instant.
Nagle's algorithm is focused on reducing the number of packets sent by coalescing payload data from multiple packets into one. The classic example is a telnet session. Without Nagle, every time you press a key, the system has to create an entire new packet (min 64 bytes on Ethernet) to send one byte.
So the intent of interrupt coalescing is to support greater bandwidth utilization, while the intent of Nagle's algorithm is actually to produce lower bandwidth (by sending fewer packets).
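As a small illustration of the Nagle side of this, here is a minimal Python sketch (the host and port are placeholders, not from the question) showing how an application opts out of Nagle's coalescing with TCP_NODELAY, trading the bandwidth savings described above for lower send latency:

import socket

HOST, PORT = "example.com", 5000  # hypothetical endpoint, for illustration only

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: small writes go out immediately instead of being
# coalesced while waiting for the ACK of previously sent data.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect((HOST, PORT))
sock.sendall(b"x")  # a one-byte payload is sent right away
sock.close()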

Related

How exactly do socket receives work at a lower level (eg. socket.recv(1024))?

I've read many stack overflow questions similar to this, but I don't think any of the answers really satisfied my curiosity. I have an example below which I would like to get some clarification.
Suppose the client is blocking on socket.recv(1024):
socket.recv(1024)
print("Received")
Also, suppose I have a server sending 600 bytes to the client. Let us assume that these 600 bytes are broken into 4 small packets (of 150 bytes each) and sent over the network. Now suppose the packets reach the client at different times, 0.0001 seconds apart (e.g. one packet arrives at 12.00.0001pm, the next at 12.00.0002pm, and so on).
How does socket.recv(1024) decide when to return execution to the program and allow the print() function to execute? Does it return execution immediately after receiving the 1st packet of 150 bytes? Or does it wait for some arbitrary amount of time (eg. 1 second, for which by then all packets would have arrived)? If so, how long is this "arbitrary amount of time"? Who determines it?
Well, that will depend on many things, including the OS and the speed of the network interface. For a 100 gigabit interface, 100us is "forever", but on a 10 Mbit interface you can't even transmit the packets that fast. So I won't pay too much attention to the exact timing you specified.
Back in the day when TCP was being designed, networks were slow and CPUs were weak. Among the flags in the TCP header is the "Push" flag to signal that the payload should be immediately delivered to the application. So if we hop into the Wayback machine, the answer would have been something like: it depends on whether or not the PSH flag is set in the packets. However, there is generally no user space API to control whether or not the flag is set. Generally what would happen is that for a single write that gets broken into several packets, the final packet would have the PSH flag set. So the answer for a slow network and a weakling CPU might be that if it was a single write, the application would likely receive the 600 bytes. You might then think that using four separate writes would result in four separate reads of 150 bytes, but after the introduction of Nagle's algorithm the data from the second to fourth writes might well be sent in a single packet unless Nagle's algorithm was disabled with the TCP_NODELAY socket option, since Nagle's algorithm will wait for the ACK of the first packet before sending anything less than a full frame.
If we return from our trip in the Wayback machine to the modern age, where 100 Gigabit interfaces and 24-core machines are common, our problems are very different and you will have a hard time finding an explicit check for the PSH flag being set in the Linux kernel. What is driving the design of the receive side is that networks are getting much faster while the packet size/MTU has remained largely fixed, and CPU clock speeds are flatlining while cores are abundant. Reducing per-packet overhead (including hardware interrupts) and distributing the packets efficiently across multiple cores is imperative. At the same time it is imperative to get the data from that 100+ Gigabit firehose up to the application ASAP. One hundred microseconds of data on such a NIC is a considerable amount of data to be holding onto for no reason.
I think one of the reasons that there are so many questions of the form "What the heck does receive do?" is that it can be difficult to wrap your head around what is a thoroughly asynchronous process, whereas the send side has a more familiar control flow where it is much easier to trace the flow of packets to the NIC and where we are in full control of when a packet will be sent. On the receive side, packets just arrive when they want to.
Let's assume that a TCP connection has been set up and is idle, there is no missing or unacknowledged data, the reader is blocked on recv, and the reader is running a fresh version of the Linux kernel. And then a writer writes 150 bytes to the socket and the 150 bytes gets transmitted in a single packet. On arrival at the NIC, the packet will be copied by DMA into a ring buffer, and, if interrupts are enabled, it will raise a hardware interrupt to let the driver know there is fresh data in the ring buffer. The driver, which desires to return from the hardware interrupt in as few cycles as possible, disables hardware interrupts, starts a soft IRQ poll loop if necessary, and returns from the interrupt. Incoming data from the NIC will now be processed in the poll loop until there is no more data to be read from the NIC, at which point it will re-enable the hardware interrupt. The general purpose of this design is to reduce the hardware interrupt rate from a high speed NIC.
Now here is where things get a little weird, especially if you have been looking at nice clean diagrams of the OSI model where higher levels of the stack fit cleanly on top of each other. Oh no, my friend, the real world is far more complicated than that. That NIC that you might have been thinking of as a straightforward layer 2 device, for example, knows how to direct packets from the same TCP flow to the same CPU/ring buffer. It also knows how to coalesce adjacent TCP packets into larger packets (although this capability is not used by Linux and is instead done in software). If you have ever looked at a network capture, seen a jumbo frame, and scratched your head because you sure thought the MTU was 1500, it is because this processing happens at such a low level that it occurs before netfilter can get its hands on the packet. This packet coalescing is part of a capability known as receive offloading. In particular, let's assume that your NIC/driver has generic receive offload (GRO) enabled (which is not the only possible flavor of receive offloading), the purpose of which is to reduce the per-packet overhead from your firehose NIC by reducing the number of packets that flow through the system.
So what happens next is that the poll loop keeps pulling packets off of the ring buffer (as long as more data is coming in) and handing them off to GRO to consolidate if it can, and then they get handed off to the protocol layer. As best I know, the Linux TCP/IP stack is just trying to get the data up to the application as quickly as it can, so I think your question boils down to "Will GRO do any consolidation on my 4 packets, and are there any knobs I can turn that affect this?"
Well, the first thing you can do is disable any form of receive offloading (e.g. via ethtool), which I think should get you 4 reads of 150 bytes for 4 packets arriving like this in order, but I'm prepared to be told I have overlooked another reason why the Linux TCP/IP stack won't send such data straight to the application if the application is blocked on a read as in your example.
The other knob you have, if GRO is enabled, is GRO_FLUSH_TIMEOUT, a per-NIC timeout in nanoseconds which can be (and I think defaults to) 0. If it is 0, I think your packets may get consolidated (there are many details here, including the value of MAX_GRO_SKBS) if they arrive while the soft IRQ poll loop for the NIC is still active, which in turn depends on many things unrelated to the four packets in your TCP flow. If it is non-zero, they may get consolidated if they arrive within GRO_FLUSH_TIMEOUT nanoseconds, though to be honest I don't know if this interval could span more than one instantiation of a poll loop for the NIC.
There is a nice writeup on the Linux kernel receive side here which can help guide you through the implementation.
A normal blocking receive on a TCP connection returns as soon as there is at least one byte to return to the caller. If the caller would like to receive more bytes, they can simply call the receive function again.
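To make that concrete, here is a small Python sketch, assuming a connected TCP socket named sock and a message size known in advance (600 bytes, as in the question), of the usual pattern for collecting an exact number of bytes, since a single recv() may return as little as one byte:

def recv_exact(sock, nbytes):
    # Keep calling recv() until exactly nbytes have been collected;
    # each call may return anywhere from 1 byte up to nbytes.
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:  # peer closed the connection before sending everything
            raise ConnectionError("connection closed early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

data = recv_exact(sock, 600)
print("Received")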

UDP Packet loss simulation & probability

I am currently creating server software that communicates with multiple Arduino boards. Due to the hardware, I am using the UDP protocol. I have a pretty simple mechanism that will resend packets in most of the cases when they get lost. I now have two questions:
How probable is it that UDP packets get lost in a network with no Internet access, about 20 Arduinos, and one computer? Is it even necessary to have a resend method?
Is there a way I can simulate UDP Packet loss in this network to check if the resend mechanisms are working?
How probable is it that UDP packets get lost in a network with no Internet access, about 20 Arduinos, and one computer?
The probability is 100% that sooner or later a packet will be dropped.
If you want a more detailed statistic, like the probability of a packet being dropped within any particular period of time, the only real way to know is to try it and find out (using e.g. sequence numbers in the packets so that the receiver(s) can detect when a packet has been dropped by noting the skipped sequence number). The probability will depend very much on the size of the packets, the speed at which the packets are being sent, the CPU speed of the receivers, what other tasks the receivers are spending CPU time on, the quality of your Ethernet switch, the quality of your Ethernet cables, the phase of the moon, etc etc.
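As a rough Python sketch of that measurement idea (the port number and packet format are invented here purely for illustration), the sender stamps each datagram with a sequence number and the receiver counts the gaps; note that this simple version would also count reordered packets as drops:

import socket
import struct

PORT = 9999  # hypothetical test port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", PORT))

expected = 0
dropped = 0
while True:
    data, addr = sock.recvfrom(2048)
    (seq,) = struct.unpack("!I", data[:4])   # first 4 bytes carry the sequence number
    if seq > expected:
        dropped += seq - expected            # these sequence numbers never arrived
        print("detected %d dropped packet(s), %d total" % (seq - expected, dropped))
    expected = seq + 1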
Is it even necessary to have a resend method?
That depends on what the consequences would be to having a packet dropped. For some applications (e.g. streaming audio or video, or audio-metering data), dropping a packet is no big deal; you just ignore the fact that some data was lost, and continue on with the next packet as usual. For other applications (e.g. file transmit/receive), the loss of a packet means the loss of data that the receiver needs, so you'll want to have some way to recover from that loss, e.g. by detecting it and triggering a resend, or else the entire transfer will fail (or at least the receiver will end up with only a partial file).
Is there a way I can simulate UDP Packet loss in this network to check if the resend mechanisms are working?
Sure, just put some logic into the receivers so that they occasionally pretend to not have received a packet:
int numBytesReceived = recv(...);
if ((rand()%100) == 0) // Simulate a 1% packet loss rate
{
    printf("Pretending to have dropped a packet!\n");
}
else
{
    // handle the incoming packet as usual
}

How to receive a large number of UDP packets continuously in VC++

I am writing a GUI application which continuously receives UDP packets from an FPGA board (about 4Gb of data); the application is a data retrieval system.
I created my own class inherited from CAsyncSocket, and on the receive notification I read packets through the ReceiveFrom API and write the data to a file.
As packets are sent continuously from the FPGA (about 400k packets of 1KB data each), my application is missing packets: I am receiving only about 200k of them. But when I monitor with Wireshark, all of the packets are seen.
Can anyone suggest a technique or algorithm to solve this problem, so that I can receive a large number of UDP packets without loss?
The first thing to understand and accept is that you cannot guarantee that no UDP packets will be dropped. It is part of the nature of the UDP transport layer that any step in the transmission is allowed to drop a UDP packet for any reason, and that this is something that will happen from time to time. In your case, it sounds like the Windows networking stack is dropping the incoming UDP packets after receiving them from the network card, probably because the incoming-UDP-packets buffer associated with your socket is too full and does not have room to store them. This could happen for example if your write-to-disk calls occasionally take a number of milliseconds to return, during which time your app is unable to read more data from the UDP socket.
That said, there are a few things you can do to make the dropping of packets somewhat less likely.
The first (and easiest) thing to do is to increase the size of your socket's incoming-packets-buffer, using setsockopt(SO_RCVBUF). This helps because the larger the buffer is, the more time your program will have to read packets out of the buffer before the networking stack fills the buffer up entirely and starts dropping packets because it has no place to put them.
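As an illustration of that option (shown in Python for brevity; the equivalent setsockopt(SO_RCVBUF) call exists in Winsock as well, and the 4 MB figure is just an example value):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Ask the OS for a larger incoming-packet buffer (example: 4 MB). The OS may
# round or cap the value (on Linux, see net.core.rmem_max).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
sock.bind(("", 5005))  # hypothetical port
print("effective SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))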
If that isn't sufficient for your purposes, the other thing you can do is spawn a separate thread that does nothing but receive incoming UDP packets and add them to a queue (for another thread to process later). Because this thread does nothing else besides receive UDP packets, it will be able to respond quickly when new packets have arrived, and thus the incoming-sockets-buffer will be less likely to ever fill up and overflow. You'll probably want to run this thread at a high priority if possible, so that there is less chance of it being held off of the CPU in the case where other threads or programs are competing for CPU time.
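Here is a minimal sketch of that receiver-thread-plus-queue arrangement, again in Python only to show the structure (a real VC++ application would use a Win32 or MFC thread and its own queue type; the port and file name are placeholders):

import queue
import socket
import threading

packet_queue = queue.Queue()

def receiver(sock):
    # Do nothing but pull datagrams off the socket and queue them, so the
    # socket's incoming buffer is drained as quickly as possible.
    while True:
        data, _ = sock.recvfrom(2048)
        packet_queue.put(data)

def writer(path):
    # The slower work (writing to disk) happens here, off the receive path.
    with open(path, "wb") as f:
        while True:
            f.write(packet_queue.get())

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", 5005))          # hypothetical port
threading.Thread(target=receiver, args=(sock,), daemon=True).start()
writer("capture.bin")          # hypothetical output file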
If you've implemented both of the above and the rate of packet loss still isn't acceptable, then you may have to step back and re-evaluate your approach. This might include switching from UDP protocol to TCP, or rewriting your code as an in-kernel driver, or switching to a real-time OS that can make better guarantees about response times.

TCP Socket Transfer

A while back I had a question about why my socket sometimes received only 653 octets (for example) when I sent 1024 octets, and thanks to Rakis I understood: the OS allows reception to occur in arbitrarily sized chunks.
This time I need a confirmation :)
On any OS (well, GNU/Linux and Windows at least), in any language (I'm using Python here), if I send a packet of a random number of bytes, say X (it can be 2 bytes, it can be 12000 bytes), when I write socket.send(X), am I absolutely guaranteed that X will be FULLY received (regardless of any chunks the receiving OS divides it into) on the other end of the socket BEFORE I do another socket.send(any string)?
Or, in other words, if I have the code:
socket.send(X)
socket.send(Y)
Even if X > MTU, so it will be obliged to send multiple packets, does it wait until every packet is sent and acknowledged by the endpoint of the socket before sending Y? Well, writing that makes me believe that the answer is yes, it is guaranteed, and that this is exactly the purpose of setting a socket in blocking mode, but I want to be sure :D
Thanks in advance,
Nolhian
You are guaranteed that X will be received (at the application level) before Y, if it's a stream socket. If it's a datagram socket, no guarantees.
Depending on the networking implementation, it's possible that at a lower level, X will be sent, lost in transmission, then Y will be sent, then X will be re-sent because no acknowledgement was received.
Even in blocking mode, the socket.send(Y) can execute before X even makes it "onto the wire", because the OS will buffer network traffic.
No, you have no such guarantee.
All you know is that the client will receive the data in order, assuming it does receive it all. There's no way of knowing (at the application level) whether the client has received all the data without having some sort of "ACK" at the application level protocol.
am i absolutely guaranteed that X will be FULLY received ( regardless of any chunks the receiving OS divides it into ) on the other end of the socket BEFORE i do another socket.send(any string) ?
No. In general, more data may be sent without waiting for the receiving side, within certain limits: on the sending side, you will have a maximum amount of data you can enqueue for transmission until the client has acknowledged some receipt (but typically the client's OS will acknowledge and buffer quite a lot before it refuses further data until the application has processed some), after which the sending socket may start blocking. This limit:
1) forces the application design to consider how to enqueue and buffer excessive amounts of data, rather than having naively written applications utilise excessive amounts of Operating System-provided buffer memory;
2) reduces retransmission rates when the receiving side is flooded with data too fast to process it;
3) avoids sending huge amounts of data despite the network connection having been lost.
So, strictly speaking and for large transmissions, the sender should be designed to handle sockets being blocked from further sends: either knowing it is OK to block in the attempt (perhaps because a dedicated sending thread is used), or waiting until it is possible to send more, via non-blocking sockets and select/poll.
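As a rough Python sketch of the non-blocking/select approach just mentioned (assuming an already-connected socket named sock; error handling is simplified):

import select

def send_all_nonblocking(sock, data):
    # Send data on a non-blocking socket, waiting with select() whenever the
    # OS send buffer is full and send() cannot accept more bytes.
    sock.setblocking(False)
    view = memoryview(data)
    while view:
        try:
            sent = sock.send(view)
            view = view[sent:]             # send() may accept only part of the data
        except BlockingIOError:
            select.select([], [sock], [])  # wait until the socket is writable again
    sock.setblocking(True)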
Whatever retransmission and buffering may be required, what you CAN be sure of is that the receiving side will have to read all of "X" before it starts being given the subsequently sent data "Y" (unless it specifically asks to have it otherwise, e.g. Out Of Band data).
Depending on the type of socket that you use, you can, in some cases, have a guarantee that the data will be received, but no feedback or confirmation of when it actually was.
Back to your question:
does it wait until every packet is sent and acknowledged by the endpoint of the socket before sending Y
So, you could say:
YES, it waits until the data has been handed off to the OS for sending (copied into the socket's send buffer), and
NO, it does not wait for the data to be transmitted or acknowledged
A suggestion:
Since there are no auto-magic/built-in confirmations that your data was received, you could fairly easily implement your own logic for acknowledging that the package was received, which would basically come down to defining your own custom communication protocol.
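A minimal sketch of such an application-level acknowledgement, assuming a tiny protocol invented here just for illustration (the receiver's application sends back the literal bytes b"ACK\n" once it has processed a message):

def send_with_ack(sock, payload, timeout=5.0):
    # Send the payload, then block until the peer's application confirms it
    # processed the message (our own protocol, not TCP's transport-level ACK).
    sock.sendall(payload + b"\n")
    sock.settimeout(timeout)
    reply = sock.recv(16)
    if reply.strip() != b"ACK":
        raise RuntimeError("peer did not acknowledge the message")

def handle_and_ack(conn):
    # Receiving side: after handling a message, send the confirmation back.
    data = conn.recv(4096)
    if data:
        # ... process data ...
        conn.sendall(b"ACK\n")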

Raw Socket Receive Buffer

We are currently testing a telecom application over IP. We open a raw socket and receive messages from the remote side (message rate of roughly 750+ msgs/second, each approximately 180 bytes excluding the IP header).
On top of the raw socket sits a layer called SCTP (similar to TCP), which indicates every now and then that it is missing some packets. However, we are running Wireshark on the receiving node, and we can see those packets in the capture.
It looks to me like the receive buffer of the socket is too small, causing IP(?) to drop messages. However, the IP statistics (netstat -sv) show NO dropped packets. We have tried setting the socket receive queue to 40000 without any success.
I would appreciate any pointers as to what option, if any, of IP layer should we be configuring or is there any specific socket option that we need to set.
Thanks for your inputs. However, we have been able to "solve" this problem.
Earlier, I described how we read messages.
Once select returns, we run a loop (up to the number of raw messages to read, which was > 1 in our case):
1) we call ioctl(FIONREAD) to find the number of bytes to read;
2) read that many bytes by calling recvfrom;
3) send the bytes up to the user;
4) go around the loop again, calling ioctl(FIONREAD) and repeating the steps.
However, at step 4, ioctl(FIONREAD) would return 0. Our code had a defensive check for this: it assumed that 0 bytes from ioctl(FIONREAD) meant the sender had sent an IP header with a 0-byte payload, and it therefore called recvfrom(bytes to read=0) to flush out that IP header, so that select would not fire on it again.
At time t0, ioctl(FIONREAD) returns 0 as the number of bytes to read.
At time t1, recvfrom(bytes to read=0) is called.
Sometimes, between t0 and t1, actual data would get queued in the socket receive queue and would then be discarded, because we were calling recvfrom(bytes=0).
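A rough Python rendering of the loop described above, only to make the race visible (FIONREAD via fcntl is Linux-specific, and raw_msgs_to_read and handle are hypothetical names standing in for the original code):

import array
import fcntl
import select
import termios

def bytes_available(sock):
    # Plays the role of the ioctl(FIONREAD) call in the original code.
    buf = array.array("i", [0])
    fcntl.ioctl(sock.fileno(), termios.FIONREAD, buf, True)
    return buf[0]

def read_loop(sock, raw_msgs_to_read, handle):
    while True:
        select.select([sock], [], [])
        for _ in range(raw_msgs_to_read):
            n = bytes_available(sock)            # time t0
            if n == 0:
                # The defensive check: assume a zero-payload IP header and
                # "flush" it. If real data arrived between t0 and this call
                # (time t1), it is read into a zero-length buffer and
                # effectively discarded -- the race described above.
                sock.recvfrom(0)
                continue
            data, addr = sock.recvfrom(n)        # read the reported bytes
            handle(data, addr)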
Setting the number of rawMsgsToRead to 1 has "solved" this problem. However, my guess is that it will impact our performance. Is there any ioctl call which can differentiate between 0 octets in the queue and an IP header with a 0-byte payload?
I have a few questions and a few things to think about.
1) Which implementation of SCTP are you using, and on which OS? Some SCTP implementations are more robust than others.
2) Is SCTP negatively acknowledging the dropped packets? Search for gap ACKs in Wireshark.
3) Where you see the dropped packets in Wireshark, are you sure that these are not retransmissions?
4) Where in the system is Wireshark monitoring? If it is not on the same wire as your application, then it may be seeing messages which your application doesn't.
5) What exactly is the indication SCTP is giving?
If you believe that the IP socket receive buffer is overflowing, then you could consider reducing the size of the SCTP receive window; this is often configurable in SCTP stacks. The receive window limits the amount of data that can be outstanding awaiting acknowledgement, and consequently restricts the amount of data which could be sitting in the IP buffer.
You could also try raising the priority of your SCTP task so that it reads messages out of the IP buffers more quickly (this may be the easiest thing to try, and in my opinion it is a good thing to do).
Regards