setsockopt with SO_SENDBUF/SO_RECVBUF is not working - sockets

I was using setsockopt with SO_SENDBUF/SO_RECVBUF for setting the send/receive buffer of TCP with 256*1024 bytes.
But when I see in wireshark, I can see that the TCP's "Window Size" is shown as only 1525.
Also wmem_max and rmem_max are set with values 131071(126 kb).So ideally I was expecting it to be at least 128 kbps.
can anyone please help with this ?
Is this can also be a problem of wireshark where it is showing wrong "Window size".

You need to set that size on the listening socket at the server, before any accepts(), and at the client you need to set it on the socket before you connect it. That way you are allowing TCP's 'window scaling' option to take effect, which can only happen during the connect handshake. After the connection is established it is too late. That way the TCP receive window can be as large as the receive buffer, assuming various other conditions hold.
However unless you have an extraordinarily high-latency network with extraordinary bandwidth, 256k may be too large a size. There is no point whatsoever in setting it higher than the bandwidth-delay product, which can be calculated as the bandwidth in bytes/second times the delay in seconds, giving a result in bytes.

Related

How exactly do socket receives work at a lower level (eg. socket.recv(1024))?

I've read many stack overflow questions similar to this, but I don't think any of the answers really satisfied my curiosity. I have an example below which I would like to get some clarification.
Suppose the client is blocking on socket.recv(1024):
socket.recv(1024)
print("Received")
Also, suppose I have a server sending 600 bytes to the client. Let us assume that these 600 bytes are broken into 4 small packets (of 150 bytes each) and sent over the network. Now suppose the packets reach the client at different timings with a difference of 0.0001 seconds (eg. one packet arrives at 12.00.0001pm and another packet arrives at 12.00.0002pm, and so on..).
How does socket.recv(1024) decide when to return execution to the program and allow the print() function to execute? Does it return execution immediately after receiving the 1st packet of 150 bytes? Or does it wait for some arbitrary amount of time (eg. 1 second, for which by then all packets would have arrived)? If so, how long is this "arbitrary amount of time"? Who determines it?
Well, that will depend on many things, including the OS and the speed of the network interface. For a 100 gigabit interface, the 100us is "forever," but for a 10 mbit interface, you can't even transmit the packets that fast. So I won't pay too much attention to the exact timing you specified.
Back in the day when TCP was being designed, networks were slow and CPUs were weak. Among the flags in the TCP header is the "Push" flag to signal that the payload should be immediately delivered to the application. So if we hop into the Waybak
machine the answer would have been something like it depends on whether or not the PSH flag is set in the packets. However, there is generally no user space API to control whether or not the flag is set. Generally what would happen is that for a single write that gets broken into several packets, the final packet would have the PSH flag set. So the answer for a slow network and weakling CPU might be that if it was a single write, the application would likely receive the 600 bytes. You might then think that using four separate writes would result in four separate reads of 150 bytes, but after the introduction of Nagle's algorithm the data from the second to fourth writes might well be sent in a single packet unless Nagle's algorithm was disabled with the TCP_NODELAY socket option, since Nagle's algorithm will wait for the ACK of the first packet before sending anything less than a full frame.
If we return from our trip in the Waybak machine to the modern age where 100 Gigabit interfaces and 24 core machines are common, our problems are very different and you will have a hard time finding an explicit check for the PSH flag being set in the Linux kernel. What is driving the design of the receive side is that networks are getting way faster while the packet size/MTU has been largely fixed and CPU speed is flatlining but cores are abundant. Reducing per packet overhead (including hardware interrupts) and distributing the packets efficiently across multiple cores is imperative. At the same time it is imperative to get the data from that 100+ Gigabit firehose up to the application ASAP. One hundred microseconds of data on such a nic is a considerable amount of data to be holding onto for no reason.
I think one of the reasons that there are so many questions of the form "What the heck does receive do?" is that it can be difficult to wrap your head around what is a thoroughly asynchronous process, wheres the send side has a more familiar control flow where it is much easier to trace the flow of packets to the NIC and where we are in full control of when a packet will be sent. On the receive side packets just arrive when they want to.
Let's assume that a TCP connection has been set up and is idle, there is no missing or unacknowledged data, the reader is blocked on recv, and the reader is running a fresh version of the Linux kernel. And then a writer writes 150 bytes to the socket and the 150 bytes gets transmitted in a single packet. On arrival at the NIC, the packet will be copied by DMA into a ring buffer, and, if interrupts are enabled, it will raise a hardware interrupt to let the driver know there is fresh data in the ring buffer. The driver, which desires to return from the hardware interrupt in as few cycles as possible, disables hardware interrupts, starts a soft IRQ poll loop if necessary, and returns from the interrupt. Incoming data from the NIC will now be processed in the poll loop until there is no more data to be read from the NIC, at which point it will re-enable the hardware interrupt. The general purpose of this design is to reduce the hardware interrupt rate from a high speed NIC.
Now here is where things get a little weird, especially if you have been looking at nice clean diagrams of the OSI model where higher levels of the stack fit cleanly on top of each other. Oh no, my friend, the real world is far more complicated than that. That NIC that you might have been thinking of as a straightforward layer 2 device, for example, knows how to direct packets from the same TCP flow to the same CPU/ring buffer. It also knows how to coalesce adjacent TCP packets into larger packets (although this capability is not used by Linux and is instead done in software). If you have ever looked at a network capture and seen a jumbo frame and scratched your head because you sure thought the MTU was 1500, this is because this processing is at such a low level it occurs before netfilter can get its hands on the packet. This packet coalescing is part of a capability known as receive offloading, and in particular lets assume that your NIC/driver has generic receive offload (GRO) enabled (which is not the only possible flavor of receive offloading), the purpose of which is to reduce the per packet overhead from your firehose NIC by reducing the number of packets that flow through the system.
So what happens next is that the poll loop keeps pulling packets off of the ring buffer (as long as more data is coming in) and handing it off to GRO to consolidate if it can, and then it gets handed off to the protocol layer. As best I know, the Linux TCP/IP stack is just trying to get the data up to the application as quickly as it can, so I think your question boils down to "Will GRO do any consolidation on my 4 packets, and are there any knobs I can turn that affect this?"
Well, the first thing you can do is disable any form of receive offloading (e.g. via ethtool), which I think should get you 4 reads of 150 bytes for 4 packets arriving like this in order, but I'm prepared to be told I have overlooked another reason why the Linux TCP/IP stack won't send such data straight to the application if the application is blocked on a read as in your example.
The other knob you have if GRO is enabled is GRO_FLUSH_TIMEOUT which is a per NIC timeout in nanoseconds which can be (and I think defaults to) 0. If it is 0, I think your packets may get consolidated (there are many details here including the value of MAX_GRO_SKBS) if they arrive while the soft IRQ poll loop for the NIC is still active, which in turn depends on many things unrelated to your four packets in your TCP flow. If non-zero, they may get consolidated if they arrive within GRO_FLUSH_TIMEOUT nanoseconds, though to be honest I don't know if this interval could span more than one instantiation of a poll loop for the NIC.
There is a nice writeup on the Linux kernel receive side here which can help guide you through the implementation.
A normal blocking receive on a TCP connection returns as soon as there is at least one byte to return to the caller. If the caller would like to receive more bytes, they can simply call the receive function again.

RAW socket send: packet loss

During the RAW-socket based packet send testing, I found very irritating symptoms.
With a default RAW socket setting (especially for SO_SNDBUF size),
the raw socket send 100,000 packets without problem but it took about 8 seconds
to send all the packets, and the packets are correctly received by the receiver process.
It means about 10,000 pps (packets per second) is achieved by the default setting.
(I think it's too small figure contrary to my expectation.)
Anyway, to increase the pps value, I increased the packet send buffer size
by adjusting the /proc/sys/net/core/{wmem_max, wmem_default}.
After increasing the two system parameters, I have identified the irritating symptom.
The 100,000 packets are sent promptly, but only the 3,000 packets are
received by the receiver process (located at a remote node).
At the sending Linux box (Centos 5.2), I did netstat -a -s and ifconfig.
Netstat showed that 100,000 requests sent out, but the ifconfig shows that
only 3,000 packets are TXed.
I want to know the reason why this happens, and I also want to know
how can I solve this problem (of course I don't know whether it is really a problem).
Could anybody give me some advice, examples, or references to this problem?
Best regards,
bjlee
You didn't say what size your packets were or any characteristics of your network, NIC, hardware, or anything about the remote machine receiving the data.
I suspect that instead of playing with /proc/sys stuff, you should be using ethtool to adjust the number of ring buffers, but not necessarily the size of those buffers.
Also, this page is a good resource.
I have just been working with essentially the same problem. I accidentally stumbled across an entirely counter-intuitive answer that still doesn't make sense to me, but it seems to work.
I was trying larger and larger SO_SNDBUF buffer sizes, and losing packets like mad. By accidentally overrunning my system defined maximum, it set the SO_SNDBUF size to a very small number instead, but oddly enough, I no longer had the packet loss issue. So I intentionally set SO_SNDBUF to 1, which again resulted in a very small number (not sure, but I think it actually set it to something like 1k), and amazingly enough, still no packet loss.
If anyone can explain this, I would be most interested in hearing it. In case it matters, my version of Linux is RHEL 5.11 (yes, I know, I'm a bit behind the times).

General overhead of creating a TCP connection

I'd like to know the general cost of creating a new connection, compared to UDP. I know TCP requires an initial exchange of packets (the 3 way handshake). What would be other costs? For instance is there some sort of magic in the kernel needed for setting up buffers etc?
The reason I'm asking is I can keep an existing connection open and reuse it as needed. However if there is little overhead reconnecting it would reduce complexity.
Once a UDP packet's been dumped onto the wire, the UDP protocol stack is free to completely forget about it. With TCP, there's at bare minimum the connection details (source/dest port and source/dest IP), the sequence number, the window size for the connection etc... It's not a huge amount of data, but adds up quickly on a busy server with many connections.
And then there's the 3-way handshake as well. Some braindead (and/or malicious systems) can abuse the process (look up 'syn flood'), or just drop the connection on their end, leaving your system waiting for a response or close notice that'll never come. The plus side is that with TCP the system will do its best to make sure the packet gets where it has to. With UDP, there's no guarantees at all.
Compared to the latency of the packet exchange, all other costs such as kernel setup times are insignificant.
OPTION 1: The general cost of creating a TCP connection are:
Create socket connection
Send data
Tear down socket connection
Step 1: Requires an exchange of packets, so it's delayed by to & from network latency plus the destination server's service time. No significant CPU usage on either box is involved.
Step 2: Depends on the size of the message.
Step 3: IIRC, just sends a 'closing now' packet, w/ no wait for destination ack, so no latency involved.
OPTION 2: Costs of UDP:*
Create UDP object
Send data
Close UDP object
Step 1: Requires minimal setup, no latency worries, very fast.
Step 2: BE CAREFUL OF SIZE, there is no retransmit in UDP since it doesn't care if the packet was received by anyone or not. I've heard that the larger the message, the greater probability of data being received corrupted, and that a rule of thumb is that you'll lose a certain percentage of messages over 20 MB.
Step 3: Minimal work, minimal time.
OPTION 3: Use ZeroMQ Instead
You're comparing TCP to UDP with a goal of reducing reconnection time. THERE IS A NICE COMPROMISE: ZeroMQ sockets.
ZMQ allows you to set up a publishing socket where you don't care if anyone is listening (like UDP), and have multiple listeners on that socket. This is NOT a UDP socket - it's an alternative to both of these protocols.
See: ZeroMQ.org for details.
It's very high speed and fault tolerant, and is in increasing use in the financial industry for those reasons.

Benefits of "Don't Fragment" on TCP Packets?

One of our customers is having trouble submitting data from our application (on their PC) to a server (different geographical location). When sending packets under 1100 bytes everything works fine, but above this we see TCP retransmitting the packet every few seconds and getting no response. The packets we are using for testing are about 1400 bytes (but less than 1472). I can send an ICMP ping to www.google.com that is 1472 bytes and get a response (so it's not their router/first few hops).
I found that our application sets the DF flag for these packets, and I believe a router along the way to the server has an MTU less than/equal to 1100 and dropping the packet.
This affects 1 client in 5000, but since everybody's routes will be different this is expected.
The data is a SOAP envelope and we expect a SOAP response back. I can't justify WHY we do it, the code to do this was written by a previous developer.
So... Are there any benefits OR justification to setting the DF flag on TCP packets for application data?
I can think of reasons it is needed for network diagnostics applications but not in our situation (we want the data to get to the endpoint, fragmented or not). One of our sysadmins said that it might have something to do with us using SSL, but as far as I know SSL is like a stream and regardless of fragmentation, as long as the stream is rebuilt at the end, there's no problem.
If there's no good justification I will be changing the behaviour of our application.
Thanks in advance.
The DF flag is typically set on IP packets carrying TCP segments.
This is because a TCP connection can dynamically change its segment size to match the path MTU, and better overall performance is achieved when the TCP segments are each carried in one IP packet.
So TCP packets have the DF flag set, which should cause an ICMP Fragmentation Needed packet to be returned if an intermediate router has to discard a packet because it's too large. The sending TCP will then reduce its estimate of the connection's Path MTU (Maximum Transmission Unit) and re-send in smaller segments. If DF wasn't set, the sending TCP would never know that it was sending segments that are too large. This process is called PMTU-D ("Path MTU Discovery").
If the ICMP Fragmentation Needed packets aren't getting through, then you're dealing with a broken network. Ideally the first step would be to identify the misconfigured device and have it corrected; however, if that doesn't work out then you add a configuration knob to your application that tells it to set the TCP_MAXSEG socket option with setsockopt(). (A typical example of a misconfigured device is a router or firewall that's been configured by an inexperienced network administrator to drop all ICMP, not realising that Fragmentation Needed packets are required by TCP PMTU-D).
The operation of Path-MTU discovery is described in RFC 1191, https://www.rfc-editor.org/rfc/rfc1191.
It is better for TCP to discover the Path-MTU than to have every packet over a certain size fragmented into two pieces (typically one large and one small).
Apparently, some protocols like NFS benefit from avoiding fragmentation (link text). However, you're right in that you typically shouldn't be requesting DF unless you really require it.

Raw Socket Receive Buffer

We are currently testing a Telecom application over IP. We open a Raw Socket and receives messages from the remote side (msgrate#750+msgs/second approx size of 180 bytes excluding IP).
On top of the Raw socket sits a layer called SCTP (just like TCP) which is indicating every now and then that it is missing some packets. Now, we are running Wireshark on the receive node and we can see that packet in Wireshark.
It looks to me that the receive buffer of the socket is small causing IP(?) to drop messages. However, IP Pegs(netstat -sv) show NO dropped packets. We have tried setting the socket receive queue to 40000 without any success.
I would appreciate any pointers as to what option, if any, of IP layer should we be configuring or is there any specific socket option that we need to set.
Thanks for your inputs. However, we have been able to "solve" this problem.
Earlier, I described how we read messages.
Once select returns, we run a loop (to the tune of number of Raw Messages to read which was >1 in our case).
1) we call ioctl(FIONREAD) to find the number of bytes to read;
2) read that many bytes by calling recvfrom
3) send the bytes upto the user
4) go into loop again and call ioctl(FIONREAD) and then repeat the steps
However, at point 4, ioctl(FIONREAD) use to return 0. Our code had a defensive check. It was expecting, a 0 bytes from ioctl(FIONREAD) means that the sender has send an IP header with 0 payload. Therefore, it use to call recvfrom(bytes to read=0) to flush out the IP header lest the select will set again on this.
At time t0, ioctl(FIONREAD) returns 0 as number of bytes to read
At time t1, recvfrom(bytes to read=0) is called.
Sometimes, between t0 and t1, actual data use to get queued in the socket receive queue and use to get discarded as we were calling recvFrom(bytes=0).
Setting, the number of rawMsgsToRead=1 has "solved" this problem. However, my guess is it will impact our performance. Is their any ioctl call which can differentiate between octets in the queue as 0 and IP header with payload 0
I have a few questions and a few things to think about.
1) Which implementation of SCTP are you using and on which OS. Some SCTP implementations are more robust than others.
2) Is SCTP negatively acknowledging the dropped packets? Search for a gap acks in wireshark.
3) And where you see the dropped packets in wireshark are you sure that these are not retransmissions?
4) Where in the system is wireshark monitoring? If it is not on the same wire as your application then it may be seeing messages which your application doesn't.
5) What exactly is the indication SCTP is giving?
If you believe that the IP socket rx buffer is overflowing then you could consider reducing the size of the SCTP RX window; this is often configurable in sctp stacks. The Rx window limits the amount of data that can be outstanding waiting for acknowledgement and consequently restricts the amount of data which could be in the IP buffer.
You could also try raising the priority of your SCTP task so that it more quickly reads messages out of the IP buffers (This may be the easiest thing to try and in my opinion a good thing to do).
Regards