How deterministic are packet sizes when using TCP?

How deterministic are packet sizes when using TCP? - sockets

Recently we ran into what looked like a connectivity issue when a particular customer of ours installed our product. We ultimately traced it to a low MTU (~1300 bytes) being configured on one of the devices in the network. In this particular deployment, we had two Windows machines running our application communicating with one another, and their link MTUs were set at 1500.
One thing that made this particularly difficult to troubleshoot, was that our application would work fine during the handshake phase (where only small requests are sent), but would sometimes fail sending a specific request of size ~4KB across the network. If it makes a difference, the application is written in C# and these are WCF messages.
What could account for this indeterminism? I would have expected this to always fail, as the message size we were sending was always larger than the link MTU perceived by the Windows client, which would lead to at least one full 1500-byte packet, which would lead to problems. Is there something in TCP that could make it prefer smaller packets, but only sometimes?
Some other things that we thought might be related:
1) The sockets were constantly being set up and torn down (as the application received what it interpreted as a network failure), so this doesn't appear to be related to TCP slow start.
2) I'm assuming that WCF "quickly" pushes the entire 4KB message to the socket, so there's always something to send that's larger than 1500 bytes.
3) Using WireShark, I didn't spot any TCP retransmissions which might explain why only subsets of the buffer were being sent.
4) Using WireShark, I saw a single 4KB IP packet being sent, which perhaps indicates that TCP Segment Offloading is being implemented by the NIC? (I'm not sure how TSO would look on WireShark). I didn't see in WireShark the 4KB request being broken down to multiple IP packets, in either successful or unsuccessful instances.
5) The customer claims that there's no route between the two Windows machines that circumvents the "problematic" device with the small MTU.
Any thoughts on this would be appreciated.

Related

TCP or UDP for lots of connections?

I want to create a P2P network with the following characteristics:
low latency is not really important
loosing packages is okay
the nodes would only send tiny amounts of data around
there will be no NAT/firewall issues, every node has an open port on its public ip
every node is connected to every other node
Usually I would use TCP for anything not time-critical but the last requirements causes the nodes to have lots of open connections for a long time. If I remember correctly, using TCP to connect to 1000 servers would mean I had to use 1000 ports to handle these connections. UDP on the other hand, would only require a single port for each node.
So my question is: Is TCP able to handle the above requirements in a network with e.g. 1000 nodes without tweaking the system? Would UDP be better suited in this case? Is there anything else that would be a deal-breaker for either protocol?

With UDP you control the "connection state" and it is pretty much the best way to do anything peer to peer related IF you have a high number of nodes or care about bandwidth, memory and CPU overhead. By moving all the control to your application in regards to the "connection state" of each node you minimize the amount of wasted resources by making it fit your needs exactly.
You will bypass a lot of operating system specific weirdness that limits the effectiveness of TCP with high numbers of connections. There is TIME_WAIT bloat and tens to hundreds of OS specific settings which will need tweaking for every user of your P2P app if it needs those high numbers. A test app I made which allowed you to use UDP with ack or TCP showed only a 10% difference in performance regardless of operating system using UDP. TCP performance was always lower than the best UDP and its performance varied wildly by over 600% depending upon the OS. With tweaks you can make most OS perform roughly the same using TCP but by default most are not properly tweaked.
So in my opinion it is harder to make a reliable UDP P2P network compared to TCP but it is often needed. However I would only advise that route it if you were quite experienced with networking as there are a lot of "gotchas" to deal with. There are libraries which help with this like Raknet or Enet. They provide ways to do reliable UDP but it still takes a higher amount of networking knowledge to know how this all ties in together, whereas with TCP it is mostly hidden from you.
In a peer to peer network you often have messages like NODE PINGs that you may not care if each one is always received, you just care if you have received one recently. ie You may send a ping every 10 seconds, and disconnect the node after 60 seconds of no ping. This would mean you would need 6 ping packets in a row to fail, which is highly unlikely unless the node is really down. If you received even one ping in that 60 second period then the node is still active. A TCP implementation of this would have involved more latency and bandwidth as it makes sure EACH ping message gets through and will block any other data going out until it does. And since you cannot rely on TCP to reliably tell you if a connection is dead, you are forced to add similar PING features for TCP, on top of all the other things TCP is already doing extra with your packets.
Games also often have data that if its not received by a client it is no big deal because there are more packets coming in a few milliseconds which will invalidate any missed packets. ie Player is moving from A to Z over a time span of 1 second, his client sends out each packet, roughly 40 milliseconds apart ABCDEFG__I__KLMNOPQRSTUVWXYZ Do we really care if we miss "H and J" since every 40ms we are receiving updates? Not really, this is where prediction can come into it, but this is usually not relevant to most P2P projects. If that was TCP instead of UDP then it would have increased bandwidth requirements and added latency to the rest of the packets being received as the data will be resent until it arrives, on top of the extra latency it is already adding by acking everything.
Essentially you can lower latency and network overhead for many messages in a peer to peer network using UDP. However there will always be some messages which NEED to be sent reliably and that requires you to basically implement some reliable way to get packets to that node, similar to that of TCP. And this is where you need some level of expertise if you want a reliable peer to peer network. Some things to look into include sequencing packets with a number, message ACKs, etc.
If you care a lot about efficiency or really need tens of thousands of connections then implementing your specific protocol in UDP will always be better than TCP. But there are cases to be made for TCP, like if the time to make the project matters or if you are a new to network programming.

If I remember correctly, using TCP to connect to 1000 servers would mean I had to use 1000 ports to handle these connections.
You remember wrong.
Take a web server which is listening on port 80 and can handle 1000s of connections at the same time on this single port. This is because a connection is defined by the tuple of {client-ip,client-port,server-ip,server-port}. And while server-ip and server-port are the same for all connections to this server the client-ip and client-port are not. Even if the client-ip is the same (i.e. same client) the client would pick a different source port.
... with e.g. 1000 nodes without tweaking the system?
This depends on the system since each of the open connections needs to preserve the state and thus needs memory. This might be a problem for embedded systems with only little memory.
In any case: if your protocol is just sending small messages and if packet loss, reordering or duplication are acceptable than UDP might be the better choice because the overhead (connection setup, ACK..) is smaller and it takes less memory. You could also use a single socket to exchange data with all 1000 nodes whereas with TCP you would need a separate socket for each connection (socket is not the same as port!). Using only a single socket might allow for a simpler application design.

I want to amend the answer by Steffen with a few points:
1000 connections are nothing for any normal computer and OS.
UDP fits your requirements. It might be easier to program because it is message oriented. TCP provides a stream of bytes. You need to layer a messaging protocol on top of that which is not that easy. Also, you need to handle broken TCP connections by reconnecting.
Ports are not scarce. No problem with consuming 1000 ports.

Does listen() backlog affect established TCP connections?

Would it be naive to create a TCP socket with a listen backlog set to minimum as a way of rate limiting new incoming connections? The server workload in question doesn't expect many new connections at any time but spends a lot of time servicing long open persistent connections. It appears that new incoming connections shouldn't affect established connections, though I've been unable to find any definitive answer in any text. Is it possible for failed new incoming connections to create some kind of TCP traffic congestion on the server with the packets it's receiving or are they dropped fast enough that it has no effect on any buffers or other part of the network stack?
Specifically the platform in use is Linux, and although it may be handled differently in different OSs, I expect them to all behave roughly the same.
EDIT What I mean by the "same" is that backlog doesn't affect established connections, though I do understand Linux discards them while Windows sends a reset.

Does listen() backlog affect established TCP connections?
It affects established connections that the server hasn't accepted yet via accept(), only in the sense that it limits the number of such connections that can exist.
Would it be naive to create a TCP socket with a listen backlog set to minimum as a way of rate limiting new incoming connections?
All it would accomplish would be to unnecessarily fail some connecting clients. They won't get any service until your server gets around to it anyway, and once the backlog queue fills they are rate-limited by your service code anyway. There is no particular reason why shortening the queue would have any beneficial effect. The other problem with the idea is that it isn't readily possible to determine what the minimum actually is, or whether you succeeded in setting it as the backlog queue length.
It appears that new incoming connections shouldn't affect established connections, though I've been unable to find any definitive answer in any text.
That is correct. There is no reason why it should affect them: that's why you won't find it written down anywhere, any more than the fact that the phase of the moon doesn't affect it either.
Is it possible for failed new incoming connections to create some kind of TCP traffic congestion on the server with the packets it's receiving
No.
or are they dropped fast enough that it has no effect on any buffers or other part of the network stack?
They're not dropped. They simply aren't even created if they won't fit on the backlog queue. Ergo their resource consumption at the server is zero.
Specifically the platform in use is Linux, and although it may be handled differently in different OSs, I expect them to all behave roughly the same.
They don't. On Windows, an incoming connection when the backlog queue is full causes an RST to be issued. On other platforms it is simply ignored.

What you describe are several types of attacks like flooding, syn attacks and other goodies resulting in denial of service.
This topic is not easy, because protection has to be implemented in all the layers, including TCP. For instance a SYN attack, fiddling with the sequence numbers, ... . At that point the packet in question already came a long way, through the ethernet layer and ip layer, bottom line it is taking resources. So if your system is under attack, the attacking packets are in your data stream just like the good ones are. The faster you can detect a packet is faulty and drop it, the better. Usually a system that is under attack will be slower. Well at least the systems that I have worked with.
Some attacks try to bring your system in a faulty state permanently, this by exploiting bugs. For instance TCP has a receive queue, if packets are constantly arriving out of order they will be stored in that receive queue. If the missing packet never arrives, then this receive queue could keep on growing and growing. Without the proper defense , this would lead to the system going completely out of resources.
There are specialised tools (codenumicon for instance) to check the vulnerability of a TCP stack implementation. You can assume that the one on linux has been properly tested using similar tools.
An attack can also occur on the application layer. If you have a TCP server and it allows only a limited amount of sessions. A malicious user can simply take all the connections simply by establishing all the connections and then not doing anything with it. So you have to create some defense as well. Weather or not you set this limit very low or high does not change a thing. A malicious user will try anything to bring your system down. You need to built in defense anyway. You can connect to a webserver (HTTP) simply using telnet. If you don't send anything the server's defense will come into play and close the connection.
So bringing the amount of possible connections to a low value and thinking that this in itself is a form of protection is indeed naive.
Is it possible for failed new incoming connections to create some kind of TCP traffic congestion on the server with the packets it's receiving or are they dropped fast enough that it has no effect on any buffers or other part of the network stack?
They are using resources of your machine and will make your system run slower.
It appears that new incoming connections shouldn't affect established connections, though I've been unable to find any definitive answer in any text.
If it is normal user trying to establish a connection, even if he is doing it continuously, retrying upon failure. The influence will be minimal, close to nothing. But a malicious user that is flooding connections attempts will have influence on the system performance, because the system has to spent time identifying those flawed packets and dropping them asap.

UDP packet loss at the OS level: Windows and Linux, but only with many clients running

I've been investigating an issue where we have a single UDP server sending to multiple clients. The server is sending data out on a multicast channel and port. The clients are running on the same machine and each client opens a socket to the same port as every other client.
We stagger the start of the clients. When we reach a certain number of clients, say 10, we start seeing packet drops. We've eliminated the NIC as the issue using various monitoring tools and the socket buffer size is several times larger than the message size. Our sending interval is quite large (five seconds) and the clients do nothing with the data so the rate of consumption is a non-factor. As the title says we've reproduced the issue on both Windows Server 2008 and Linux (not sure about the version).
Our current theory is that the 10th client puts too much load on the OS which is copying all this data to each socket. The thing is we're only sending 500,000 bytes every five seconds, which doesn't seem like much at all.
Mostly I'm posting here in the hopes that someone has seen a similar problem. I was pointed to this hotfix in my search but it did not solve the issue. Any resources for investigating the details of the OS internals which handle network traffic would be appreciated. Unfortunately I lack this kind of domain knowledge and it has been difficult to find good and detailed reading material on the subject.

Best socket options for client and sever that continuously transfer data

I am using Java (although I think the socket options is implement in most languages) to implement a client and server. The server sends data to the client for processing which the client acknowledges. On another port the client then sends the results of the processing back to the server. When it comes to options such as
SO_LINGER
SO_KEEPALIVE
SO_NODELAY
SO_REUSEADDRESS
SO_SENDBUFFER
SO_RECBUFFER
TCP_NODELAY
We have noticed that the connection between the client and server occasionally breaks. There will be a timeout on the send or the receive. When this happens will kill the socket and open a new one to continue.
What would be the best options to set in terms of the above scenario and is there anything that we could do from our side (programmatically or options-wise) to try minimize the amount of times the connection is dropped. We are using normal TCP/IP.
UPDATE:
The bounty on this ends soon. I haven't had a satisfactory answer yet so it is still open. I think everyone is missing the point of the quest. What is the best practice with regards to the options above for sockets that continuously chat. I have already got a ping packet in that if there is no work to be done (hardly ever the scenario) the normal message is sent with no inner elements so there is always processing.

Strictly speaking, you don't need any of these socket options:
* SO_LINGER
You need to set SO_LINGER only if your application still has outstanding packets to send when close(2) or shutdown(2) has been called. Not really applicable for your application.
* SO_KEEPALIVE
Sending keepalive-pings every two hours would really only help very long-lived but -very- quiet connections going through stateful firewalls with very long session timeouts. (Two hours between pings is entirely too long to be practical in today's Internet.)
* SO_NODELAY
This (presumably an alias for TCP_NODELAY) disables Nagle's algorithm, which is just a small-packet-avoidance problem. Perhaps Nagle is getting in the way in your application, but it takes special sequences of packets to introduce 500ms delays into processing; it never just hangs connections.
* SO_REUSEADDRESS
Useful for all 'servers' that listen on well-known port numbers; use on 'clients' is almost always covering up some bug or other, but it is sometimes necessary if requests must come from a well-known port number.
* SO_SENDBUFFER
* SO_RECBUFFER
These buffer sizes influence the kernel-side buffer sizes maintained for receiving or sending data while your program (receive buffer) or the socket (send buffer) isn't yet ready to accept more data. If these are set too small, your application might not transfer data as smoothly as possible, reducing throughput, but it should not lead to any stalls if these are set smaller than optimal. Of course, too large may put unreasonable demands on kernel memory, but there should be a reasonable system-wide maximum allowed size.
* TCP_NODELAY
Disables Nagle. Not likely to do more than introduce 500ms delays if your application sends multiple small packets before attempting a blocking read.
Really, you shouldn't need to set any socket options.
Can you distill your code into something that could be pasted here and tested or inspected? I'm used to TCP sessions surviving for days or weeks without trouble, so this is pretty surprising.

First I think that this page is relevant, regarding half-open connections.
http://nitoprograms.blogspot.com/2009/05/detection-of-half-open-dropped.html
That being said, TCP is designed to hide connection problems, so you may often find yourself in cases where the connection is broken, but neither side thinks it is. You have addressed this partially by using timeouts and taking that as a sign the connection is broken.
Since you are writing the client and server, I would avoid relying on TCP to tell you when the connection is broken altogether. I would just have the server also acknowledge the receipt of the result from the client. Then both sides will expect immediate responses to their messages, and you can track which messages have been ack'd and set an appropriately small timeout for receiving the ack. This is not a timeout on the send or receive, but a timeout on the time between sending a message and receiving the ack for that message. Then you can set the timeout appropriately depending on the quality of your connection (e.g. very small if you are running on loopback, but large if running over wireless with a weak signal).
Regarding the options you list, you will want to use SO_REUSEADDRESS so that you won't be prevented from reopening the socket, for example if it hasn't finished closing from a previously killed process.

You probably have, but it is best to check the obvious....
Have you verified that it IS the socket that is timing out, and not your code? Sockets are fairly stable, and while there might be an issue somewhere, it seems more likely that it is in your code. I would use logs, timestamps, and synchronised clocks to be sure.
There may be an issue that you genuinely DO take a long time to do the calculation, so maybe adding a 'I'm still thinking about it' message to your protocol that gets sent regularly, to keep the connection alive?
Of course networks will drop out from time to time regardless of what you do, and it sounds like you are already handling that case nicely.

try these options
SO_LINGER - for specyfying when the Socket close s called while some unsent data in the queue
TCP_NODELAY - For non blocking datat transfer

I would strongly encourage you to use a ping/echo model between client and server, so that if no data is sent for x seconds a ping message needs to be send. A typical reason for a break might be a firewall, which shuts down socketss because of inactivity.
The typical issue where the TCP model fails are physical problems e.g. a pulled/broken cable and hangs on one side, where technically someone is listening until a queue overrun kicks in (which might never happen given your amount of data).

What are the chances the connection is going through a NAT firewall somewhere along the way? Stateful firewalls maintain a table of open connections so that packets belonging to an allowed connection can quickly pass through the system, without forcing firewall admins to write overly-complex rule sets.
The downside is that this table can grow immensely large, so it must be pruned as connections are closed or as they appear to have simply grown stale and died quietly. A connection that has gone silent for 20 minutes is usually quiet enough to reaped. (Which is really very quick, as the TCP KEEPALIVE is typically two hours, making it nearly useless in the face of NAT firewalls.)
So: is this going through a NAT firewall? Is the connection quiet for long stretches? If so, add a ping/pong to your protocol, and fire it every few minutes.

UDP Server Discovery - Should clients send multicasts to find server or should server send regular beacon?

I have clients that need to all connect to a single server process. I am using UDP discovery for the clients to find the server. I have the client and server exchange IP address and port number, so that a TCP/IP connection can be established after completion of the discovery. This way the packet size is kept small. I see that this could be done in one of two ways using UDP:
Each client sends out its own multicast message in search of the server, which the server then responds to. The client can repeat sending this multicast message in regular intervals (in the case that the server is down) until the server responds.
The server sends out a multicast message beacon at regular intervals. The clients subscribe to the multicast group and in this way receives the server's multicast message and complete the discovery.
In 1. if there are many clients then initially there would be many multicast messages transmitted (one from each client). Only the server would subscribe and receive the multicast messages from the clients. Once the server has responded to the client, the client ceases to send out the multicast message. Once all clients have completed their discovery of the server no further multicast messages are transmitted on the network. If however, the server is down, then each client would be sending out a multicast message beacon in intervals until the server is back up and can respond.
In 2. only the server would submit a multicast message beacon in regular intervals. This message would end up getting routed to all clients that are subscribed to the multicast group. Once the clients receive the packet the client's UDP listening socket gets closed and they are no longer subscribed to the multicast group. However, the server must continue to send the multicast beacon, so that new clients can discover it. It would continue sending out the beacon at regular intervals regardless of whether any clients are out their requiring discovery or not.
So, I see pros and cons either way. It seems to me that #1 would result in heavier load initially, but this load eventually reduces down to zero. In #2 the server would continue sending out a beacon forever.
UDP and multicast is a fairly new topic to me, so I am interested in finding out which would be the preferred approach and which would result in less network load.

I've used option #2 in the past several times. It works well for simple network topologies. We did see some throughput problems when UDP datagrams exceeded the Ethernet MTU resulting in a large amount of fragmentation. The largest problem that we have seen is that multicast discovery breaks down in larger topologies since many routers are configured to block multicast traffic.
The issue that Greg alluded to is rather important to consider when you are designing your protocol suite. As soon as you move beyond simple network topologies, you will have to find solutions for address translation, IP spoofing, and a whole host of other issues related to the handoff from your discovery layer to your communications layer. Most of them have to do specifically with how your server identifies itself and ensuring that the identification is something that a client can make use of.
If I could do it over again (how many times have we uttered this phrase), I would look for standards-based discovery mechanisms that fit the bill and start solving the other protocol suite problems. The last thing that you really want to do is come up with a really good discovery scheme that breaks the week after you deploy it because of some unforeseen network topology. Google service discovery for a starting list. I personally tend towards DNS-SD but there are a lot of other options available.

I would recommend method #2, as it is likely (depending on the application) that you will have far more clients than you will servers. By having the server send out a beacon, you only send one packet every so often, rather than one packet for each client.
The other benefit of this method, is that it makes it easier for the clients to determine when a new server becomes available, or when an existing server leaves the network, as they don't have to maintain a connection to each server, or keep polling each server, to find out.

Both are equally viable methods.
The argument for method #1 would be that in normal principle, clients initiate requests, and servers listen and respond to them.
The argument for method #2 would be that the point of multicast is so that one host can send a packet and it can be received by many clients (one-to-many), so it's meant to be the reverse of #1.
OK, as I think about this I'm actually drawn to #2, server-initiated beacon. The problem with #1 is that let's say clients broadcast beacons, and they hook up with the server, but the server either goes offline or changes its IP address.
When the server is back up and sends its first beacon, all the clients will be notified at the same time to reconnect, and your entire system is back up immediately. With #1, all of the clients would have to individually realize that the server is gone, and they would all start multicasting at the same time, until connected back with the server. If you had 1000 clients and 1 server your network load would literally be 1000x greater than method #2.
I know these messages are most likely small, and 1000 packets at a time is nothing to a UDP network, but just from a design standpoint #2 feels better.
Edit: I feel like I'm developing a split-personality disorder here, but just thought of a powerful point of why #1 would be an advantage... If you ever wanted to implement some sort of natural load balancing or scaling with multiple servers, design #1 works well for this. That way the first "available" server can respond to the client's beacon and connect to it, as opposed to #2 where all the clients jump to the beaconing server.

Your option #2 has a big limitation in that it assumes that the server can communicate more or less directly with every possible client. Depending on the exact network architecture of your operational system, this may not be the case. For example, you may be depending that all routers and VPN software and WANs and NATs and whatever other things people connect networks together with, can actually handle the multicast beacon packets.
With #1, you are assuming that the clients can send a UDP packet to the server. This is an entirely reasonable expectation, especially considering the very next thing the client will do is make a TCP connection to the same server.
If the server goes down and the client wants to find out when it's back up, be sure to use exponential backoff otherwise you will take the network down with a packet storm someday!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse