recv - When does it start to discard unprocessed packets? - sockets

We are using recv(2) in order to listen to incoming ICMP packets in a network management application in which there is a status poll feature to ping all managed hosts.
In a production environment, the number of managed hosts can become pretty large (1,000+), and at the moment, when said status poll is performed, we send out all the ping requests sequentially with no delay in between.
Obviously, this will lead to many ping replies being sent back almost simultaneously, and it appears that our dispatching routine cannot keep up. As a result, packets seem to be dropped and the ping replies are never actually received by our application. We believe so because many hosts are falsely detected as being "unavailable".
The dispatcher does nothing more than add incoming replies to an "inbox" of sorts, which is processed later by another thread, i.e. it does not take much time and probably cannot be optimized any further.
Is it possible that the internal packet buffer used by recv (in the kernel? in the network hardware?) fills up and starts dropping packets? If so, is there a good way to determine a reasonable maximum number of simultaneous pings that can be performed safely (i.e. by getting that buffer size from the OS)?
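For reference on the "getting that buffer size from the OS" part: each socket has a kernel-side receive buffer whose size is exposed through the standard SO_RCVBUF socket option, and datagrams that arrive while it is full are dropped. A minimal Python sketch, assuming a UDP socket as a stand-in (a raw ICMP socket needs elevated privileges but takes the same option); the helper name `receive_buffer_size` and the requested size are illustrative only:

```python
import socket

def receive_buffer_size(sock, requested=None):
    """Return the kernel receive buffer size of sock in bytes.

    If requested is given, first ask the OS to resize the buffer. The OS
    may clamp or round the value (Linux, for example, doubles it).
    """
    if requested is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print("default:", receive_buffer_size(s))
        print("after request:", receive_buffer_size(s, 1 << 20))
```

Dividing the reported size by the size of one reply gives only a rough upper bound on how many unread replies can queue up, since the kernel may also charge per-packet bookkeeping overhead against the buffer.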

Related

How to delete duplicate UDP packets from the send queue

I have a UDP implementation with a facility to get acknowledgements back from the server. The client re-sends packets for which no acknowledgement is received from the server within a specified time. A client sends around 10 packets while waiting for the server's acknowledgement of the 1st packet. It then re-sends the packets for which no acknowledgement was received. This works fine in the normal scenario with minor network delay.
The real issue occurs on a low-bandwidth connection where the round-trip delay is significant. The client keeps adding packets to the send queue based on acknowledgement timeouts. This results in many duplicate packets being added to the queue.
I tried to find an elegant solution to avoid duplicate packets in the send queue, with no luck. Any help will be appreciated.
If I could mark a packet, or set a property on it, such that it is removed from the queue if it is not sent within NN ms, then I could build an algorithm around it.
UDP has no builtin duplicate detection as is the case with TCP. This means any kind of such detection has to be done by the application itself. Since the only way an application can interact with the send queue is to send datagrams any kind of duplicate detection on the sender side has to be done before the packet gets put into the send queue.
How you figure out at this stage whether this is really a duplicate of a previous packet that should not be sent, or a duplicate that should be sent because the original one got lost, is fully up to the application. And any "...not sent within NN ms..." logic has to be implemented in the application too, with timers or similar. You might additionally try to get more control of the queue by reducing the size of the send queue with SO_SNDBUF.
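As a rough illustration of doing that check in the application before the datagram reaches the send queue, here is a minimal Python sketch. The class name `ResendGate`, the use of sequence numbers as keys, and the injectable clock are all assumptions made for the example, not part of any library:

```python
import time

class ResendGate:
    """Decide, before calling send(), whether a datagram should be queued again."""

    def __init__(self, min_resend_interval, clock=time.monotonic):
        self.min_resend_interval = min_resend_interval  # seconds
        self.clock = clock
        self.last_sent = {}  # sequence number -> time of last send attempt

    def should_send(self, seq):
        """True if seq was never sent or its last attempt is old enough."""
        now = self.clock()
        last = self.last_sent.get(seq)
        if last is not None and now - last < self.min_resend_interval:
            return False  # a copy is probably still queued or in flight
        self.last_sent[seq] = now
        return True

    def acknowledged(self, seq):
        """Forget seq once the server has acknowledged it."""
        self.last_sent.pop(seq, None)
```

The caller would invoke `should_send(seq)` from its retransmit timer and only hand the datagram to the socket when it returns True. On a slow link, `min_resend_interval` would have to be at least as long as the observed round-trip time.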

How deterministic are packet sizes when using TCP?

Recently we ran into what looked like a connectivity issue when a particular customer of ours installed our product. We ultimately traced it to a low MTU (~1300 bytes) being configured on one of the devices in the network. In this particular deployment, we had two Windows machines running our application communicating with one another, and their link MTUs were set at 1500.
One thing that made this particularly difficult to troubleshoot was that our application would work fine during the handshake phase (where only small requests are sent), but would sometimes fail when sending a specific request of size ~4KB across the network. If it makes a difference, the application is written in C# and these are WCF messages.
What could account for this indeterminism? I would have expected this to always fail, as the message size we were sending was always larger than the link MTU perceived by the Windows client, which would lead to at least one full 1500-byte packet, which would lead to problems. Is there something in TCP that could make it prefer smaller packets, but only sometimes?
Some other things that we thought might be related:
1) The sockets were constantly being set up and torn down (as the application received what it interpreted as a network failure), so this doesn't appear to be related to TCP slow start.
2) I'm assuming that WCF "quickly" pushes the entire 4KB message to the socket, so there's always something to send that's larger than 1500 bytes.
3) Using Wireshark, I didn't spot any TCP retransmissions which might explain why only subsets of the buffer were being sent.
4) Using Wireshark, I saw a single 4KB IP packet being sent, which perhaps indicates that TCP Segmentation Offload (TSO) is being performed by the NIC? (I'm not sure how TSO would look in Wireshark.) I didn't see the 4KB request being broken down into multiple IP packets in Wireshark, in either successful or unsuccessful instances.
5) The customer claims that there's no route between the two Windows machines that circumvents the "problematic" device with the small MTU.
Any thoughts on this would be appreciated.
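To make the expectation above concrete, the arithmetic can be sketched in Python. This assumes IPv4 and TCP headers of 20 bytes each with no options, a payload of exactly 4096 bytes, and a sender that always fills each segment (which a real TCP stack is not obliged to do); the function names are made up for the example:

```python
IP_HEADER = 20   # bytes, IPv4 without options (assumption)
TCP_HEADER = 20  # bytes, TCP without options (assumption)

def packet_sizes(payload_len, mtu):
    """Split payload_len bytes into full-MSS segments; return IP packet sizes."""
    mss = mtu - IP_HEADER - TCP_HEADER
    sizes = []
    remaining = payload_len
    while remaining > 0:
        segment = min(mss, remaining)
        sizes.append(segment + IP_HEADER + TCP_HEADER)
        remaining -= segment
    return sizes

def oversized(payload_len, link_mtu, path_mtu):
    """Packets that would not fit through a hop with the smaller path_mtu."""
    return [size for size in packet_sizes(payload_len, link_mtu) if size > path_mtu]
```

Under these assumptions `packet_sizes(4096, 1500)` gives `[1500, 1500, 1216]`, so two of the three packets exceed a 1300-byte hop. Any send that happens to be split into smaller segments would not hit the limit.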

Game server TCP networking sockets - fairness

I'm writing a game server for a turn-based game. One criterion is that the game needs to be as fair for all players as possible.
So far it works like this:
* Each client has a TCP connection. (If relevant, the connection is opened via WebSockets.)
* While running, continually check for incoming socket messages via epoll.
* Iterate through clients with sockets ready to read:
    * Read all messages from the client.
    * Update the internal game state for each message.
    * Queue outgoing messages to affected clients.
* At the end of each "window" (turn):
    * Iterate through clients and write all queued outgoing messages to their sockets.
My concern for fairness raises the following questions:
Does it matter in which order I send messages to the clients?
Calling write() on all the sockets takes only a fraction of a second for my program, but somewhere in the underlying OS or networking would it make a difference if I sorted the client list?
Perhaps I should be sending to the highest-latency clients first?
Does it matter how I write the outgoing messages to the sockets?
Currently I'm writing them as one large chunk. The size can exceed a single packet.
Would it be faster for the client to begin its processing if I sent messages in smaller chunks than 1 packet?
Would it be better to write 1 packet worth to each client at a time, and iterate over the clients multiple times?
Are there any linux/networking configurations that would bear impact here?
Thanks in advance for your feedback and tips.
Does it matter in which order I send messages to the clients?
Yes, by fractions of milliseconds. If the network interface is available for sending, the OS will immediately start sending. Why would it wait?
Perhaps I should be sending to the highest-latency clients first?
I think you should be sending in random order. Shuffle the list prior to sending. This makes it fair. I think your question is valid and this should be addressed.
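A minimal Python sketch of the shuffle-before-send idea. The `send` callback and the dict of queued bytes are hypothetical stand-ins for whatever the server uses to write to a client's socket:

```python
import random

def flush_in_random_order(clients, send, rng=random):
    """Send each client's queued data, visiting clients in a shuffled order.

    clients maps a client id to its queued bytes; send(client_id, data) is a
    hypothetical callback that writes to that client's socket.
    Returns the order used, which is handy for logging.
    """
    order = list(clients)
    rng.shuffle(order)  # new random order every turn
    for client_id in order:
        send(client_id, clients[client_id])
    return order
```

Calling this once per turn means no client is systematically first or last in the write loop.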
Currently I'm writing them as one large chunk. [...]
First, realize that TCP is stream-based and that there are no packets/messages at the protocol level. On a physical level data is indeed packetized.
It is not necessary to manually split off packets because clients will read data as it arrives anyway. If a client issues a read, that read will complete immediately once the first packet has arrived. There is no artificial waiting in the OS.
Are there any linux/networking configurations that would bear impact here?
I don't know. Be sure to disable Nagle's algorithm (TCP_NODELAY).
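Disabling Nagle is a single socket option. A small Python sketch (the helper name `disable_nagle` is made up here; the option itself is the standard TCP_NODELAY):

```python
import socket

def disable_nagle(sock):
    """Turn on TCP_NODELAY so small writes are sent without Nagle's delay.

    Returns True if the option reads back as enabled.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

On a server this would be applied to each accepted client socket, not to the listening socket alone.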

Ensuring send() data delivered

Is there any way of checking if data sent using winsock's send() or WSASend() are really delivered to destination?
I'm writing an application that talks to a third-party server, which sometimes goes down after working for some time, and I need to know whether messages sent to that server were delivered or not. The problem is that a call to send() sometimes finishes without error even if the server is already down, and only the next send() finishes with an error, so I have no idea whether the previous message was delivered or not.
I suppose on TCP layer there is information if certain (or all) packets sent were acked or not, but it is not available using socket interface (or I cannot find a way).
Worst of all, I cannot change the code of the server, so I can't get any delivery confirmation messages.
I'm sorry, but given what you're trying to achieve, you should realise that even if the TCP stack COULD give you an indication that a particular set of bytes has been ACK'd by the remote TCP stack it wouldn't actually mean anything different to what you know at the moment.
The problem is that unless you have an application level ACK from the remote application which is only sent once the remote application has actioned the data that you have sent to it then you will never know for sure if the data has been received by the remote application.
'but I can assume it's close enough'
is just delusional. You may as well make that assumption if your send completes as it's about as valid.
The issue is that even if the TCP stack could tell you that the remote stack had ACK'd the data (1) that's not the same thing as the remote application receiving the data (2) and that is not the same thing as the remote application actually USING the data (3).
Given that the remote application COULD crash at any point, 1, 2 OR 3 the only worthwhile indication that the data has arrived is one that is sent by the remote application after it has used the data for the intended purpose.
Everything else is just wishful thinking.
Not from the return value of send(). All send() says is that the data was pushed into the send buffer. Connected stream sockets do not guarantee that all the data is sent, only that what is delivered arrives in order. So you can't assume that your send() will go out in a single packet, or that it will ever go out at all, given network delay or interruption.
If you need a full acknowledgement, you'll have to look at higher application level acks (server sending back a formatted ack message, not just packet ACKs).
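For illustration only, an application-level ack exchange might look like the Python sketch below. It assumes a peer that can be made to reply, and the wire format (`"<id> <payload>\n"` out, `"ACK <id>\n"` back) and the name `send_with_ack` are invented for the example:

```python
import socket

def send_with_ack(sock, message_id, payload, timeout):
    """Send one message and wait for the peer's application-level ack.

    Hypothetical wire format: "<id> <payload>\n" out, "ACK <id>\n" back.
    Returns True only if the matching ack arrives within timeout seconds.
    """
    sock.settimeout(timeout)
    expected = f"ACK {message_id}\n".encode()
    received = b""
    try:
        sock.sendall(f"{message_id} {payload}\n".encode())
        while not received.endswith(b"\n"):
            chunk = sock.recv(1024)
            if not chunk:  # peer closed before acknowledging
                return False
            received += chunk
    except OSError:  # includes timeouts and connection errors
        return False
    return received == expected
```

A False result does not prove the message was lost, only that the application cannot confirm it was handled, which is exactly the distinction the answers above draw.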

Best socket options for client and server that continuously transfer data

I am using Java (although I think the socket options are implemented in most languages) to implement a client and server. The server sends data to the client for processing, which the client acknowledges. On another port the client then sends the results of the processing back to the server. When it comes to options such as
SO_LINGER
SO_KEEPALIVE
SO_NODELAY
SO_REUSEADDRESS
SO_SENDBUFFER
SO_RECBUFFER
TCP_NODELAY
We have noticed that the connection between the client and server occasionally breaks. There will be a timeout on the send or the receive. When this happens we kill the socket and open a new one to continue.
What would be the best options to set in the above scenario, and is there anything that we could do from our side (programmatically or options-wise) to try to minimize the number of times the connection is dropped? We are using normal TCP/IP.
UPDATE:
The bounty on this ends soon. I haven't had a satisfactory answer yet so it is still open. I think everyone is missing the point of the question. What is the best practice with regard to the options above for sockets that continuously chat? I already have a ping packet: if there is no work to be done (hardly ever the scenario), the normal message is sent with no inner elements, so there is always processing.
Strictly speaking, you don't need any of these socket options:
* SO_LINGER
You need to set SO_LINGER only if your application still has outstanding packets to send when close(2) or shutdown(2) has been called. Not really applicable for your application.
* SO_KEEPALIVE
Sending keepalive-pings every two hours would really only help very long-lived but -very- quiet connections going through stateful firewalls with very long session timeouts. (Two hours between pings is entirely too long to be practical in today's Internet.)
* SO_NODELAY
This (presumably an alias for TCP_NODELAY) disables Nagle's algorithm, which is just a small-packet-avoidance mechanism. Perhaps Nagle is getting in the way in your application, but it takes special sequences of packets to introduce 500ms delays into processing; it never just hangs connections.
* SO_REUSEADDRESS
Useful for all 'servers' that listen on well-known port numbers; use on 'clients' is almost always covering up some bug or other, but it is sometimes necessary if requests must come from a well-known port number.
* SO_SENDBUFFER
* SO_RECBUFFER
These buffer sizes influence the kernel-side buffer sizes maintained for receiving or sending data while your program (receive buffer) or the socket (send buffer) isn't yet ready to accept more data. If these are set too small, your application might not transfer data as smoothly as possible, reducing throughput, but it should not lead to any stalls if these are set smaller than optimal. Of course, too large may put unreasonable demands on kernel memory, but there should be a reasonable system-wide maximum allowed size.
* TCP_NODELAY
Disables Nagle. Leaving Nagle enabled is not likely to do more than introduce 500ms delays if your application sends multiple small packets before attempting a blocking read.
Really, you shouldn't need to set any socket options.
Can you distill your code into something that could be pasted here and tested or inspected? I'm used to TCP sessions surviving for days or weeks without trouble, so this is pretty surprising.
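If you do want to experiment anyway, the options discussed above go by slightly different names in the BSD socket API than in the question's list. A Python sketch of setting them; the function name and the 64 KiB buffer size are arbitrary choices for the example, and SO_LINGER is left out because its argument is a struct whose layout differs between platforms:

```python
import socket

def apply_options(sock, buffer_size=64 * 1024):
    """Set the options from the question using their standard names.

    Returns the values read back from the OS, which may differ from the
    requested ones (buffer sizes in particular can be rounded or clamped).
    """
    # SO_KEEPALIVE: let the TCP stack probe an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # SO_REUSEADDR ("SO_REUSEADDRESS" above): allow rebinding a recently used port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # TCP_NODELAY ("SO_NODELAY" above): disable Nagle's algorithm.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # SO_SNDBUF / SO_RCVBUF ("SO_SENDBUFFER" / "SO_RECBUFFER" above).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_size)
    return {
        "keepalive": sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE),
        "reuseaddr": sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR),
        "nodelay": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY),
        "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }
```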
First I think that this page is relevant, regarding half-open connections.
http://nitoprograms.blogspot.com/2009/05/detection-of-half-open-dropped.html
That being said, TCP is designed to hide connection problems, so you may often find yourself in cases where the connection is broken, but neither side thinks it is. You have addressed this partially by using timeouts and taking that as a sign the connection is broken.
Since you are writing the client and server, I would avoid relying on TCP to tell you when the connection is broken altogether. I would just have the server also acknowledge the receipt of the result from the client. Then both sides will expect immediate responses to their messages, and you can track which messages have been ack'd and set an appropriately small timeout for receiving the ack. This is not a timeout on the send or receive, but a timeout on the time between sending a message and receiving the ack for that message. Then you can set the timeout appropriately depending on the quality of your connection (e.g. very small if you are running on loopback, but large if running over wireless with a weak signal).
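The bookkeeping for that kind of send-to-ack timeout can be quite small. A Python sketch, assuming each message carries some unique id; the class name `AckTracker` and the injectable clock are illustrative, not from any library:

```python
import time

class AckTracker:
    """Track sent messages and report those not acknowledged in time."""

    def __init__(self, ack_timeout, clock=time.monotonic):
        self.ack_timeout = ack_timeout  # seconds allowed between send and ack
        self.clock = clock
        self.pending = {}  # message id -> time it was sent

    def sent(self, message_id):
        """Record that message_id has just been written to the socket."""
        self.pending[message_id] = self.clock()

    def acked(self, message_id):
        """Record that the peer acknowledged message_id."""
        self.pending.pop(message_id, None)

    def overdue(self):
        """Ids whose ack is late; any entry here suggests a broken connection."""
        now = self.clock()
        return [m for m, t in self.pending.items() if now - t > self.ack_timeout]
```

Polling `overdue()` from the main loop and reconnecting when it is non-empty replaces the reliance on send or receive timeouts.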
Regarding the options you list, you will want to use SO_REUSEADDRESS so that you won't be prevented from reopening the socket, for example if it hasn't finished closing from a previously killed process.
You probably have, but it is best to check the obvious....
Have you verified that it IS the socket that is timing out, and not your code? Sockets are fairly stable, and while there might be an issue somewhere, it seems more likely that it is in your code. I would use logs, timestamps, and synchronised clocks to be sure.
It may be that you genuinely DO take a long time to do the calculation; in that case, consider adding an 'I'm still thinking about it' message to your protocol that gets sent regularly to keep the connection alive.
Of course networks will drop out from time to time regardless of what you do, and it sounds like you are already handling that case nicely.
try these options
SO_LINGER - for specifying what happens when the socket's close is called while there is unsent data in the queue
TCP_NODELAY - for sending small writes immediately instead of buffering them (disables Nagle's algorithm)
I would strongly encourage you to use a ping/echo model between client and server, so that if no data is sent for x seconds a ping message is sent. A typical reason for a break might be a firewall, which shuts down sockets because of inactivity.
The typical cases where the TCP model fails are physical problems, e.g. a pulled or broken cable, and hangs on one side, where technically someone is still listening until a queue overrun kicks in (which might never happen given your amount of data).
What are the chances the connection is going through a NAT firewall somewhere along the way? Stateful firewalls maintain a table of open connections so that packets belonging to an allowed connection can quickly pass through the system, without forcing firewall admins to write overly-complex rule sets.
The downside is that this table can grow immensely large, so it must be pruned as connections are closed or as they appear to have simply grown stale and died quietly. A connection that has gone silent for 20 minutes is usually quiet enough to be reaped. (Which is really very quick, as the TCP KEEPALIVE is typically two hours, making it nearly useless in the face of NAT firewalls.)
So: is this going through a NAT firewall? Is the connection quiet for long stretches? If so, add a ping/pong to your protocol, and fire it every few minutes.
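The "fire it every few minutes" part only needs an idle timer. A minimal Python sketch; the class name `Heartbeat` and its methods are invented for the example, and actually sending the ping is left to the caller's protocol:

```python
import time

class Heartbeat:
    """Tell the caller when the connection has been idle long enough to ping."""

    def __init__(self, idle_interval, clock=time.monotonic):
        self.idle_interval = idle_interval  # seconds of silence before a ping
        self.clock = clock
        self.last_activity = clock()

    def touch(self):
        """Call whenever any data is sent or received on the connection."""
        self.last_activity = self.clock()

    def ping_due(self):
        """True once the connection has been silent for idle_interval seconds."""
        return self.clock() - self.last_activity >= self.idle_interval
```

The interval should be comfortably shorter than the firewall's idle timeout, so a few minutes would fit the 20-minute figure mentioned above.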