Implement a well-performing "to-send" queue with TCP - Winsock

In order not to flood the remote endpoint, my server app will have to implement a "to-send" queue of packets I wish to send.
I use Windows Winsock with I/O Completion Ports.
So, I know that when my code calls "socket->send(.....)", my custom "send()" function will check whether data is already "on the wire" (towards that socket).
If data is indeed on the wire, it will simply queue the new data to be sent later.
If no data is on the wire, it will call WSASend() to actually send the data.
So far everything is nice.
Now, the size of the data I'm going to send is unpredictable, so I break it into smaller chunks (say, 64 bytes) in order not to waste memory on small packets, and I queue/send these small chunks.
When a "write-done" completion status is given by IOCP regarding the packet I've sent, I send the next packet in the queue.
That's the problem: the speed is awfully low.
I'm actually getting speeds of around 200 kb/s, and that's on a local connection (127.0.0.1).
So, I know I'll have to call WSASend() with several chunks (an array of WSABUF structures), and that will give much better performance, but how much should I send at once?
Is there a recommended number of bytes? I'm sure the answer is specific to my needs, yet I'm also sure there is some "general" point to start from.
Is there any other, better, way to do this?

Of course you only need to resort to providing your own queue if you are trying to send data faster than the peer can process it (either due to link speed or the speed that the peer can read and process the data). Then you only need to resort to your own data queue if you want to control the amount of system resources being used. If you only have a few connections then it is likely that this is all unnecessary, if you have 1000s then it's something that you need to be concerned about. The main thing to realise here is that if you use ANY of the asynchronous network send APIs on Windows, managed or unmanaged, then you are handing control over the lifetime of your send buffers to the receiving application and the network. See here for more details.
And once you have decided that you DO need to bother with this, you then don't always need to bother: if the peer can process the data faster than you can produce it then there's no need to slow things down by queuing on the sender. You'll see that you need to queue data because your write completions will begin to take longer, as the overlapped writes that you issue cannot complete because the TCP stack is unable to send any more data due to flow control issues (see http://www.tcpipguide.com/free/t_TCPWindowSizeAdjustmentandFlowControl.htm). At this point you are potentially using an unconstrained amount of limited system resources (both non-paged pool memory and the number of memory pages that can be locked are limited and, as far as I know, both are used by pending socket writes)...
Anyway, enough of that... I assume you had already achieved good throughput before you added your send queue? To achieve maximum performance you probably need to set the TCP window size to something larger than the default (see http://msdn.microsoft.com/en-us/library/ms819736.aspx) and post multiple overlapped writes on the connection.
Assuming you already HAVE good throughput, you then need to allow a number of pending overlapped writes before you start queuing; this maximises the amount of data that is ready to be sent. Once you have your magic number of pending writes outstanding you can start to queue the data and then send it based on subsequent completions. Of course, as soon as you have ANY data queued, all further data must be queued. Make the number configurable and profile to see what works best as a trade-off between speed and resources used (i.e. the number of concurrent connections that you can maintain).
I tend to queue the whole data buffer that is due to be sent as a single entry in a queue of data buffers. Since you're using IOCP, it's likely that these data buffers are already reference counted so that they can be released when the completions occur and not before, which makes the queuing process simpler: you simply hold a reference to the send buffer whilst the data is in the queue and release it once you've issued the send.
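As a rough illustration of the "allowed pending writes plus queue" idea above (not taken from any particular codebase; the Connection class, kMaxPendingWrites, and the map keyed by OVERLAPPED pointer are all illustrative assumptions, and all calls for a given connection are assumed to be serialised, e.g. by a per-connection lock), a per-connection sender might look something like this:

    #include <winsock2.h>
    #include <cstddef>
    #include <deque>
    #include <map>
    #include <memory>
    #include <vector>

    // One buffer per outstanding WSASend(); the OVERLAPPED lives inside it so the
    // IOCP completion can be matched back to the buffer that it belongs to.
    struct SendBuffer {
        WSAOVERLAPPED overlapped{};
        WSABUF wsaBuf{};
        std::vector<char> data;
    };

    class Connection {
    public:
        explicit Connection(SOCKET s) : socket_(s) {}

        void Send(std::shared_ptr<SendBuffer> buf) {
            // As soon as ANY data is queued, all further data must be queued too,
            // otherwise it would overtake the data already waiting in the queue.
            if (pending_.size() < kMaxPendingWrites && queue_.empty()) {
                IssueWrite(std::move(buf));
            } else {
                queue_.push_back(std::move(buf));
            }
        }

        // Called from the IOCP worker thread with the OVERLAPPED of the completion.
        void OnWriteCompleted(WSAOVERLAPPED* ov) {
            pending_.erase(ov);                  // releases our reference to that buffer
            if (!queue_.empty()) {
                std::shared_ptr<SendBuffer> next = queue_.front();
                queue_.pop_front();
                IssueWrite(std::move(next));
            }
        }

    private:
        void IssueWrite(std::shared_ptr<SendBuffer> buf) {
            buf->wsaBuf.buf = buf->data.data();
            buf->wsaBuf.len = static_cast<ULONG>(buf->data.size());
            WSAOVERLAPPED* ov = &buf->overlapped;
            pending_[ov] = buf;                  // keep the buffer alive until completion
            // Error handling (SOCKET_ERROR without WSA_IO_PENDING) omitted in this sketch.
            WSASend(socket_, &buf->wsaBuf, 1, nullptr, 0, ov, nullptr);
        }

        static constexpr std::size_t kMaxPendingWrites = 4;  // configurable; profile to tune
        SOCKET socket_;
        std::deque<std::shared_ptr<SendBuffer>> queue_;
        std::map<WSAOVERLAPPED*, std::shared_ptr<SendBuffer>> pending_;
    };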
Personally I wouldn't optimise by using scatter/gather writes with multiple WSABUFs until you have the basic version working and you know that doing so actually improves performance; I doubt that it will if you already have enough data pending, but as always, measure and you will know.
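If measurement does show that gather writes help, a minimal sketch of draining several queued chunks into one WSASend() call might look like this (the Chunk type, the 16-buffer cap, and the caller-supplied in-flight list are assumptions for illustration; one completion then covers the whole batch):

    #include <winsock2.h>
    #include <cstddef>
    #include <deque>
    #include <memory>
    #include <vector>

    struct Chunk {
        std::vector<char> data;
    };

    // Drain up to maxBuffers queued chunks into a single overlapped WSASend() using
    // an array of WSABUFs. The chunks are moved into inFlight so they stay alive
    // until the one completion for 'ov' arrives, at which point the caller releases them.
    int SendBatch(SOCKET s, std::deque<std::shared_ptr<Chunk>>& queue,
                  WSAOVERLAPPED* ov, std::vector<std::shared_ptr<Chunk>>& inFlight,
                  std::size_t maxBuffers = 16) {
        std::vector<WSABUF> bufs;
        while (!queue.empty() && bufs.size() < maxBuffers) {
            std::shared_ptr<Chunk> chunk = queue.front();
            queue.pop_front();
            WSABUF wb;
            wb.buf = chunk->data.data();
            wb.len = static_cast<ULONG>(chunk->data.size());
            bufs.push_back(wb);
            inFlight.push_back(std::move(chunk));
        }
        if (bufs.empty()) return 0;
        return WSASend(s, bufs.data(), static_cast<DWORD>(bufs.size()),
                       nullptr, 0, ov, nullptr);
    }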
64 bytes is too small.
You may have already seen this but I wrote about the subject here: http://www.lenholgate.com/blog/2008/03/bug-in-timer-queue-code.html though it's possibly too vague for you.

Related

How exactly do socket receives work at a lower level (eg. socket.recv(1024))?

I've read many Stack Overflow questions similar to this, but I don't think any of the answers really satisfied my curiosity. I have an example below for which I would like to get some clarification.
Suppose the client is blocking on socket.recv(1024):
socket.recv(1024)
print("Received")
Also, suppose I have a server sending 600 bytes to the client. Let us assume that these 600 bytes are broken into 4 small packets (of 150 bytes each) and sent over the network. Now suppose the packets reach the client at different timings with a difference of 0.0001 seconds (eg. one packet arrives at 12.00.0001pm and another packet arrives at 12.00.0002pm, and so on..).
How does socket.recv(1024) decide when to return execution to the program and allow the print() function to execute? Does it return execution immediately after receiving the 1st packet of 150 bytes? Or does it wait for some arbitrary amount of time (eg. 1 second, for which by then all packets would have arrived)? If so, how long is this "arbitrary amount of time"? Who determines it?
Well, that will depend on many things, including the OS and the speed of the network interface. For a 100 gigabit interface, that 100 µs is "forever," but for a 10 mbit interface you can't even transmit the packets that fast. So I won't pay too much attention to the exact timing you specified.
Back in the day when TCP was being designed, networks were slow and CPUs were weak. Among the flags in the TCP header is the "Push" flag, which signals that the payload should be immediately delivered to the application. So if we hop into the Wayback machine, the answer would have been something like: it depends on whether or not the PSH flag is set in the packets. However, there is generally no user-space API to control whether or not the flag is set. Generally what would happen is that, for a single write that gets broken into several packets, the final packet would have the PSH flag set. So the answer for a slow network and a weakling CPU might be that, if it was a single write, the application would likely receive all 600 bytes in a single read. You might then think that using four separate writes would result in four separate reads of 150 bytes, but after the introduction of Nagle's algorithm the data from the second to fourth writes might well be sent in a single packet unless Nagle's algorithm was disabled with the TCP_NODELAY socket option, since Nagle's algorithm will wait for the ACK of the first packet before sending anything less than a full frame.
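For reference, disabling Nagle's algorithm is just a socket option; a minimal POSIX-flavoured sketch (the helper name is mine) looks like this:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    // Disable Nagle's algorithm on a connected TCP socket so small writes are sent
    // immediately instead of being held back while waiting for outstanding ACKs.
    // Returns 0 on success, -1 on error (errno set), as setsockopt() does.
    int disable_nagle(int sock) {
        int one = 1;
        return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }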
If we return from our trip in the Wayback machine to the modern age, where 100 Gigabit interfaces and 24-core machines are common, our problems are very different and you will have a hard time finding an explicit check for the PSH flag being set in the Linux kernel. What is driving the design of the receive side is that networks are getting way faster while the packet size/MTU has been largely fixed and CPU speed is flatlining but cores are abundant. Reducing per-packet overhead (including hardware interrupts) and distributing the packets efficiently across multiple cores is imperative. At the same time it is imperative to get the data from that 100+ Gigabit firehose up to the application ASAP. One hundred microseconds of data on such a NIC is a considerable amount of data to be holding onto for no reason.
I think one of the reasons that there are so many questions of the form "What the heck does receive do?" is that it can be difficult to wrap your head around what is a thoroughly asynchronous process, whereas the send side has a more familiar control flow where it is much easier to trace the flow of packets to the NIC and where we are in full control of when a packet will be sent. On the receive side, packets just arrive when they want to.
Let's assume that a TCP connection has been set up and is idle, there is no missing or unacknowledged data, the reader is blocked on recv, and the reader is running a fresh version of the Linux kernel. And then a writer writes 150 bytes to the socket and the 150 bytes gets transmitted in a single packet. On arrival at the NIC, the packet will be copied by DMA into a ring buffer, and, if interrupts are enabled, it will raise a hardware interrupt to let the driver know there is fresh data in the ring buffer. The driver, which desires to return from the hardware interrupt in as few cycles as possible, disables hardware interrupts, starts a soft IRQ poll loop if necessary, and returns from the interrupt. Incoming data from the NIC will now be processed in the poll loop until there is no more data to be read from the NIC, at which point it will re-enable the hardware interrupt. The general purpose of this design is to reduce the hardware interrupt rate from a high speed NIC.
Now here is where things get a little weird, especially if you have been looking at nice clean diagrams of the OSI model where higher levels of the stack fit cleanly on top of each other. Oh no, my friend, the real world is far more complicated than that. That NIC that you might have been thinking of as a straightforward layer 2 device, for example, knows how to direct packets from the same TCP flow to the same CPU/ring buffer. It also knows how to coalesce adjacent TCP packets into larger packets (although this capability is not used by Linux and is instead done in software). If you have ever looked at a network capture, seen a jumbo frame, and scratched your head because you sure thought the MTU was 1500, it is because this processing happens at such a low level that it occurs before netfilter can get its hands on the packet. This packet coalescing is part of a capability known as receive offloading; in particular, let's assume that your NIC/driver has generic receive offload (GRO) enabled (which is not the only possible flavor of receive offloading), the purpose of which is to reduce the per-packet overhead from your firehose NIC by reducing the number of packets that flow through the system.
So what happens next is that the poll loop keeps pulling packets off of the ring buffer (as long as more data is coming in) and handing them off to GRO to consolidate if it can, and then they get handed off to the protocol layer. As best I know, the Linux TCP/IP stack is just trying to get the data up to the application as quickly as it can, so I think your question boils down to "Will GRO do any consolidation on my 4 packets, and are there any knobs I can turn that affect this?"
Well, the first thing you can do is disable any form of receive offloading (e.g. via ethtool), which I think should get you 4 reads of 150 bytes for 4 packets arriving like this in order, but I'm prepared to be told I have overlooked another reason why the Linux TCP/IP stack won't send such data straight to the application if the application is blocked on a read as in your example.
The other knob you have if GRO is enabled is GRO_FLUSH_TIMEOUT which is a per NIC timeout in nanoseconds which can be (and I think defaults to) 0. If it is 0, I think your packets may get consolidated (there are many details here including the value of MAX_GRO_SKBS) if they arrive while the soft IRQ poll loop for the NIC is still active, which in turn depends on many things unrelated to your four packets in your TCP flow. If non-zero, they may get consolidated if they arrive within GRO_FLUSH_TIMEOUT nanoseconds, though to be honest I don't know if this interval could span more than one instantiation of a poll loop for the NIC.
There is a nice writeup on the Linux kernel receive side here which can help guide you through the implementation.
A normal blocking receive on a TCP connection returns as soon as there is at least one byte to return to the caller. If the caller would like to receive more bytes, they can simply call the receive function again.
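So if the application needs a fixed number of bytes, it just loops; a small sketch of that pattern (the helper name is illustrative):

    #include <sys/socket.h>
    #include <sys/types.h>
    #include <cstddef>

    // Blocking recv() returns as soon as at least one byte is available, so loop
    // until exactly `len` bytes have been collected. Returns len on success,
    // 0 if the peer closed before enough data arrived, or -1 on error.
    ssize_t recv_exact(int sock, char* buf, std::size_t len) {
        std::size_t total = 0;
        while (total < len) {
            ssize_t n = recv(sock, buf + total, len - total, 0);
            if (n <= 0) return n;  // note: any partial data already read is discarded here
            total += static_cast<std::size_t>(n);
        }
        return static_cast<ssize_t>(total);
    }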

Ajax polling vs SSE (performance on server side)

I'm curious whether there is some kind of standard threshold for when it is better to use Ajax polling instead of SSE, from a server-side viewpoint.
1 request every second: I'm pretty sure SSE is better.
1 request per minute: I'm pretty sure Ajax is better.
But what about 1 request every 5 seconds? How can we calculate where the limit frequency for Ajax vs SSE lies?
There is no way that 1 request per minute is always better with Ajax, so that assumption is flawed from the start. Any kind of frequent polling is nearly always a costly choice. It seems from our previous conversation in comments on another question that you start with a belief that an open TCP socket (whether an SSE connection or a webSocket connection) is somehow costly to server performance. An idle TCP connection takes zero CPU (maybe every once in a long while a keep-alive might be sent, but other than that, an idle socket does not use CPU). It does use a bit of server memory to handle the socket descriptor, but a highly tuned server can have 1,000,000 open sockets at once. So, your CPU usage is going to be more about how many connections are being established and what they are asking the server to do every time they are established than about how many open (and mostly idle) connections there are.
Remember, every http connection has to create a TCP socket (which involves round trips between client and server), then send the http request, then get the http response, then close the socket. That's a lot of round trips of data to do every minute. If the connection is https, it's even more work and more round trips to establish the connection because of the crypto layer and certificate verification. So doing all that every minute for hundreds of thousands of clients seems like a massive waste of resources and bandwidth when you could create one SSE connection and have the client just listen for data streamed from the server over that connection.
As I said in our earlier comment exchange on a different question, these types of questions are not really answerable in the abstract. You have to have specific requirements of both client and server and a specific understanding of the data being delivered and how urgent it is on the client and therefore a specific polling interval and a specific scale in order to begin to do some calculations or test harnesses to evaluate which might be the more desirable way to do things. There are simply too many variables to come up with a purely hypothetical answer. You have to define a scenario and then analyze different implementations for that specific scenario.
Number of requests per second is only one of many possible variables. For example, if most of the time you poll there's actually nothing new, then that gives even more of an advantage to the SSE case, because it would have nothing to do at all (zero load on the server other than a little bit of memory used for an open socket most of the time) whereas polling creates continual load, even when there is nothing to do.
The #1 advantage to server push (whether implemented with SSE or webSocket) is that the server only has to do anything with the client when there is actually pertinent data to send to that specific client. All the rest of the time, the socket is just sitting there idle (perhaps occasionally, on a long interval, sending a keep-alive).
The #1 disadvantage to polling is that there may be lots of times that the client is polling the server and the server has to expend resources to deal with the polling request only to inform that client that it has nothing new.
How can we calculate where the limit frequency for Ajax vs SSE lies?
It's a pretty complicated process. Lots of variables in a specific scenario need to be defined. It's not as simple as just requests/sec. Then, you have to decide what you're attempting to measure or evaluate and at what scale? "Server performance" is the only thing you mention, but that has to be completely defined and different factors such as CPU usage and memory usage have to be weighted into whatever you're measuring or calculating. Then, you may even need to run some test harnesses if the calculations don't yield an obvious answer or if the decision is so critical that you want to verify your calculations with real metrics.
It sounds like you're looking for an answer like "at greater than x requests/min, you should use polling instead of SSE" and I don't think there is an answer that simple. It depends upon far more things than requests/min or requests/sec.
"Polling" incurs overhead on all parties. If you can avoid it, don't poll.
If SSE is an option, it might be a good choice. "It depends".
Q: What (if any) kind of "event(s)" will your app need to handle?

Push data to client, how to handle slow clients?

In a push model, where the server pushes data to clients, how does one handle clients with low or variable bandwidth?
For example, I receive data from a producer and send the data to my clients (push). What if one of my clients decides to download a Linux ISO, so the available bandwidth to this client becomes too little to download my data?
Now when my producer produces data and the server pushes it to the clients, all clients will have to wait until all clients have downloaded the data. This is a problem when there are one or more slow clients with little bandwidth.
I can cache the data to be sent for every client, but because the data size is big this isn't really an option (lots of clients * data size = huge memory requirements).
How is this generally solved? No need for code, just a few thoughts/ideas are already more than welcome.
Now when my producer produces data and the server pushes it to the clients, all clients will have to wait until all clients have downloaded the data.
The above shouldn't be the case -- your clients should be able to download asynchronously from each other, with each client maintaining its own independent download state. That is, client A should never have to wait for client B to finish, and vice versa.
I can cache the data to be sent for every client, but because the data size is big this isn't really an option (lots of clients * data size = huge memory requirements).
As Warren said in his answer, this problem can be reduced by keeping only one copy of the data rather than one copy per client. Reference counting (e.g. via shared_ptr if you are using C++, or something equivalent in another language) is an easy way to make sure that the shared data is deleted only when all clients are done downloading it. You can make the sharing more fine-grained, if necessary, by breaking up the data into chunks (e.g. instead of all clients holding a reference to a single 800MB Linux ISO, you could break it up into 800 1MB chunks, so that you can start removing the earlier chunks from memory as soon as all clients have downloaded them, instead of having to hold the entire 800MB of data in memory until every client has downloaded the entire thing).
Of course, that sort of optimization only gets you so far -- e.g. if two clients each request a different 800MB file, then you're liable to end up with 1.6GB of RAM usage for caching, unless you come up with a more clever solution.
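A rough sketch of that shared, reference-counted chunking (the Chunk alias, the 1MB chunk size, and the helper names are illustrative; each client's queue just holds shared_ptrs into the same chunk list, so a chunk's memory is released once the last client that still needs it has sent it):

    #include <algorithm>
    #include <cstddef>
    #include <deque>
    #include <memory>
    #include <vector>

    using Chunk = std::vector<char>;

    // Per-client download position: a queue of references into the shared chunk list.
    struct ClientQueue {
        std::deque<std::shared_ptr<const Chunk>> toSend;
    };

    // Split the full data set once into shared, immutable chunks.
    std::vector<std::shared_ptr<const Chunk>> SplitIntoChunks(const std::vector<char>& data,
                                                              std::size_t chunkSize = 1 << 20) {
        std::vector<std::shared_ptr<const Chunk>> chunks;
        for (std::size_t off = 0; off < data.size(); off += chunkSize) {
            std::size_t n = std::min(chunkSize, data.size() - off);
            chunks.push_back(std::make_shared<const Chunk>(data.begin() + off,
                                                           data.begin() + off + n));
        }
        return chunks;
    }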
Here are some possible approaches you could try (from less complex to more complex). You could try any of these either separately or in combination:
Monitor how much each client's "backlog" is -- that is, keep a count of the amount of data you have cached waiting to send to that client. Also keep track of the number of bytes of cached data your server is currently holding; if that number gets too high, force-disconnect the client with the largest backlog, in order to free up memory. (this doesn't result in a good user experience for the client, of course; but if the client has a buggy or slow connection he was unlikely to have a good user experience anyway. It does keep your server from crashing or swapping itself to death because a single client has a bad connection)
Keep track of how much data your server has cached and waiting to send out. If the amount of data you have cached is too large (for some appropriate value of "too large"), temporarily stop reading from the socket(s) that are pushing the data out to you (or if you are generating your data internally, temporarily stop generating it). Once the amount of cached data gets down to an acceptable level again, you can resume receiving (or generating) more data to push.
(this may or may not be applicable to your use-case) Revise your data model so that instead of being communications-oriented, it becomes state-oriented. For example, if your goal is to update the clients' state to match the state of the data source, and you can organize the data source's state into a set of key/value pairs, then you can require that the data source include a key with each piece of data it sends. Whenever a key/value pair is received from the data source, simply place that key/value pair into a map (or hash table or some other key/value-oriented data structure) for each client (again, use shared_ptrs or similar here to keep memory usage reasonable). Whenever a given client has drained its queue of outgoing TCP data, remove the oldest item from that client's key/value map, convert it into TCP bytes to send, and add them to the outgoing-TCP-data queue. Repeat as necessary. The advantage of this is that "obsolete" values for a given key are automatically dropped inside the server and therefore never need to be sent to the slow clients; rather, the slow clients will only ever get the "latest" value for that given key. The beneficial consequence is that a given client's maximum "backlog" will be limited by the number of keys in the state model, regardless of how slow or intermittent the client's bandwidth is. Thus a slow client might see fewer updates (per second/minute/hour), but the updates it does see will still be as recent as possible given its bandwidth (a rough sketch of this idea follows below).
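A minimal sketch of that state-oriented approach, assuming string keys and values; the names ClientState, OnValueFromSource, and FillOutgoingBuffer are illustrative, and the serialization is left as a placeholder:

    #include <map>
    #include <memory>
    #include <string>

    // Each client keeps a map of key -> latest value. Newer values overwrite older
    // ones, so a slow client's backlog is bounded by the number of keys rather than
    // by how fast the data source produces updates.
    struct ClientState {
        std::map<std::string, std::shared_ptr<const std::string>> pending;
    };

    // Called whenever the data source publishes a key/value pair.
    void OnValueFromSource(std::map<int, ClientState>& clientsByFd,
                           const std::string& key,
                           std::shared_ptr<const std::string> value) {
        for (auto& entry : clientsByFd)
            entry.second.pending[key] = value;  // any obsolete value for this key is dropped here
    }

    // Called when a client's outgoing TCP buffer has drained and can accept more data.
    // Returns false if there is nothing pending for this client.
    bool FillOutgoingBuffer(ClientState& client, std::string& out) {
        if (client.pending.empty()) return false;
        auto it = client.pending.begin();  // next key to send (a real protocol would
        out = *it->second;                 // serialize both the key and the value)
        client.pending.erase(it);
        return true;
    }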
Cache the data once only, and have each client handler keep track of where it is in the download, all using the same cache. Once all clients have all the data, the cached data can be deleted.

Rely on socket setReceiveBufferSize or use a BlockingQueue?

I am writing an application that will involve reading a lot of data sent to my socket.
The problem I have is whether I should rely on the socket's setReceiveBufferSize, putting a big value there in the hope that it will hold all the data until I am able to process it, or use a BlockingQueue, putting everything there and then processing it from another thread that keeps polling and processing the data.
Also, is it bad design if I leave the queue without a maximum number of elements (so I'm just telling it, "yeah, receive as many elements as you'd like")? I'm referring to the memory consumption if I receive a really big number of elements.
Regards,
Aurelian
A large socket receive buffer is always a good idea, but for TCP windowing/throughput reasons, not because you may be slow reading. You should be aiming to read the input as fast as possible, certainly as fast as it arrives. The proposed BlockingQueue is a complete waste of time and space. If the socket receive buffer fills up, the sender will stall. That's all you need.
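For what it's worth, a quick sketch of asking for a larger receive buffer at the socket-option level (POSIX flavour shown here; Java's setReceiveBufferSize wraps the same SO_RCVBUF option, and the OS may clamp the requested value):

    #include <sys/socket.h>

    // Request a larger socket receive buffer so the TCP window can grow; check the
    // value actually granted with getsockopt(SO_RCVBUF) afterwards if it matters.
    int set_receive_buffer(int sock, int bytes) {
        return setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
    }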

How much to read from socket when using select

I'm using select() to listen for data on multiple sockets. When I'm notified that there is data available, how much should I read()?
I could loop over read() until there is no more data, process the data, and then return back to the select loop. However, I can imagine that a socket receives so much data so fast that it temporarily 'starves' the other sockets. Especially since I am thinking of using select also for inter-thread communication (message-passing style), I'd like to keep latency low. Is this an issue in reality?
The alternative would be to always read a fixed size of bytes, and then return to the loop. The downside here would be added overhead when there is more data available than fits into my buffer.
What's the best practice here?
Not sure how this is implemented on other platforms, but on Windows the ioctlsocket(FIONREAD) call tells you how many bytes can be read by a single call to recv(). More bytes could be in the socket's queue by the time you actually call recv(). The next call to select() will report the socket is still readable, though.
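A small sketch of that on Winsock (the helper name is mine): query FIONREAD and read at most that much in one recv() call.

    #include <winsock2.h>

    // FIONREAD reports how many bytes a single recv() could return right now; more
    // may arrive afterwards, and select() will simply report the socket readable again.
    int ReadWhatIsPending(SOCKET s, char* buf, int bufLen) {
        u_long pending = 0;
        if (ioctlsocket(s, FIONREAD, &pending) != 0)
            return SOCKET_ERROR;
        if (pending == 0)
            return 0;
        int toRead = static_cast<int>(pending);
        if (toRead > bufLen)
            toRead = bufLen;
        return recv(s, buf, toRead, 0);
    }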
The too-common approach here is to read everything that's pending on a given socket, especially if one moves to platform-specific advanced polling APIs like kqueue(2) and epoll(7) enabling edge-triggered events. But, you certainly don't have to! Flip a bit associated with that socket somewhere once you think you got enough data (but not everything), and do more recv(2)'s later, say at the very end of the file-descriptor checking loop, without calling select(2) again.
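One possible shape for that bounded-read idea (the Conn struct, the flag, and the 16 KiB cap are illustrative assumptions): read a capped amount per readiness notification, remember that the socket may still hold data, and revisit flagged sockets at the end of the loop before polling again.

    #include <sys/socket.h>
    #include <sys/types.h>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t kMaxPerPass = 16 * 1024;  // per-notification read cap

    struct Conn {
        int fd;
        bool maybeMoreData = false;  // revisit before the next select()/poll if set
    };

    void ServiceReadable(Conn& c, std::vector<char>& buf) {
        buf.resize(kMaxPerPass);
        ssize_t n = recv(c.fd, buf.data(), buf.size(), 0);
        if (n > 0) {
            buf.resize(static_cast<std::size_t>(n));
            // A completely full buffer suggests more data is still queued on the socket.
            c.maybeMoreData = (static_cast<std::size_t>(n) == kMaxPerPass);
        } else {
            buf.clear();
            c.maybeMoreData = false;  // 0 = peer closed, -1 = error or EWOULDBLOCK
        }
    }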
Then, the question is too general. What are your goals? Low latency? High throughput? Scalability? There's no single answer to everything (well, except for 42 :)