Windows IOCP for real time - sockets

I have a question related to IOCP networking strategy for our C++ server application.
Our server simulates a large number of devices that communicate over UDP using short messages (under 2K). The server also has a soft real-time constraint of 70-100 milliseconds. The networking part of the application currently starts one thread per socket, which means hundreds of threads. Each thread watches its UDP socket and, when data arrives, copies it into the queue of our real-time thread.
We are being asked to support more and more devices, and I was thinking that rewriting the communication module to use IOCP would make our server more efficient. I developed a prototype based on code I was able to find online, but the combination of
WSARecvFrom (Initiates receive)
GetQueuedCompletionStatus
OnDataRecieved (A method of my class that gets called when data is copied into my buffer)
does not seem efficient at all. The gaps between data arrival on a given socket are 500-600 milliseconds.
I only started prototyping and did not profile a whole lot.
My questions are:
Can IOCP be used for my scenario or is it designed for high throughput only?
Will WSAAsyncSelect (with hidden windows) be more efficient for my use case?
Thanks in advance,
Michael
Edit:
I noticed while profiling that the problem starts with:
- WSASendTo
- GetQueuedCompletionStatus
- OnDataSent
Looks like GetQueuedCompletionStatus doesn't wake up fast enough.

Related

How can I automatically test a networking (TCP/IP) application?

I teach students to develop network applications, both clients and servers. At this moment, we have not yet touched existing protocols such as HTTP, SMTP, etc. The students write very simple programs on top of the plain socket API. Currently I check the students' work manually, but I want to automate this task and create an automated test bench for networking applications. The most interesting topics for testing are:
1. Breaking TCP segments into small parts and delivering them with a noticeable delay. The reason I need such a test is that students usually just issue a single read/recv call and process the received data without checking that all the necessary data was received. TCP doesn't preserve message boundaries, so in certain circumstances it is necessary to make several read/recv calls. The problem is that in most simple network applications (for example, a chat application) messages are small and fit into a single TCP segment, so the issue doesn't appear. My idea is to artificially break messages into several small TCP segments (i.e. a few bytes of data each) so that the problem will appear.
2. Pausing the data transfer for some time to simulate multiple slow clients and check that the multithreading/async sockets are implemented properly in the students' servers.
3. Resetting a connection at random moments in time.
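To illustrate the pitfall behind the first test, here is a minimal sketch (in Python, for brevity) of the receive loop that a correct program needs. The helper name `recv_exact` and the assumption that the receiver knows the message length in advance are mine, chosen only for illustration.

```python
import socket


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, calling recv() as many times as needed."""
    chunks = []
    remaining = n
    while remaining > 0:
        # recv() may return fewer bytes than requested, even for tiny messages.
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A program that calls `recv()` once and assumes it got the whole message works by accident on small messages; a test bench that fragments the stream is what exposes the difference.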
I've found several systems which simulate a bad network (dummynet, clumsy, netem). However, they all work at the IP level of the stack, so the OS and its TCP implementation will compensate for the data loss. Such systems can solve task 2, but not tasks 1 and 3. So I think I need to develop my own solution, which will act as a TCP proxy. My questions are:
Are there any libraries or applications which can (at least partially) solve these tasks, so that I can use them as a base for my own solution?
If there are no suitable existing software projects, are there any ideas or approaches for how to do this properly?
From WireShark mailing list - Creating and Modifying Packets:
...There's a "Tools" page on the Wireshark Wiki:
http://wiki.wireshark.org/Tools
which has a "Traffic generators" section:
https://wiki.wireshark.org/Tools#Traffic_generators
which lists some tools that might be useful...
The "Traffic generators" chapter also mentions another collection of traffic generators
If you write your own socket code, you can address all 3 tasks.
1. Enable the socket's TCP_NODELAY option (disable the Nagle Algorithm for Send Coalescing) via setsockopt(), then you can send() small fragments of data as you wish, optionally with a delay in between (see #2).
2. Simply put a delay in between your send() calls.
3. Use setsockopt() to adjust the socket's SO_LINGER and SO_DONTLINGER options to control whether closing the socket performs an abortive or graceful closure, then simply close the socket at some random interval after the connection is established.
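The three points above could be sketched roughly as follows. This is Python rather than C, and the function names, chunk size, and delay are assumptions made for illustration. Note that TCP_NODELAY only stops the sender from coalescing small writes; it makes short reads on the other side likely, not guaranteed.

```python
import socket
import struct
import sys
import time


def send_fragmented(sock, data, chunk_size=3, delay=0.05):
    """Send data in small pieces so the receiver is likely to see short reads."""
    # Disable Nagle's algorithm so small writes are not coalesced by the sender.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    for start in range(0, len(data), chunk_size):
        sock.sendall(data[start:start + chunk_size])
        time.sleep(delay)  # the pause between send() calls simulates a slow peer


def abort_connection(sock):
    """Close with SO_LINGER enabled and a zero timeout (an abortive close)."""
    # struct linger is two ints on POSIX and two unsigned shorts on Windows.
    fmt = "HH" if sys.platform == "win32" else "ii"
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack(fmt, 1, 0))
    sock.close()
```

A test proxy would call `send_fragmented()` when relaying data to the program under test and `abort_connection()` after a random interval to simulate a reset.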

Akka IO app consumes 100% cpu

I am trying to profile an Akka app that is constantly at or near 100% CPU usage. I took a CPU sample using visualvm. The sample indicates that 2 threads make up 98.9% of CPU usage. 79% of the CPU time was spent on a method called sun.misc.Unsafe. Other answers on SO say that this just means a thread is waiting, but in the native implementation layer (outside of the JVM).
In questions similar to mine, people have been told to look elsewhere without being given specifics. Where should I look to figure out what's causing the cpu spike?
The application is a server that primarily uses Akka IO to listen for TCP socket connections.
Without seeing any of the source code, or even knowing what IO channel you are talking about (sockets, files, etc), there is very little insight that anyone here can give you.
I do have some rather general suggestions though.
First, you should be using reactive techniques and reactive IO in your application. This issue could be occurring because you are polling the status of some resource in a tight loop, or using a blocking call when you should be using a reactive one. This tends to be an anti-pattern and a performance drain exactly because you can spend CPU cycles doing nothing but "actively waiting". I recommend double checking for:
resource polling
blocking calls
system calls
disk flushes
waiting on a Future when it would be appropriate to map it instead
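As a rough illustration of the last item, here is what "mapping" a future instead of waiting on it can look like. This sketch uses Python's `concurrent.futures` rather than Scala futures, and `map_future` is a hypothetical helper written for this example, not a library function.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def map_future(source: Future, fn) -> Future:
    """Return a new Future that completes with fn(result) once source completes."""
    mapped = Future()

    def on_done(done: Future) -> None:
        try:
            # done is already complete here, so result() does not block.
            mapped.set_result(fn(done.result()))
        except Exception as exc:  # propagate failures from source or from fn
            mapped.set_exception(exc)

    source.add_done_callback(on_done)
    return mapped


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        length = map_future(pool.submit(lambda: "payload"), len)
        print(length.result())  # the only blocking call, at the edge of the program
```

The calling thread is free to do other work between submitting the task and the callback running, which is the point of mapping rather than waiting.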
Second, you should NOT be using mutexes or other thread synchronization in your application. If you are, then you might be suffering from a live-lock. Unlike dead-locks, live-locks manifest with symptoms like 100% CPU usage as threads constantly lock and unlock concurrency primitives in an attempt to "catch them all". Wikipedia has a nice technical description of what a live-lock looks like. With Akka in place you shouldn't have any need for mutexes or any other thread synchronization primitives. If you do use them, you probably need to redesign your application.
Third, you should be throttling IO (as well as error handling like reconnection attempts). This issue could be occurring because your system lacks effective throttling. Often with data channels we leave their bandwidth unconstrained. However this can become an issue when that channel reaches 100% saturation and begins to steal resources from other parts of the system. This can happen, for example, if you are moving large files around without a reasonable limit.
You also need to throttle connection retries when you encounter errors, rather than retrying immediately. Lots of systems will attempt to reconnect to a server if they lose their connection. While normally desirable, this can lead to problematic behavior if you use a naive reconnection strategy. For example, imagine a network client that was written this way:
class MyClient extends Client {
  // ... other code ...
  def onDisconnect() = {
    reconnect()
  }
}
Every time the Client disconnects for ANY reason it will attempt to reconnect. You can see how this would cause a tight loop between the error handling code and the Client if the Wi-Fi cut out or a network cable was unplugged.
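One common way to throttle retries is exponential backoff with a cap and some jitter. The sketch below is in Python rather than Scala, and every name and default value in it (`backoff_delays`, `reconnect_with_backoff`, the 0.5 s base, the 30 s cap) is an assumption picked for illustration.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0):
    """Yield exponentially growing delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def reconnect_with_backoff(connect, max_attempts=8, sleep=time.sleep, jitter=0.1):
    """Call connect() until it succeeds, sleeping longer after each failure."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # Random jitter keeps many clients from retrying in lockstep.
            pause = next(delays)
            sleep(pause + random.uniform(0, jitter * pause))
```

With these defaults a client that loses its network waits 0.5 s, 1 s, 2 s and so on between attempts, instead of spinning in a tight loop.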
Fourth, your application should have well defined data sources and sinks. Your issue could be caused by a "data loop", that is some set of Akka actors that are just sending messages to the next actor in the chain, with the last actor sending the message back to the first actor in the chain. Make sure you have a clear and definite way for messages to enter and exit your system.
Fifth, use appropriate profiling and instrumentation for your application. Instrument your application using Kamon or Coda Hale's Metrics library.
Finding an appropriate profiler will be more difficult, since we as a community have far to go to develop mature tools for reactive applications. Personally I have found visualvm useful, but not always overwhelmingly helpful for detecting code paths that are CPU bound. The issue is that sampling profilers are only able to collect data when the JVM reaches a safepoint. This has the potential to bias certain code paths. The fix is to use a profiler that supports AsyncGetCallTrace.
Best of luck! And please add more context if you can.

.Net 4.5 TCP Server scale to thousands of connected clients

I need to build a TCP server using C# .NET 4.5+. It must be capable of comfortably handling at least 3,000 connected clients that will send messages every 10 seconds, with a message size of 250 to 500 bytes.
The data will be offloaded to another process or queue for batch processing and logging.
I also need to be able to select an existing client and send and receive messages (greater than 500 bytes) from within a Windows Forms application.
I have not built an application like this before so my knowledge is based on the various questions, examples and documentation that I have found online.
My conclusion is:
1. Non-blocking async is the way to go. Stay away from creating multiple threads and blocking IO.
2. SocketAsyncEventArgs - Is complex and really only needed for very large systems, BTW what constitutes a very large system? :-)
3. BeginXXX methods will suffice (APM).
4. Using TAP I can simplify 3. by using Task.Factory.FromAsync, but it only produces the same outcome.
5. Use a global collection to keep track of the connected TCP clients.
What I am unsure about:
1. Should I use a ManualResetEvent when interacting with the TCP Client collection? I presume the async events will need to lock access to this collection.
2. Best way to detect a disconnected client after I have called BeginReceive. I've found the call is stuck waiting for a response so this needs to be cleaned up.
3. Sending messages to a specific TCP Client. I'm thinking of a function in a custom TCP session class to send a message. Again in an async model, would I need to create a timer based process that inspects a message queue or would I create an event on a TCP Session class that has access to the TcpClient and associated stream? Really interested in opinions here.
4. I'd like to use a single thread for the entire service and use non-blocking principles within it. Is there anything I should be mindful of, especially in the context of 1. (ManualResetEvent etc.)?
Thank you for reading. I am keen to hear constructive thoughts and/or links to best practices/examples. It's been a while since I've coded in C# so apologies if some of my questions are obvious. Tasks, async/await are new to me! :-)
I need to build a TCP server using C# .NET 4.5+
Well, the first thing to determine is whether it has to be bare-bones TCP/IP. If you possibly can, write one that uses a higher-level abstraction, like SignalR or WebAPI. If you can write one using WebSockets (SignalR), then do that and never look back.
Your conclusions sound pretty good. Just a few notes:
SocketAsyncEventArgs - Is complex and really only needed for very large systems, BTW what constitutes a very large system? :-)
It's not so much a "large" system in the terms of number of connections. It's more a question of how much traffic is in the system - the number of reads/writes per second.
The only thing that SocketAsyncEventArgs does is make your I/O structures reusable. The Begin*/End* (APM) APIs will create a new IAsyncResult for each I/O operation, and this can cause pressure on the garbage collector. SocketAsyncEventArgs is essentially the same as IAsyncResult, only it's reusable. Note that there are some examples on the 'net that use the SocketAsyncEventArgs APIs without reusing the SocketAsyncEventArgs structures, which is completely ridiculous.
And there are no guidelines here: heavier hardware will be able to use the APM APIs for much more traffic. As a general rule, you should build a bare-bones APM server and load test it first, and only move to SAEA if it doesn't work on your target server's hardware.
On to the questions:
Should I use a ManualResetEvent when interacting with the TCP Client collection? I presume the async events will need to lock access to this collection.
If you're using TAP-based wrappers, then await will resume on a captured context by default. I explain this in my blog post on async/await.
There are a couple of approaches you can take here. I have successfully written a reliable and performant single-threaded TCP/IP server; the equivalent for modern code would be to use something like my AsyncContextThread class. It provides a context that will cause await to resume on that same thread by default.
The nice thing about single-threaded servers is that there's only one thread, so no synchronization or coordination is necessary. However, I'm not sure how well a single-threaded server would scale. You may want to give that a try and see how much load it can take.
If you do find you need multiple threads, then you can just use async methods on the thread pool; await will not have a captured context and so will resume on a thread pool thread. In this case, yes, you'd need to coordinate access to any shared data structures including your TCP client collection.
Note that SignalR will handle all of this for you. :)
Best way to detect a disconnected client after I have called BeginReceive. I've found the call is stuck waiting for a response so this needs to be cleaned up.
This is the half-open problem, which I discuss in detail on my blog. The best way (IMO) to solve this is to periodically send a "noop" keepalive message to each client.
If modifying the protocol isn't possible, then the next-best solution is to just close the connection after a no-communication timeout. This is how HTTP "persistent"/"keep-alive" connections decide to close. There's another possible solution (changing the keepalive packet settings on the socket), but it's not as easy (requires P/Invoke) and has other problems (not always respected by routers, not supported by all OS TCP/IP stacks, etc).
Oh, and SignalR will handle this for you. :)
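For a rough picture of the no-communication timeout, here is a minimal sketch. It uses Python's asyncio rather than .NET, and the helper name `read_or_timeout` and its parameters are assumptions for illustration only.

```python
import asyncio


async def read_or_timeout(reader: asyncio.StreamReader,
                          idle_timeout: float,
                          max_bytes: int = 4096):
    """Return the next chunk, or None if the peer closed or stayed idle too long."""
    try:
        data = await asyncio.wait_for(reader.read(max_bytes), timeout=idle_timeout)
    except asyncio.TimeoutError:
        return None  # nothing heard within the window: treat the peer as gone
    return data or None  # b"" means the peer closed the connection cleanly
```

A session loop would close the connection whenever this returns None. If clients send periodic keepalive messages, `idle_timeout` only needs to be a little longer than the keepalive interval.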
Sending messages to a specific TCP Client. I'm thinking function in custom TCP session class to send a message. Again in an async model, would I need to create a timer based process that inspects a message queue or would I create an event on a TCP Session class that has access to the TcpClient and associated stream? Really interested in opinions here.
If your server can send messages to any client (i.e., it's not just a request/response protocol; any part of the server can send messages to any client without the client requesting an update), then yes, you'll need a proper queue of outgoing requests because you can't (reliably) issue multiple concurrent writes on a socket. I wouldn't have the consumer be timer-based, though; there are async-compatible producer/consumer queues available (like BufferBlock<T> from TPL Dataflow, and it's not that hard to write one if you have async-compatible locks and condition variables).
Oh, and SignalR will handle this for you. :)
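The outgoing-queue idea can be sketched as below. This is Python's asyncio standing in for BufferBlock<T>, and the `Session` class and its `write` callable are hypothetical names invented for this example.

```python
import asyncio


class Session:
    """One connected client with a queue that serializes outgoing messages.

    Create instances from inside the running event loop.
    """

    def __init__(self, write):
        self._write = write  # hypothetical async callable that sends bytes
        self._outgoing = asyncio.Queue()

    def send(self, message: bytes) -> None:
        """Callable from anywhere on the loop; never touches the socket directly."""
        self._outgoing.put_nowait(message)

    def close(self) -> None:
        """Ask the writer to stop once the queued messages have been sent."""
        self._outgoing.put_nowait(None)

    async def run_writer(self) -> None:
        """Single consumer: at most one write is in flight at any time."""
        while True:
            message = await self._outgoing.get()
            if message is None:  # sentinel queued by close()
                return
            await self._write(message)
```

The writer wakes only when a message is queued, so no timer is involved, and messages go out in the order they were queued.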
I'd like to use a single thread for the entire service and use non-blocking principles within it. Is there anything I should be mindful of, especially in the context of ManualResetEvent etc.?
If your entire service is single-threaded, then you shouldn't need any coordination primitives at all. However, if you do use the thread pool instead of syncing back to the main thread (for scalability reasons), then you will need to coordinate. I have a coordination primitives library that you may find useful because its types have both synchronous and asynchronous APIs. This allows, e.g., one method to block on a lock while another method wants to asynchronously block on a lock.
You may have noticed a recurring theme around SignalR. Use it if you possibly can! If you have to write a bare-bones TCP/IP server and can't use SignalR, then take your initial time estimate and triple it. Seriously. Then you can get started down the path of painful TCP with my TCP/IP FAQ blog series.

Server-side Websocket implementations in non-event driven HTTP Server Environments

I am trying to understand implementations/options for server-side Websocket endpoints - particularly in Perl using PSGI/Plack and I have a question: Why are all server-side websocket implementations based around event-driven PSGI servers (Twiggy, Tatsumaki, etc.)?
I get that websocket communication is asynchronous, but a non-event driven PSGI server (say Starman) could spawn an asynchronous listener to handle the websocket side of things. I have seen (but not understood) PHP implementations of Websocket servers, so why can't the same be done with PSGI without having to change the server to an event driven one?
The underlying network logic for dealing with sockets depends on the platform, the OS and the particular software implementation.
The three most common methods are:
polling - the application constantly "asks", in a blocking manner, whether the socket has some data. This method is bad, as it blocks execution of the main thread for as long as it waits for data.
thread per socket - each new connection gets its own thread, and the blocking "asking" happens within that thread, so it won't block the main thread and its logic. This method is bad because creating a thread for each connection is too expensive in memory; each thread can cost around 1 MB of RAM, depending on the OS and other criteria.
async - uses system features to "notify" your process when there is something to read. So you can react once your app is ready (in the case of a single-threaded app) or even react straight away in a separate thread. This method is efficient, as it saves RAM and allows your app to work without waiting or asking for data. It utilises existing functionality that most OSes and platforms provide.
Taking this into account, you can indeed create a single-process, blocking way to deal with socket traffic. But that is not efficient at all, as explained above. That is why fully async models are dominant today, as most languages and platforms support this paradigm.
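To make the "notify" model concrete, here is a minimal single-threaded sketch. It is written in Python with the standard `selectors` module rather than Perl, and the names `serve_ready` and `make_echo` are assumptions for illustration.

```python
import selectors
import socket


def serve_ready(selector: selectors.BaseSelector, timeout: float = 1.0) -> int:
    """Wait for the OS to report readable sockets, then run their callbacks."""
    handled = 0
    for key, _events in selector.select(timeout):
        callback = key.data  # the callback was stored at registration time
        callback(key.fileobj)
        handled += 1
    return handled


def make_echo(selector: selectors.BaseSelector):
    """Build a callback that echoes data back and cleans up on disconnect."""
    def on_readable(conn: socket.socket) -> None:
        data = conn.recv(4096)
        if data:
            conn.sendall(data)
        else:  # empty read: the peer closed the connection
            selector.unregister(conn)
            conn.close()
    return on_readable
```

A server would register each accepted connection with `selector.register(conn, selectors.EVENT_READ, make_echo(selector))` and call `serve_ready()` in a loop. One thread then services every socket, and it only wakes when the OS reports that something is ready.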

Is there any benefit to using windows winsock API functions compared to BSD-style socket functions?

Is there any benefit on Windows to use the WSA winsock functions compared to the BSD-style ones?
The most significant difference is the availability of Asynchronous Event style APIs in Winsock.
With Berkeley sockets in their default blocking mode, each read or write will "block" your application until the network is ready, which could make it unresponsive (unless the network I/O is handled in a different thread).
With an async interface, you can arrange for a callback function to be called as part of the normal windows message loop each time data is received or when the transmit buffer is empty.
Only if you plan to deploy to a legacy platform like Windows 95, or there is something in the winsock API that you absolutely cannot live without and don't want to roll yourself (<-- doubtful though).
If you design around the BSD paradigm, your code can work on other platforms with less porting work. If you assume that your network library will support asynchronous I/O (as Alnitak mentions), you're going to have to do a lot more work if that gets pulled out from under you.
Of course, if you're sure you'll never leave the warm bosom of Microsoft, feel free to go to town.
With respect to Alnitak's answer, I agree - I'd just add that you need not use a message loop to use asynchronous operations on sockets. Using I/O completion ports is a very scalable way to build a high-performance networked application.