I am trying to implement a real-time application which involves IPC across different modules. The modules do some data-intensive processing. I am using a message queue (ActiveMQ) as the IPC backbone in the prototype, which is easy (considering I am a total IPC newbie), but it is very, very slow.
Here is my situation:
I have isolated the IPC part so that I can swap it for another mechanism in the future.
I have 3 weeks to implement another, faster version. ;-(
The IPC should be fast, but also comparatively easy to pick up.
I have been looking into different IPC approaches: sockets, pipes, shared memory. However, I have no experience with IPC, and I absolutely cannot afford to fail this demo in 3 weeks... Which IPC would be the safest to start with?
Thanks.
Lily
You'll get the best results with a shared memory solution; a minimal sketch follows the numbers below.
I recently ran into the same IPC benchmarking exercise, and I think my results will be useful to anyone who wants to compare IPC performance.
Pipe benchmark:
Message size: 128
Message count: 1000000
Total duration: 27367.454 ms
Average duration: 27.319 us
Minimum duration: 5.888 us
Maximum duration: 15763.712 us
Standard deviation: 26.664 us
Message rate: 36539 msg/s
FIFOs (named pipes) benchmark:
Message size: 128
Message count: 1000000
Total duration: 38100.093 ms
Average duration: 38.025 us
Minimum duration: 6.656 us
Maximum duration: 27415.040 us
Standard deviation: 91.614 us
Message rate: 26246 msg/s
Message Queues benchmark:
Message size: 128
Message count: 1000000
Total duration: 14723.159 ms
Average duration: 14.675 us
Minimum duration: 3.840 us
Maximum duration: 17437.184 us
Standard deviation: 53.615 us
Message rate: 67920 msg/s
Shared Memory benchmark:
Message size: 128
Message count: 1000000
Total duration: 261.650 ms
Average duration: 0.238 us
Minimum duration: 0.000 us
Maximum duration: 10092.032 us
Standard deviation: 22.095 us
Message rate: 3821893 msg/s
TCP sockets benchmark:
Message size: 128
Message count: 1000000
Total duration: 44477.257 ms
Average duration: 44.391 us
Minimum duration: 11.520 us
Maximum duration: 15863.296 us
Standard deviation: 44.905 us
Message rate: 22483 msg/s
Unix domain sockets benchmark:
Message size: 128
Message count: 1000000
Total duration: 24579.846 ms
Average duration: 24.531 us
Minimum duration: 2.560 us
Maximum duration: 15932.928 us
Standard deviation: 37.854 us
Message rate: 40683 msg/s
ZeroMQ benchmark:
Message size: 128
Message count: 1000000
Total duration: 64872.327 ms
Average duration: 64.808 us
Minimum duration: 23.552 us
Maximum duration: 16443.392 us
Standard deviation: 133.483 us
Message rate: 15414 msg/s
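For reference, here is roughly what a shared memory setup looks like in code. This is only a bare-bones sketch, not the benchmark harness above: plain POSIX shm_open/mmap with a named semaphore for signalling, and the names /demo_shm and /demo_sem (plus the 128-byte size) are made up for illustration.

/* Minimal POSIX shared memory writer sketch (not the benchmark code above).
 * The names "/demo_shm" and "/demo_sem" are made up for illustration. */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_SIZE 128

int main(void)
{
    /* Create (or open) the shared memory object and size it. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) < 0) { perror("ftruncate"); return 1; }

    char *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* A named semaphore tells the reader that fresh data is available. */
    sem_t *ready = sem_open("/demo_sem", O_CREAT, 0600, 0);
    if (ready == SEM_FAILED) { perror("sem_open"); return 1; }

    /* "Send" a 128-byte message: copy it into the region, then post. */
    memset(buf, 0, SHM_SIZE);
    strncpy(buf, "hello via shared memory", SHM_SIZE - 1);
    sem_post(ready);

    /* The reader side would mmap the same object, sem_wait(ready),
     * and then read buf directly -- no copy through the kernel. */
    munmap(buf, SHM_SIZE);
    close(fd);
    return 0;
}

Compile with something like gcc writer.c -lrt -pthread; the reader process maps the same /demo_shm object, does sem_wait on /demo_sem, and reads the buffer directly, which is why no data has to be copied through the kernel per message.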
Been facing a similar question myself.
I've found the following pages helpful - IPC performance: Named Pipe vs Socket (in particular) and Sockets vs named pipes for local IPC on Windows?.
It sounds like the consensus is that shared memory is the way to go if you're really concerned about performance, but if the current system you have is a message queue it might be a rather... different structure. A socket and/or named pipe might be easier to implement, and if either meets your specs then you're done there.
On Windows, you can use WM_COPYDATA, a special kind of shared memory-based IPC. This is an old, but simple technique: "Process A" sends a message, which contains a pointer to some data in its memory, and waits until "Process B" processes (sorry) the message, e.g. creates a local copy of the data. This method is pretty fast and works on Windows 8 Developer Preview, too (see my benchmark). Any kind of data can be transported this way, by serializing it on the sender, and deserializing it on the receiver side. It's also simple to implement sender and receiver message queues, to make the communication asynchronous.
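To make that concrete, the sending side can be as small as the sketch below. This is only an illustration: the window title "MyReceiverWindow" is made up, and a real receiver would handle WM_COPYDATA in its window procedure and copy the data out before returning.

/* Minimal WM_COPYDATA sender sketch (Win32). The receiver window title
 * "MyReceiverWindow" is made up for illustration. */
#include <windows.h>
#include <string.h>

int main(void)
{
    const char *payload = "hello via WM_COPYDATA";

    /* Find the receiver's top-level window (any way of obtaining its HWND works). */
    HWND target = FindWindowA(NULL, "MyReceiverWindow");
    if (target == NULL)
        return 1;

    COPYDATASTRUCT cds;
    cds.dwData = 1;                               /* application-defined message id */
    cds.cbData = (DWORD)(strlen(payload) + 1);    /* size of the data in bytes      */
    cds.lpData = (PVOID)payload;                  /* pointer into *our* memory      */

    /* SendMessage blocks until the receiver has processed (copied) the data. */
    SendMessageA(target, WM_COPYDATA, (WPARAM)0 /* ideally the sender's HWND */, (LPARAM)&cds);
    return 0;
}

On the receiving side, lParam points at the COPYDATASTRUCT; the data must be treated as read-only and copied before the handler returns.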
You may check out this blog post https://publicwork.wordpress.com/2016/07/17/endurox-vs-zeromq/
Basically it compares Enduro/X, which is built on POSIX queues (kernel IPC queues), with ZeroMQ, which can deliver messages simultaneously over several different transport classes, including tcp:// (network sockets), ipc://, inproc://, and pgm:// and epgm:// for multicast.
From the charts you can see that at some point, with larger data packets, Enduro/X running on queues wins over the sockets.
Both systems perform well at ~400,000 messages per second, but with 5 KB messages, the kernel queues do better.
(image source: https://publicwork.wordpress.com/2016/07/17/endurox-vs-zeromq/)
UPDATE:
Another update, in answer to the comment below: I re-ran the test with ZeroMQ on ipc:// as well; see the picture:
As we can see, ZeroMQ's ipc:// does better, but again within a certain range Enduro/X shows better results, and then ZeroMQ takes over again.
So I would say that the choice of IPC depends on the work you plan to do.
Note that ZeroMQ's ipc:// runs on POSIX pipes, while Enduro/X runs on POSIX queues.
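If you just want to try the raw kernel queues that this kind of framework sits on, the POSIX mq_* API is small. A minimal sketch follows; the queue name /demo_mq and the attributes are made up for illustration, and this is not Enduro/X's own API.

/* Minimal POSIX message queue sketch (mq_* API, not Enduro/X's own API).
 * The queue name "/demo_mq" and attributes are made up for illustration. */
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct mq_attr attr = {0};
    attr.mq_maxmsg  = 10;    /* queue depth                */
    attr.mq_msgsize = 128;   /* matches the 128-byte tests */

    mqd_t q = mq_open("/demo_mq", O_CREAT | O_RDWR, 0600, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    const char msg[128] = "hello via POSIX queue";
    if (mq_send(q, msg, sizeof(msg), 0) < 0) { perror("mq_send"); return 1; }

    char buf[128];
    unsigned prio;
    ssize_t n = mq_receive(q, buf, sizeof(buf), &prio);
    if (n < 0) { perror("mq_receive"); return 1; }
    printf("got %zd bytes: %s\n", n, buf);

    mq_close(q);
    mq_unlink("/demo_mq");
    return 0;
}

On Linux this typically needs -lrt at link time, and the queue usually shows up under /dev/mqueue once created.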
I've read many Stack Overflow questions similar to this one, but I don't think any of the answers really satisfied my curiosity. I have an example below on which I would like some clarification.
Suppose the client is blocking on socket.recv(1024):
socket.recv(1024)
print("Received")
Also, suppose I have a server sending 600 bytes to the client. Let us assume that these 600 bytes are broken into 4 small packets (of 150 bytes each) and sent over the network. Now suppose the packets reach the client at different times, 0.0001 seconds apart (e.g. one packet arrives at 12.00.0001 pm, another packet arrives at 12.00.0002 pm, and so on...).
How does socket.recv(1024) decide when to return execution to the program and allow the print() function to execute? Does it return execution immediately after receiving the 1st packet of 150 bytes? Or does it wait for some arbitrary amount of time (e.g. 1 second, by which time all packets would have arrived)? If so, how long is this "arbitrary amount of time"? Who determines it?
Well, that will depend on many things, including the OS and the speed of the network interface. For a 100 gigabit interface, 100 µs is "forever," but for a 10 Mbit interface, you can't even transmit the packets that fast. So I won't pay too much attention to the exact timing you specified.
Back in the day when TCP was being designed, networks were slow and CPUs were weak. Among the flags in the TCP header is the "Push" flag, which signals that the payload should be immediately delivered to the application. So if we hop into the Wayback machine, the answer would have been something like: it depends on whether or not the PSH flag is set in the packets. However, there is generally no user-space API to control whether or not the flag is set. Generally what would happen is that for a single write that gets broken into several packets, the final packet would have the PSH flag set. So the answer for a slow network and a weakling CPU might be that if it was a single write, the application would likely receive the 600 bytes in one go. You might then think that using four separate writes would result in four separate reads of 150 bytes, but after the introduction of Nagle's algorithm the data from the second to fourth writes might well be sent in a single packet unless Nagle's algorithm was disabled with the TCP_NODELAY socket option, since Nagle's algorithm will wait for the ACK of the first packet before sending anything less than a full frame.
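For reference, disabling Nagle is a one-line socket option; here is a minimal sketch (error handling omitted):

/* Sketch: disable Nagle's algorithm on a TCP socket with TCP_NODELAY. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int make_nodelay_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    /* Small writes are now sent immediately instead of being coalesced
     * while waiting for the ACK of previously unacknowledged data. */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    return fd;
}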
If we return from our trip in the Wayback machine to the modern age where 100 Gigabit interfaces and 24-core machines are common, our problems are very different and you will have a hard time finding an explicit check for the PSH flag being set in the Linux kernel. What is driving the design of the receive side is that networks are getting far faster while the packet size/MTU has been largely fixed and CPU speed is flatlining but cores are abundant. Reducing per-packet overhead (including hardware interrupts) and distributing the packets efficiently across multiple cores is imperative. At the same time it is imperative to get the data from that 100+ Gigabit firehose up to the application ASAP. One hundred microseconds of data on such a NIC is a considerable amount of data to be holding onto for no reason.
I think one of the reasons that there are so many questions of the form "What the heck does receive do?" is that it can be difficult to wrap your head around what is a thoroughly asynchronous process, whereas the send side has a more familiar control flow where it is much easier to trace the flow of packets to the NIC and where we are in full control of when a packet will be sent. On the receive side packets just arrive when they want to.
Let's assume that a TCP connection has been set up and is idle, there is no missing or unacknowledged data, the reader is blocked on recv, and the reader is running a fresh version of the Linux kernel. And then a writer writes 150 bytes to the socket and the 150 bytes gets transmitted in a single packet. On arrival at the NIC, the packet will be copied by DMA into a ring buffer, and, if interrupts are enabled, it will raise a hardware interrupt to let the driver know there is fresh data in the ring buffer. The driver, which desires to return from the hardware interrupt in as few cycles as possible, disables hardware interrupts, starts a soft IRQ poll loop if necessary, and returns from the interrupt. Incoming data from the NIC will now be processed in the poll loop until there is no more data to be read from the NIC, at which point it will re-enable the hardware interrupt. The general purpose of this design is to reduce the hardware interrupt rate from a high speed NIC.
Now here is where things get a little weird, especially if you have been looking at nice clean diagrams of the OSI model where higher levels of the stack fit cleanly on top of each other. Oh no, my friend, the real world is far more complicated than that. That NIC that you might have been thinking of as a straightforward layer 2 device, for example, knows how to direct packets from the same TCP flow to the same CPU/ring buffer. It also knows how to coalesce adjacent TCP packets into larger packets (although this capability is not used by Linux and is instead done in software). If you have ever looked at a network capture and seen a jumbo frame and scratched your head because you sure thought the MTU was 1500, this is because this processing is at such a low level it occurs before netfilter can get its hands on the packet. This packet coalescing is part of a capability known as receive offloading; in particular, let's assume that your NIC/driver has generic receive offload (GRO) enabled (which is not the only possible flavor of receive offloading), the purpose of which is to reduce the per-packet overhead from your firehose NIC by reducing the number of packets that flow through the system.
So what happens next is that the poll loop keeps pulling packets off of the ring buffer (as long as more data is coming in) and handing it off to GRO to consolidate if it can, and then it gets handed off to the protocol layer. As best I know, the Linux TCP/IP stack is just trying to get the data up to the application as quickly as it can, so I think your question boils down to "Will GRO do any consolidation on my 4 packets, and are there any knobs I can turn that affect this?"
Well, the first thing you can do is disable any form of receive offloading (e.g. via ethtool), which I think should get you 4 reads of 150 bytes for 4 packets arriving like this in order, but I'm prepared to be told I have overlooked another reason why the Linux TCP/IP stack won't send such data straight to the application if the application is blocked on a read as in your example.
The other knob you have if GRO is enabled is GRO_FLUSH_TIMEOUT which is a per NIC timeout in nanoseconds which can be (and I think defaults to) 0. If it is 0, I think your packets may get consolidated (there are many details here including the value of MAX_GRO_SKBS) if they arrive while the soft IRQ poll loop for the NIC is still active, which in turn depends on many things unrelated to your four packets in your TCP flow. If non-zero, they may get consolidated if they arrive within GRO_FLUSH_TIMEOUT nanoseconds, though to be honest I don't know if this interval could span more than one instantiation of a poll loop for the NIC.
There is a nice writeup on the Linux kernel receive side here which can help guide you through the implementation.
A normal blocking receive on a TCP connection returns as soon as there is at least one byte to return to the caller. If the caller would like to receive more bytes, they can simply call the receive function again.
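In other words, if the application needs exactly N bytes it has to loop over the receive call itself. Here is a minimal sketch of that loop in C (the same idea applies to the Python recv in the question):

/* Sketch: keep calling recv() until exactly `len` bytes have arrived
 * (or the peer closes the connection / an error occurs). */
#include <sys/socket.h>
#include <sys/types.h>

ssize_t recv_exactly(int fd, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0);
        if (n == 0)            /* peer closed the connection             */
            break;
        if (n < 0)             /* error (caller can inspect errno)       */
            return -1;
        got += (size_t)n;      /* recv may return fewer bytes than asked */
    }
    return (ssize_t)got;
}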
I keep hearing people say that to get better throughput you should create multiple socket connections.
But my understanding is that however many TCP sockets you open between two endpoints, the IP layer is still one, so I'm not sure where this additional throughput comes from.
The additional throughput comes from increasing the amount of data sent in the first couple of round-trip times (RTTs). TCP can send only IW (initial window) packets in the first RTT. The amount then doubles each RTT (slow start). If you open 4 connections you can send 4 * IW packets in the first RTT, so the early throughput is quadrupled.
Let's say that a client requests a file that requires IW+1 packets. Opening two connections can complete the sending in one RTT rather than two RTTs.
HOWEVER, this comes at a price. The initial packets are sent as a burst, which can cause severe congestion and packet loss.
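To put rough numbers on it (assuming the modern default initial window of 10 segments from RFC 6928 and a 1460-byte MSS): one connection can deliver roughly 14.6 kB of payload in the first RTT, while four connections can deliver roughly 58 kB before any slow-start doubling has even happened. Older stacks with an IW of 2-4 segments see a proportionally smaller, but still multiplicative, benefit.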
I'm running a client-server configuration over Ethernet and measuring packet latency at both ends. The client (Windows) is sending packets every 5 ms (confirmed with Wireshark) as it should. Yet the server (embedded Linux) only receives packets at 5 ms intervals for a few seconds, at which point it stops for 300 ms. After this break the latency is only 20 us. After another period of a few seconds it takes another 300 ms break. This repeats indefinitely (300 ms break, 20 us packet latency burst). It seems as if the server program is being optimized mid-execution to read I/O in shorter bursts. Why is this happening?
Disclaimer: I haven't posted the code, as the client and server are small subsets of more complex applications, however, I am willing to factor it out if an obvious answer doesn't present itself.
This is UDP, so there is no handshake or any flow control mechanism. Those 300 ms must be due to work the server is doing while processing the received UDP messages. During those 300 ms the server has surely lost around 60 messages that were not read from the client.
You probably want to check that the server does not take more than 5 ms to process each message if it uses a single thread to process them. If the server uses multi-threading to process the messages and the processing takes some time, even just 1 ms, you might be in a situation where at some point all threads are competing for resources and they don't finish in time to read the next message. For the problem you are describing I would bet the server is multithreaded and you have that problem. I cannot be 100% sure for lack of info, though. But in any case, you want to check the time it takes to process messages, because you might be dealing with real-time requirements.
I spaced out the measurements to 1 in every 1000 packets and now it is behaving itself. I was using printf every 5ms which must've eventually filled the printf tx queue entirely. This then delayed the execution for 300ms. Once printf caught its breath, the program had a queue full of incoming packets and thus was seemingly receiving packets every 20 us.
Is this the main difference?
Interrupt coalescing (ethtool -C eth1 rx-usecs 0) - coalesces received packets from different connections, i.e. increases bandwidth but increases receive latency
Nagle's algorithm (socket option TCP_NODELAY) - coalesces sent packets from the same connection, i.e. increases bandwidth but increases send latency
Interrupt coalescing concerns the network driver: the idea is to avoid invoking the interrupt handler anew every time a network packet shows up. Instead, after receiving a packet, the NIC waits until M packets are received or until N microseconds have passed before generating an interrupt. Then the driver can process many packets at once. (Otherwise, with modern gigabit and 10-gigabit adapters, the processor would need to field hundreds of thousands or millions of interrupts per second, which can prevent the system from being able to accomplish much else.) As your link points out, there is (or at least may be) a cost of additional latency since the OS doesn't start processing a received packet at the earliest possible instant.
Nagle's algorithm is focused on reducing the number of packets sent by coalescing payload data from multiple packets into one. The classic example is a telnet session. Without Nagle, every time you press a key, the system has to create an entire new packet (min 64 bytes on Ethernet) to send one byte.
So the intent of interrupt coalescing is to support greater bandwidth utilization, while the intent of Nagle's algorithm is actually to produce lower bandwidth (by sending fewer packets).
I'm trying to understand the performance numbers I'm getting and how to determine the optimal number of threads.
See the bottom of this post for my results
I wrote an experimental multi-threaded web client in perl which downloads a page, grabs the source for each image tag and downloads the image - discarding the data.
It uses a non-blocking connect with an initial per file timeout of 10 seconds which doubles after each timeout and retry. It also caches IP addresses so each thread only has to do a DNS lookup once.
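For anyone curious what the non-blocking connect with a timeout looks like at the socket level, here is a rough C sketch of just that piece (the Perl client's DNS caching and the doubling retry loop are left to the caller; the function and its name are illustrative, not the actual client code):

/* Sketch of a non-blocking connect with a timeout. Returns a connected fd,
 * or -1 on error/timeout so the caller can retry with a doubled timeout. */
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

int connect_with_timeout(const struct sockaddr_in *addr, int timeout_sec)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Put the socket in non-blocking mode so connect() returns immediately. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0 &&
        errno != EINPROGRESS) {
        close(fd);
        return -1;
    }

    /* Wait until the socket becomes writable (connected) or we time out. */
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = { timeout_sec, 0 };
    if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0) {
        close(fd);          /* timed out: the caller can retry with 2x timeout */
        return -1;
    }

    /* Check whether the connect actually succeeded. */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0) {
        close(fd);
        return -1;
    }
    return fd;
}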
The total amount of data downloaded is 2271122 bytes in 1316 files via 2.5Mbit connection from http://hubblesite.org/gallery/album/entire/npp/all/hires/true/ . The thumbnail images are hosted by a company which claims to specialize in low latency for high bandwidth applications.
Wall times are:
1 Thread takes 4:48 -- 0 timeouts
2 Threads takes 2:38 -- 0 timeouts
5 Threads takes 2:22 -- 20 timeouts
10 Threads take 2:27 -- 40 timeouts
50 Threads take 2:27 -- 170 timeouts
In the worst case ( 50 threads ) less than 2 seconds of CPU time are consumed by the client.
avg file size 1.7k
avg rtt 100 ms ( as measured by ping )
avg cli cpu/img 1 ms
The fastest average download speed is 5 threads at about 15 KB / sec overall.
The server actually does seem to have pretty low latency, as it takes only 218 ms to get each image, meaning it takes only 18 ms on average for the server to process each request (timeline in ms):
0 cli sends syn
50 srv rcvs syn
50 srv sends syn + ack
100 cli conn established / cli sends get
150 srv recv's get
168 srv reads file, sends data , calls close
218 cli recv HTTP headers + complete file in 2 segments MSS == 1448
I can see that the per file average download speed is low because of the small file sizes and the relatively high cost per file of the connection setup.
What I don't understand is why I see virtually no improvement in performance beyond 2 threads. The server seems to be sufficiently fast, but already starts timing out connections at 5 threads.
The timeouts seem to start after about 900 - 1000 successful connections whether it's 5 or 50 threads, which I assume is probably some kind of throttling threshold on the server, but I would expect 10 threads to still be significantly faster than 2.
Am I missing something here?
EDIT-1
Just for comparison's sake I installed the DownThemAll Firefox extension and downloaded the images using it. I set it to 4 simultaneous connections with a 10-second timeout. DTM took about 3 minutes to download all the files and write them to disk, and it also started experiencing timeouts after about 900 connections.
I'm going to run tcpdump to try and get a better picture what's going on at the tcp protocol level.
I also cleared Firefox's cache and hit reload. 40 Seconds to reload the page and all the images. That seemed way too fast - maybe Firefox kept them in a memory cache which wasn't cleared? So I opened Opera and it also took about 40 seconds. I assume they're so much faster because they must be using HTTP/1.1 pipelining?
And the Answer Is!??
So after a little more testing and writing code to reuse the sockets via pipelining I found out some interesting info.
When running at 5 threads the non-pipelined version retrieves the first 1026 images in 77 seconds but takes a further 65 seconds to retrieve the remaining 290 images. This pretty much confirms MattH's theory about my client getting hit by a SYN flood event causing the server to stop responding to my connection attempts for a short period of time. However, that is only part of the problem, since 77 seconds is still very slow for 5 threads to get 1026 images; if you remove the SYN flood issue it would still take about 99 seconds to retrieve all the files. So based on a little research and some tcpdumps, it seems like the other part of the issue is latency and the connection setup overhead.
Here's where we get back to the issue of finding the "Sweet Spot" or the optimal number of threads. I modified the client to implement HTTP/1.1 Pipelining and found that the optimal number of threads in this case is between 15 and 20. For example:
1 Thread takes 2:37 -- 0 timeouts
2 Threads takes 1:22 -- 0 timeouts
5 Threads takes 0:34 -- 0 timeouts
10 Threads take 0:20 -- 0 timeouts
11 Threads take 0:19 -- 0 timeouts
15 Threads take 0:16 -- 0 timeouts
There are four factors which affect this: latency/RTT, maximum end-to-end bandwidth, receive buffer size, and the size of the image files being downloaded. See this site for a discussion of how receive buffer size and RTT latency affect available bandwidth.
In addition to the above, average file size affects the maximum per-connection transfer rate. Every time you issue a GET request you create an empty gap in your transfer pipe which is the size of the connection RTT. For example, if your Maximum Possible Transfer Rate (recv buffer size / RTT) is 2.5 Mbit and your RTT is 100 ms, then every GET request incurs a minimum 32 kB gap in your pipe. For a large average image size of 320 kB that amounts to a 10% overhead per file, effectively reducing your available bandwidth to 2.25 Mbit. However, for a small average file size of 3.2 kB the overhead jumps to 1000% and available bandwidth is reduced to 232 kbit/second - about 29 kB/s.
So to find the optimal number of threads:
Gap Size = MPTR * RTT
Optimal threads = MPTR / (MPTR / (Gap Size + AVG file size) * AVG file size)
For my above scenario this gives me an optimum thread count of 11 threads, which is extremely close to my real world results.
If the actual connection speed is slower than the theoretical MPTR then it should be used in the calculation instead.
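The formula is easy to script if you want to plug in your own numbers; below is a small, hedged transcription in C. The inputs in main() are placeholders taken from the small-file example above (2.5 Mbit/s, 100 ms RTT, 3.2 kB average file), and as noted, the measured connection speed should replace the theoretical MPTR when it is lower.

/* Rough transcription of the thread-count formula above.
 * All inputs are placeholders -- substitute your own measurements,
 * and use the measured link speed instead of MPTR if it is lower. */
#include <stdio.h>

static double optimal_threads(double mptr_bytes_per_s, double rtt_s,
                              double avg_file_bytes)
{
    double gap = mptr_bytes_per_s * rtt_s;   /* Gap Size = MPTR * RTT */
    /* Effective per-connection rate, then how many connections fill the pipe. */
    double per_conn = mptr_bytes_per_s / (gap + avg_file_bytes) * avg_file_bytes;
    return mptr_bytes_per_s / per_conn;
}

int main(void)
{
    /* Placeholder inputs: 2.5 Mbit/s link, 100 ms RTT, 3.2 kB average file. */
    printf("optimal threads: %.1f\n",
           optimal_threads(2.5e6 / 8.0, 0.100, 3200.0));
    return 0;
}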
Please correct me if this summary is incorrect:
Your multi-threaded client will start a thread that connects to the server and issues just one HTTP GET, then that thread closes.
When you say 1, 2, 5, 10, 50 threads, you're just referring to how many concurrent threads you allow; each thread itself only handles one request.
Your client takes between 2 and 5 minutes to download over 1000 images
Firefox and Opera will download an equivalent data set in 40 seconds
I suggest that the server rate-limits HTTP connections, either via the web server daemon itself, a server-local firewall, or most likely a dedicated firewall.
You are actually abusing the web service by not re-using the HTTP connections for more than one request, and the timeouts you experience are because your SYN flood is being clamped.
Firefox and Opera are probably using between 4 and 8 connections to download all of the files.
If you redesign your code to re-use the connections you should achieve similar performance.