TCP, HTTP and the Multi-Threading Sweet Spot - perl

I'm trying to understand the performance numbers I'm getting and how to determine the optimal number of threads.
See the bottom of this post for my results.
I wrote an experimental multi-threaded web client in Perl which downloads a page, grabs the src of each image tag, and downloads the image - discarding the data.
It uses a non-blocking connect with an initial per-file timeout of 10 seconds which doubles after each timeout and retry. It also caches IP addresses, so each thread only has to do a DNS lookup once.
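The connect logic boils down to something like this (a simplified sketch rather than the actual client; the helper name and retry cap are illustrative, and the DNS cache is left out):
    use strict;
    use warnings;
    use IO::Socket::INET;

    # Simplified sketch of a non-blocking connect with a per-file timeout that
    # doubles after each failed attempt.
    sub connect_with_backoff {
        my ( $ip, $port, $timeout, $max_tries ) = @_;
        $timeout   ||= 10;    # initial per-file timeout in seconds
        $max_tries ||= 5;

        for my $try ( 1 .. $max_tries ) {
            my $sock = IO::Socket::INET->new(
                PeerAddr => $ip,
                PeerPort => $port,
                Proto    => 'tcp',
                Blocking => 0,            # connect() returns immediately
            ) or die "socket: $!";

            # Wait until the socket becomes writable (connect finished) or we time out.
            my $wbits = '';
            vec( $wbits, fileno($sock), 1 ) = 1;
            my $nfound = select( undef, $wbits, undef, $timeout );

            if ( $nfound && $sock->connected ) {
                $sock->blocking(1);       # back to blocking I/O for the transfer
                return $sock;
            }

            close $sock;
            $timeout *= 2;                # double the timeout and retry
        }
        return;                           # gave up
    }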
The total amount of data downloaded is 2271122 bytes in 1316 files via a 2.5 Mbit connection from http://hubblesite.org/gallery/album/entire/npp/all/hires/true/ . The thumbnail images are hosted by a company which claims to specialize in low latency for high-bandwidth applications.
Wall times are:
1 Thread takes 4:48 -- 0 timeouts
2 Threads take 2:38 -- 0 timeouts
5 Threads take 2:22 -- 20 timeouts
10 Threads take 2:27 -- 40 timeouts
50 Threads take 2:27 -- 170 timeouts
In the worst case (50 threads), less than 2 seconds of CPU time are consumed by the client.
avg file size: 1.7 kB
avg RTT: 100 ms (as measured by ping)
avg client CPU per image: 1 ms
The fastest overall average download speed is with 5 threads, at about 15 kB/sec.
The server actually does seem to have pretty low latency: it takes only 218 ms to get each image, meaning the server spends only about 18 ms on average processing each request:
t=0 ms     client sends SYN
t=50 ms    server receives SYN
t=50 ms    server sends SYN+ACK
t=100 ms   client connection established / client sends GET
t=150 ms   server receives GET
t=168 ms   server reads file, sends data, calls close
t=218 ms   client receives HTTP headers + complete file in 2 segments (MSS == 1448)
I can see that the per-file average download speed is low because of the small file sizes and the relatively high per-file cost of the connection setup.
What I don't understand is why I see virtually no improvement in performance beyond 2 threads. The server seems to be sufficiently fast, but connections already start timing out at 5 threads.
The timeouts seem to start after about 900 - 1000 successful connections whether it's 5 or 50 threads, which I assume is probably some kind of throttling threshold on the server, but I would expect 10 threads to still be significantly faster than 2.
Am I missing something here?
EDIT-1
Just for comparison's sake I installed the DownThemAll Firefox extension and downloaded the images with it. I set it to 4 simultaneous connections with a 10-second timeout. DTM took about 3 minutes to download all the files and write them to disk, and it also started experiencing timeouts after about 900 connections.
I'm going to run tcpdump to try to get a better picture of what's going on at the TCP level.
I also cleared Firefox's cache and hit reload. 40 seconds to reload the page and all the images. That seemed way too fast - maybe Firefox kept them in a memory cache which wasn't cleared? So I opened Opera and it also took about 40 seconds. I assume they're so much faster because they must be using HTTP/1.1 pipelining?
And the Answer Is!??
So after a little more testing, and writing code to reuse the sockets via pipelining, I found out some interesting info.
When running at 5 threads, the non-pipelined version retrieves the first 1026 images in 77 seconds but takes a further 65 seconds to retrieve the remaining 290 images. This pretty much confirms MattH's theory that my client was tripping the server's SYN flood protection, causing the server to stop responding to my connection attempts for a short period of time. However, that is only part of the problem, since 77 seconds is still very slow for 5 threads to get 1026 images; even without the SYN flood issue it would still take about 99 seconds to retrieve all the files. So based on a little research and some tcpdumps, it seems like the other part of the issue is latency and the connection setup overhead.
Here's where we get back to the issue of finding the "Sweet Spot" or the optimal number of threads. I modified the client to implement HTTP/1.1 Pipelining and found that the optimal number of threads in this case is between 15 and 20. For example:
1 Thread takes 2:37 -- 0 timeouts
2 Threads take 1:22 -- 0 timeouts
5 Threads take 0:34 -- 0 timeouts
10 Threads take 0:20 -- 0 timeouts
11 Threads take 0:19 -- 0 timeouts
15 Threads take 0:16 -- 0 timeouts
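For reference, the per-thread pipelining loop is essentially the following (a stripped-down sketch rather than the actual client; it assumes all images sit on one host and that every response carries a Content-Length header, i.e. no chunked encoding):
    use strict;
    use warnings;
    use IO::Socket::INET;

    # Send a batch of GETs on one connection, then read the responses back in
    # order and discard the bodies. Each thread runs a loop like this over its
    # share of the image URLs.
    sub fetch_pipelined {
        my ( $host, @paths ) = @_;

        my $sock = IO::Socket::INET->new(
            PeerAddr => $host,
            PeerPort => 80,
            Proto    => 'tcp',
        ) or die "connect: $!";

        # 1. Queue all requests on the wire without waiting for responses.
        for my $path (@paths) {
            print $sock "GET $path HTTP/1.1\r\n"
                      . "Host: $host\r\n"
                      . "Connection: keep-alive\r\n\r\n";
        }

        # 2. Read the responses back in the same order.
        for my $path (@paths) {
            my %hdr;
            my $status = <$sock>;                      # e.g. "HTTP/1.1 200 OK"
            while ( defined( my $line = <$sock> ) ) {
                last if $line =~ /^\r?\n$/;            # blank line ends the headers
                $hdr{ lc $1 } = $2 if $line =~ /^([^:]+):\s*(.+?)\r?$/;
            }
            my $left = $hdr{'content-length'} || 0;
            while ( $left > 0 ) {                      # slurp and discard the body
                my $got = read( $sock, my $buf, $left );
                last unless $got;
                $left -= $got;
            }
        }
        close $sock;
    }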
There are four factors which affect where this sweet spot lands: latency/RTT, maximum end-to-end bandwidth, recv buffer size, and the size of the image files being downloaded. See this site for a discussion of how receive buffer size and RTT latency affect available bandwidth.
In addition to the above, average file size affects the maximum per-connection transfer rate. Every time you issue a GET request you create an empty gap in your transfer pipe which lasts one connection RTT. For example, if your Maximum Possible Transfer Rate (recv buffer size / RTT) is 2.5 Mbit and your RTT is 100 ms, then every GET request incurs a minimum 32 kB gap in your pipe. For a large average image size of 320 kB that amounts to a 10% overhead per file, effectively reducing your available bandwidth to roughly 2.27 Mbit. However, for a small average file size of 3.2 kB the overhead jumps to 1000% and available bandwidth is reduced to 232 kbit/second - about 29 kB/s.
So to find the optimal number of threads:
Gap Size = MPTR * RTT
Per-Thread Rate = MPTR / (Gap Size + Avg File Size) * Avg File Size
Optimal Threads = MPTR / Per-Thread Rate
For my above scenario this gives me an optimum thread count of 11 threads, which is extremely close to my real world results.
If the actual connection speed is slower than the theoretical MPTR then it should be used in the calculation instead.
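Plugging the numbers in (a back-of-the-envelope script, not part of the client; substitute the measured connection speed for the rate if it is slower than the theoretical MPTR, as noted above):
    use strict;
    use warnings;

    # Back-of-the-envelope sweet-spot calculation using the numbers from this
    # post. Note: a slower measured rate shrinks the gap and therefore the
    # thread count.
    my $rate_kBps   = 2.5 * 1000 / 8;    # MPTR: 2.5 Mbit/s ~= 312 kB/s
    my $rtt_s       = 0.100;             # 100 ms round trip
    my $avg_file_kB = 1.7;               # average image size

    my $gap_kB          = $rate_kBps * $rtt_s;         # dead air per GET
    my $per_thread_kBps = $rate_kBps / ( $gap_kB + $avg_file_kB ) * $avg_file_kB;
    my $threads         = $rate_kBps / $per_thread_kBps;

    printf "gap %.1f kB, per-thread rate %.1f kB/s, ~%.0f threads to fill the pipe\n",
        $gap_kB, $per_thread_kBps, $threads;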

Please correct me if this summary is incorrect:
Your multi-threaded client will start a thread that connects to the server and issues just one HTTP GET, then that thread closes.
When you say 1, 2, 5, 10, 50 threads, you're just referring to how many concurrent threads you allow; each thread itself only handles one request.
Your client takes between 2 and 5 minutes to download over 1000 images
Firefox and Opera will download an equivalent data set in 40 seconds
I suggest that the server rate-limits HTTP connections, either in the webserver daemon itself, a server-local firewall, or, most likely, a dedicated firewall.
You are actually abusing the web service by not re-using HTTP connections for more than one request, and the timeouts you experience are because your SYN flood is being clamped.
Firefox and Opera are probably using between 4 and 8 connections to download all of the files.
If you redesign your code to re-use the connections you should achieve similar performance.

Related

Multiple concurrent connections with Vertx

I'm trying to build a web application that should be able to handle at least 15000 rps. Some of the optimizations I have done are increasing the worker pool size to 20 and setting the accept backlog to 25000. Since I have set my worker pool size to 20, will this help with the blocking piece of code?
A worker pool size of 20 seems to be the default.
I believe the important question in your case is how long you expect each request to run. On my side, I expect to have thousands of short-lived requests, each with a payload size of about 5-10KB. All of these will be blocking, because of a blocking database driver I use at the moment. I have increased the default worker pool size to 40 and I have explicitly set the number of deployed verticle instances using the following formula:
final int instances = Math.min(Math.max(Runtime.getRuntime().availableProcessors() / 2, 1), 2);
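// i.e. half the available processors, clamped to the range [1, 2]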
A test run of 500 simultaneous clients running for 60 seconds, on a vert.x server doing nothing but blocking calls, produced an average of 6 failed requests out of 11089. My test payload in this case was ~28KB.
Of course, from experience I know that running my software in production would often produce results that I have not anticipated. Thus, the important thing in my case is to have good atomicity rules in place, so that I don't get half-baked or corrupted data in the database.

How does JMeter start sending requests to the server

If Thread: 100, Ramp-up: 1 and Loop count: 1 is the configuration, how will JMeter start sending requests to the server?
Will requests be sent at 1 req/sec, or will all requests be sent to the server at once?
JMeter will send requests as fast as it can, to wit:
It will start all threads (virtual users) you define in the Thread Group within the ramp-up period (in your case, 100 threads in 1 second)
Each thread (virtual user) will start executing the Samplers present in the Thread Group from top to bottom (or according to the Logic Controllers)
When there are no more samplers to execute or loops to iterate, the thread will be shut down
When there are no more active threads left, the JMeter test will end.
With regards to requests per second - it mostly depends on your application response time, i.e.
if you have 100 virtual users and response time is 1 second - you will get 100 requests/second
if you have 100 virtual users and response time is 2 seconds - you will get 50 requests/second
if you have 100 virtual users and response time is 500 milliseconds - you will get 200 requests/second
etc.
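Put differently, once all the threads are up, the expected throughput is just the number of threads divided by the average response time (a rough steady-state estimate that ignores ramp-up, timers and think time):
    use strict;
    use warnings;

    # Rough steady-state throughput: each virtual user completes one request
    # per response time, so throughput ~= threads / response time.
    my $threads         = 100;
    my $response_time_s = 0.5;    # average response time in seconds

    printf "~%.0f requests/second\n", $threads / $response_time_s;    # ~200 req/s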
I would recommend increasing (and decreasing) the load gradually; this way you will be able to correlate the increasing load with increasing throughput/response time/number of errors, etc. Releasing all threads at once will not tell you the full story (unless you're doing a form of spike testing, in which case consider using the Synchronizing Timer).
JMeter's ramp-up period set to 1 means starting all 100 threads in 1 second.
This isn't a recommended setting, as described below:
The ramp-up period tells JMeter how long to take to "ramp-up" to the full number of threads chosen. If 10 threads are used, and the ramp-up period is 100 seconds, then JMeter will take 100 seconds to get all 10 threads up and running. Each thread will start 10 (100/10) seconds after the previous thread was begun. If there are 30 threads and a ramp-up period of 120 seconds, then each successive thread will be delayed by 4 seconds.
Ramp-up needs to be long enough to avoid too large a work-load at the start of a test, and short enough that the last threads start running before the first ones finish (unless one wants that to happen).
Start with Ramp-up = number of threads and adjust up or down as needed.
See also Can i set ramp up period 0 in JMeter?
Bear in mind that with a low ramp-up and many threads, you may be limited by local resources, so your results may be a measurement of client capability rather than of the server.

Why is UDP packet reception seemingly optimized mid-execution?

I'm running a client-server configuration over Ethernet and measuring packet latency at both ends. The client (Windows) is sending packets every 5 ms (confirmed with Wireshark) as it should. Yet the server (embedded Linux) only receives packets at 5 ms intervals for a few seconds, at which point it stops for 300 ms. After this break the latency is only 20 us. After another period of about a few seconds it takes another 300 ms break. This repeats indefinitely (300 ms break, 20 us packet latency burst). It seems as if the server program is being optimized mid-execution to read IO in shorter bursts. Why is this happening?
Disclaimer: I haven't posted the code, as the client and server are small subsets of more complex applications; however, I am willing to factor it out if an obvious answer doesn't present itself.
This is UDP, so there is no handshake or any flow control mechanism. Those 300 ms must be because of work the server is doing while processing the UDP messages received. During those 300 ms the server has surely lost around 60 messages that were not read from the client.
You might want to check that the server does not take more than 5 ms to process each message if it uses one thread to process them. If the server uses multi-threading to process the messages and the processing takes some time, even if it takes 1 ms, you might be in a situation where at some point all threads are competing for resources and they don't finish in time to read the next message. For the problem you are describing, I would bet the server is multi-threaded and you have that problem. I cannot say that with 100% certainty for lack of info, though. But in any case, you want to check the time it takes to process messages, because you might be dealing with real-time requirements.
I spaced out the measurements to 1 in every 1000 packets and now it is behaving itself. I was using printf every 5ms which must've eventually filled the printf tx queue entirely. This then delayed the execution for 300ms. Once printf caught its breath, the program had a queue full of incoming packets and thus was seemingly receiving packets every 20 us.
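The pattern that fixes it looks roughly like this (a Perl sketch rather than the original embedded C, since the real code wasn't posted; the port and the 1-in-1000 figure are just the ones mentioned above):
    use strict;
    use warnings;
    use IO::Socket::INET;
    use Time::HiRes qw(time);

    # Read datagrams in a tight loop and only log every Nth packet, so slow
    # console I/O can't stall the receive path. Port number is illustrative.
    my $sock = IO::Socket::INET->new(
        LocalPort => 5000,
        Proto     => 'udp',
    ) or die "bind: $!";

    my $every = 1000;                 # log 1 packet in every 1000
    my ( $count, $prev ) = ( 0, time() );

    while ( defined $sock->recv( my $payload, 1500 ) ) {
        my $now    = time();
        my $gap_ms = ( $now - $prev ) * 1000;
        $prev = $now;
        printf "pkt %d: %.3f ms since previous packet\n", $count, $gap_ms
            if ++$count % $every == 0;
    }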

What is the overhead traffic of a TCP connection (plus TCP clarifications)?

We have a TCP connection.
Nothing is sent over it; how much traffic (in bytes) is needed each second to keep that connection open?
What is the duration of opening a connection from a client in South America to a server in Northern Europe?
If I have to send a small amount of data (max 256 bytes) every x seconds, what would be the value of x for which it is better to close the connection and reopen it again instead of keeping the connection always open?
I do not expect exact data - estimates will suffice.
1) none.
2) some time. Try it and see. For a rough estimate, ping one end from the other and double it.
3) try it. It depends on bandwidth and, more importantly, latency. These vary over wide ranges. Usually, it's better, speed-wise, to keep connections open. 256 bytes at intervals of seconds? I would keep the connection open, especially over paths with possibly high latency (e.g. intercontinental).
1. According to the TCP/IP standard, nothing. However, depending on the network conditions and any middleboxes (NAT devices, firewalls, etc.), a connection with no data going over it may be dropped. That could be a static timeout (say two minutes, or ten minutes, or an hour), or it could be based on a least-recently-used table in some device.
2. It depends on a lot of factors, and the biggest delay may be from the client's local network rather than the intercontinental connection. However, the surface of the earth between the points is about 40 light-milliseconds, so (without TCP Fast Open) that would be 120 ms for the first data packet to get from the client to the server and 40 ms for the response, 80 ms more than in an active connection.
3. Assuming no broken middleboxes, it is always better to keep the connection open. However, the delay to recover from a "silently dropped" connection may be a lot longer than the time to open a new one; it might be appropriate for the client to manage its own timeout (on the order of a second or so), and open a new connection and retry the last message if it hasn't gotten a response by then. It depends on what you're sending; transactional messages might merit such explicit fast retry more than a remote copy of syslog.
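For point 3, the "manage its own timeout" idea might look something like this (a Perl sketch; the host, port, newline-delimited framing and retry count are all illustrative):
    use strict;
    use warnings;
    use IO::Socket::INET;

    # Client-side fast retry: if no reply arrives within ~1 second, assume the
    # connection was silently dropped, open a new one and resend the last message.
    my ( $host, $port ) = ( 'example.com', 7000 );

    sub send_with_retry {
        my ( $sock_ref, $msg, $timeout ) = @_;
        $timeout ||= 1;

        for my $attempt ( 1 .. 3 ) {
            $$sock_ref ||= IO::Socket::INET->new(
                PeerAddr => $host,
                PeerPort => $port,
                Proto    => 'tcp',
                Timeout  => 5,
            ) or next;                     # could not (re)connect, try again

            print { $$sock_ref } "$msg\n";

            # Wait up to $timeout seconds for the reply.
            my $rbits = '';
            vec( $rbits, fileno($$sock_ref), 1 ) = 1;
            if ( select( $rbits, undef, undef, $timeout ) ) {
                my $reply = readline($$sock_ref);
                return $reply if defined $reply;
            }

            # No reply: drop the (possibly dead) connection and retry on a new one.
            close $$sock_ref;
            $$sock_ref = undef;
        }
        return;                            # all attempts failed
    }

    # Keep one long-lived connection; the helper replaces it only when needed.
    my $sock;
    my $reply = send_with_retry( \$sock, "hello" );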

Saving bandwidth over GPRS and TCP

Hello, I have made a program for my old Windows Mobile phone to send GPS data, temperature, etc. every 5 seconds, just for experimental reasons, to create a fleet management system.
I noticed that within one hour 350 kB were consumed although I sent only 20 kB of data...
As I don't have deep knowledge of networks, how much does a TCP connection cost in bytes?
Maybe I should keep the socket alive, because I close and open it every 5 seconds. Would that save bytes?
Also does MTU matter here?
Any other idea to reduce the overhead?
thank you
Let's do some math here.
Every 5 seconds is 720 connections per hour plus data. 20K / 720 is about 28 bytes of payload (your GPS data) for each connection.
IP and TCP headers alone are 48 bytes in addition to whatever data is being sent.
3-way handshake connection: 3 packets (2 out, 1 in) == 96 bytes out and 48 bytes in
Outbound Data-packet: 48+28 bytes == 76 bytes (out)
Inbound Ack: 48 bytes (in)
Close: 48 bytes (out)
Final Ack: 48 bytes (in)
Total out per connection: 220
Total in per connection: 144
Total data send/received per connection: 220+144 = 364
Total data usage in one hour = 364 * 720 = 262K
So I'm in the ballpark of your data usage estimates.
If you're looking to reduce bandwidth usage, here are three ideas:
Scale back on your update rate.
Don't tear down the socket connection each time. Just keep it open.
Given your GPS coordinates are periodically updated, you could consider using UDP instead of TCP. There's potential for packet loss, but given you're retransmitting fresher data every 5 seconds anyway, an update getting lost isn't worth the bandwidth to retransmit. IP and UDP headers combined are only 28 bytes with no "connection" overhead.
UPDATE
When I originally posted this, I erroneously misunderstood the connection close to be a single exchange of FIN packets between client and server. In practice, the client sends a FIN as part of initiating the close. The server then ACKs that FIN. Then the server sends its own FIN, which is ACK'd by the client. In other words, an additional 96 bytes per connection. Redoing our math:
Total data sent/received per connection = (220 + 48) + (144 + 48) = 460
Total data usage in one hour = 460 * 720 = 331K
So my revised estimate of 331KB in one hour is a bit closer to what the OP saw.
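Redoing that arithmetic in a small script (a sketch that just replays the figures above - 48 bytes of IP+TCP headers, a 28-byte payload, 720 connections per hour - plus the UDP comparison from idea 3):
    use strict;
    use warnings;

    # Replay the per-connection byte math above: 48 bytes of IP+TCP headers per
    # packet, a 28-byte payload, one connection every 5 seconds (720 per hour).
    my $hdr      = 48;
    my $payload  = 28;
    my $per_hour = 720;

    my $out = 2 * $hdr               # SYN + final ACK of the 3-way handshake
            + ( $hdr + $payload )    # the data packet
            + $hdr                   # client FIN
            + $hdr;                  # client ACK of the server's FIN
    my $in  = $hdr                   # SYN-ACK
            + $hdr                   # ACK of the data packet
            + $hdr                   # server ACK of the client's FIN
            + $hdr;                  # server's own FIN

    printf "TCP: %d bytes per connection, ~%d KB per hour\n",
        $out + $in, ( $out + $in ) * $per_hour / 1000;             # 460, ~331 KB

    # For comparison, one UDP datagram per update (28-byte IP+UDP header, idea 3):
    printf "UDP: %d bytes per update, ~%d KB per hour\n",
        28 + $payload, ( 28 + $payload ) * $per_hour / 1000;       # 56, ~40 KB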