Can a slow consumer kill or slow down a tickerplant?
I have a tickerplant with 3 real-time subscribers, one of which is slow.
q).z.W
7 | `long$()
8 | 969393 198 198 197 197 198 196 199 197 196 143 198 196 196 197 197 198 19..
9 | 199 198 198 143 197 199 197 197 197 197 199 196 199 145 196 198 198 198 1..
10| 198 196 198 144 199 198 198 198 196 197 196 199 198 143 199 198 197 198 1..
q)count each .z.W
7 | 0
8 | 85547
9 | 77931
10| 0
q)count each .z.W
7 | 0
8 | 191552
9 | 0
10| 0
Can the slow consumer get the tickerplant killed or slow it down in a production kdb+ system receiving billions of records?
Yes, a slow consumer can kill a tickerplant. When a subscriber cannot keep up, messages for it build up in an output queue inside the tickerplant, and that queue consumes memory. If this goes on long enough, the tickerplant (or the machine it runs on) can run out of memory and abort.
Ideally a production tickerplant has some form of monitoring that periodically checks the output queues. If a queue grows beyond a certain threshold, the tickerplant should halt the subscription (temporarily remove the handle from the .u.w subscription dictionary and allow the queue to drain) and resume if and when the subscriber catches up. Alternatively, be more aggressive and close the subscriber's connection entirely (hclose), which wipes the output queue.
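The suspend/resume idea can be sketched as simple hysteresis logic. This is a language-neutral illustration written in Python, not tickerplant code: the `QueueMonitor` name, the two thresholds and the input dictionary are all hypothetical. In q, the per-handle byte counts would come from something like `sum each .z.W`, and "suspend" would correspond to removing the handle from .u.w.

```python
from dataclasses import dataclass, field


@dataclass
class QueueMonitor:
    """Decide which subscriber handles to suspend or resume (hypothetical helper)."""
    high_water: int   # queued bytes at which a subscriber is suspended
    low_water: int    # queued bytes at which a suspended subscriber resumes
    suspended: set = field(default_factory=set)

    def check(self, queued_bytes):
        """queued_bytes maps handle -> total bytes waiting in its output queue."""
        actions = {}
        for handle, size in queued_bytes.items():
            if handle not in self.suspended and size >= self.high_water:
                self.suspended.add(handle)
                actions[handle] = "suspend"
            elif handle in self.suspended and size <= self.low_water:
                self.suspended.discard(handle)
                actions[handle] = "resume"
        return actions


# Thresholds are made-up values for illustration only.
monitor = QueueMonitor(high_water=100_000_000, low_water=1_000_000)
first = monitor.check({7: 0, 8: 150_000_000, 9: 20_000_000})   # handle 8 is over the limit
second = monitor.check({7: 0, 8: 40_000_000, 9: 0})            # still draining, no change
third = monitor.check({7: 0, 8: 0, 9: 0})                      # drained, resume
print(first, second, third)
```

Using two thresholds rather than one avoids a subscriber flapping between suspended and resumed while its queue hovers around a single limit.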
If your system experiences a lot of queueing, the tickerplant might also need a daily garbage collection (say at EOD) to make sure that output queues haven't caused it to hold unused memory. (Or maybe you want it to keep the unused memory so that it doesn't have to re-request memory from the OS the next time there's a big queue.)
Terry's answer is 100% correct. I just want to expand on the need for garbage collection in a TP with slow subscribers.
It's important that this collection is done with an explicit .Q.gc[] call rather than by relying on immediate collection mode (\g 1). Immediate collection is only triggered when all references to an object are dropped and the object is returned to the heap; collection then happens only if the object is larger than 64MB.
During normal execution in a TP, the published data is never a referenced object; it is the parameter of an incoming message that is then published out. Because of this, no object is de-referenced, so the automatic garbage collection is never triggered.
The context
I have an infrastructure where a server produces long-running jobs. Each job consists of logical chunks that are about the same size, but each job has a vastly different number of chunks. I have a scalable number of workers which take the chunks, do the work (processor heavy), and return the result to the server. Each worker works on only one chunk at a time.
Currently for scheduling the chunks I use an SQS queue so when a job is created I dump all chunks to the SQS queue and the workers will take the chunks. It works like a FIFO.
So to summarize what does what:
A job is a lot of processor intensive calculations. It consists of multiple independent chunks that are about the same size.
A chunk is a processor intensive calculation the workers can work on. Independent of other chunks and can be calculated itself without additional context.
The server creates jobs. When the job is created the server puts all the job's chunks on the Queue (and essentially forgets about the jobs).
The workers work on chunks. It does not matter which job a chunk is part of; a worker can take on any chunk. When a worker has nothing to work on (it is newly created, or has finished its previous chunk), it looks for the next chunk on the queue.
The problem
When a job is scheduled, all of its chunks are added to the queue, so when the next job is scheduled, work on it does not start until the first job is finished. In a scenario where job A (first) takes 4 hours and job B (second) takes 5 minutes, job B will not be started for the first few hours and will only finish after about 4 hours 5 minutes. A large job therefore effectively blocks all other calculations. The queue will look like this:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 ... A100 B1 B2
I would like to not block the new calculations coming in but process them in a different order like:
A1 B1 A2 B2 A3 A4 A5 A6 A7 A8 A9 A10 ... A100
If a third job arrives after A1 and B1 have been picked up, it should still not be blocked:
A2 B2 C1 A3 C2 A4 C3 A5 C4 A6 A7 A8 A9 A10 ... A100
With ordering the chunks like this I can guarantee the following:
For every job the first task is picked up relatively fast.
For every job there is constant perceived progress (some new chunks are always being finished).
Short jobs (not many chunks) are finished relatively fast.
Solutions
I know I cannot reorder an SQS queue in itself, so I might have to do something like:
Change technologies. Maybe some queue in AWS supports this out of the box.
When a new job is about to be scheduled, the server just takes all chunks from the queue, shuffles in the new chunks, puts back everything in the queue.
Somehow reach the intended behavior with a priority queue (maybe RabbitMQ).
Is there some easy, safe solution for this? How should I do it?
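The desired ordering is a round-robin over all jobs that still have chunks pending. The following is a minimal in-memory sketch of that ordering only, assuming the server (or a small dispatcher in front of the workers) hands out chunks itself instead of dumping them all into one FIFO queue. The `RoundRobinScheduler` class and its method names are hypothetical, and this is not a feature of SQS.

```python
from collections import OrderedDict, deque


class RoundRobinScheduler:
    """Hand out one chunk per job in turn, cycling over jobs with pending work."""

    def __init__(self):
        self._jobs = OrderedDict()   # job id -> deque of pending chunks

    def add_job(self, job_id, chunks):
        """Register a new job; assumes job ids are unique."""
        pending = deque(chunks)
        if pending:
            self._jobs[job_id] = pending

    def next_chunk(self):
        """Return the next chunk to work on, or None when nothing is pending."""
        if not self._jobs:
            return None
        job_id, pending = next(iter(self._jobs.items()))
        chunk = pending.popleft()
        if pending:
            self._jobs.move_to_end(job_id)   # rotate this job to the back
        else:
            del self._jobs[job_id]           # every chunk has been handed out
        return chunk


scheduler = RoundRobinScheduler()
scheduler.add_job("A", [f"A{i}" for i in range(1, 7)])
scheduler.add_job("B", ["B1", "B2"])
picked_first = [scheduler.next_chunk() for _ in range(2)]
scheduler.add_job("C", [f"C{i}" for i in range(1, 5)])   # arrives after A1 and B1
picked_rest = [scheduler.next_chunk() for _ in range(10)]
print(picked_first)   # ['A1', 'B1']
print(picked_rest)    # ['A2', 'B2', 'C1', 'A3', 'C2', 'A4', 'C3', 'A5', 'C4', 'A6']
```

With several server instances, the same rotation would need shared state (for example a database table or one queue per job); that part is outside this sketch.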
First off, please forgive my complete lack of knowledge in this field. Some time ago I got a couple of really cheap Wi-Fi IP cameras that I was hoping to use on my local network. Unfortunately, it turned out that about a year ago the manufacturer (zmodo) had introduced its own streaming protocol and done away with the RTSP port present in previous models. The manufacturer-supplied apps are cloud-based only, which doesn't put me at ease. Luckily, people have been trying to come up with solutions to this and have been fairly successful with older models. So I tried to see what I can do myself.
In essence, the camera listens on port 8000 for 12 byte long commands. One of these commands triggers a stream of data. The data looks a little bit like a sequence of AVI chunks that start with either 00dc or 01dc. Just like in AVI, these tags are followed by a four-byte size (let's call it S) and 24 bytes that look like a time stamp (but could also contain more info). The total number of bytes between 00dc or 01dc tags is exactly 32 bytes more than S. So, we have something like this
30 30 64 63 S[0] S[1] S[2] S[3] ....24 bytes for a time stamp? ... S bytes of data that seems encrypted...
30 31 64 63 S[0] S[1] S[2] S[3] ....24 bytes for a time stamp? ... 00 00 00 01 61 .... S - 5 bytes of data...
The second type of chunk, prefixed by 01dc, seems to contain unencrypted B- or P-frame NAL units (if my basic understanding is correct), but my question is about the first type. The first type probably contains I-frames, but I cannot find a NAL delimiter or type byte. Additionally, the first 32 bytes of the S bytes of actual data are the same for every unit, i.e. I have
30 30 64 63 S[0] S[1] S[2] S[3] ....24 bytes for a time stamp? ... 32 constant bytes ... S-32 bytes of data
The camera also provides a 128-bit key which I naturally assumed to be for stream encryption (although it could be for cloud authentication or TLS or who knows what). So my question is: in your experience, which (encrypted) version of H.264 is this closest to? Any idea what the 32 constant bytes are?
EDIT: I am not sure that the keyframes are encrypted, I am only speculating. Actually, they seem to contain large areas of constant bytes at the end (although the bytes change from one unit to the next, i.e. e4 e4 .... e4 in one, then 93 93 .... 93, which rules out AES I think)
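The chunk layout described above (4-byte tag, 4-byte size S, 24 further header bytes, then S bytes of data) can be walked with a short parser. This is a sketch under assumptions: the byte order of S is taken to be little-endian as in AVI/RIFF, which is not confirmed for this camera, and `parse_chunks` and `make_chunk` are hypothetical helper names. The sample data is synthetic, not a real capture.

```python
import struct

HEADER_LEN = 32  # 4-byte tag + 4-byte size + 24 bytes of (presumed) timestamp


def parse_chunks(buf):
    """Split captured bytes into (tag, extra_header, payload) tuples."""
    chunks = []
    pos = 0
    while pos + HEADER_LEN <= len(buf):
        tag = buf[pos:pos + 4]
        if tag not in (b"00dc", b"01dc"):
            raise ValueError(f"unexpected tag {tag!r} at offset {pos}")
        # Assumption: little-endian size, as in AVI/RIFF.
        (size,) = struct.unpack_from("<I", buf, pos + 4)
        extra = buf[pos + 8:pos + HEADER_LEN]
        payload = buf[pos + HEADER_LEN:pos + HEADER_LEN + size]
        if len(payload) < size:
            break  # incomplete chunk at the end of the capture
        chunks.append((tag, extra, payload))
        pos += HEADER_LEN + size
    return chunks


def make_chunk(tag, payload):
    """Build a synthetic chunk with 24 zero bytes in place of the timestamp."""
    return tag + struct.pack("<I", len(payload)) + bytes(24) + payload


sample = (make_chunk(b"00dc", b"\xaa" * 40)
          + make_chunk(b"01dc", b"\x00\x00\x00\x01\x61" + b"\x10" * 7))
parsed = parse_chunks(sample)
for tag, extra, payload in parsed:
    print(tag.decode(), len(payload), payload[:5].hex())
```

Dumping the first bytes of each 00dc payload this way makes it easy to compare the 32 constant bytes across units.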
Hello, I have made a program for my old Windows Mobile phone that sends GPS data, temperature, etc. every 5 seconds, just for experimental reasons, to create a fleet management system.
I noticed that within one hour 350KB were consumed, although I sent only 20KB of data...
As I don't have deep knowledge of networks: how much does a TCP connection cost in bytes?
Maybe I should keep the socket alive, because I close and open it every 5 seconds. Would that save bytes?
Also, does MTU matter here?
Any other ideas to reduce the overhead?
Thank you.
Let's do some math here.
Every 5 seconds is 720 connections per hour plus data. 20K / 720 is about 28 bytes of payload (your GPS data) for each connection.
IP and TCP headers alone are 48 bytes in addition to whatever data is being sent.
3-way handshake connection: 3 packets (2 out, 1 in) == 96 bytes out and 48 bytes in
Outbound Data-packet: 48+28 bytes == 76 bytes (out)
Inbound Ack: 48 bytes (in)
Close: 48 bytes (out)
Final Ack: 48 bytes (in)
Total out per connection: 220
Total in per connection: 144
Total data send/received per connection: 220+144 = 364
Total data usage in one hour = 364 * 720 = 262K
So I'm in the ballpark of your data usage estimates.
If you're looking to reduce bandwidth usage, here are three ideas:
Scale back on your update rate.
Don't tear down the socket connection each time. Just keep it open.
Given your GPS coordinates are periodically updated, you could consider using UDP instead of TCP. There's potential for packet loss, but given you're retransmitting fresher data every 5 seconds anyway, an update getting lost isn't worth the bandwidth to retransmit. IP and UDP headers combined are only 28 bytes with no "connection" overhead.
UPDATE
When I originally posted this, I misunderstood the connection close to be a single exchange of FIN packets between client and server. In practice, the client sends a FIN when it initiates the close, and the server ACKs that FIN. The server then sends its own FIN, which is ACK'd by the client. In other words, that is an additional 96 bytes per connection. Redoing the math:
Total data send/received per connection =
220+48 + 144+48 = 460
Total data usage in one hour = 460 * 720 = 331K
So my revised estimate of 331KB in one hour is a bit closer to what the OP saw.
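The revised arithmetic can be reproduced packet by packet. This is only a restatement of the numbers used above (48 header bytes per packet, 28 bytes of payload, 720 connections per hour). The persistent-socket and UDP figures are derived from the same assumptions and ignore retransmissions and keep-alive traffic.

```python
HEADER = 48                        # IP + TCP header bytes per packet, as assumed above
PAYLOAD = 28                       # about 20K per hour spread over 720 connections
CONNECTIONS_PER_HOUR = 3600 // 5   # one connection every 5 seconds

# (direction, bytes) for each packet of one connect / send / close cycle
packets = [
    ("out", HEADER),            # SYN
    ("in", HEADER),             # SYN + ACK
    ("out", HEADER),            # ACK
    ("out", HEADER + PAYLOAD),  # data
    ("in", HEADER),             # ACK of the data
    ("out", HEADER),            # client FIN
    ("in", HEADER),             # ACK of the client FIN
    ("in", HEADER),             # server FIN
    ("out", HEADER),            # ACK of the server FIN
]

bytes_out = sum(size for direction, size in packets if direction == "out")
bytes_in = sum(size for direction, size in packets if direction == "in")
per_connection = bytes_out + bytes_in
hourly_reconnect = per_connection * CONNECTIONS_PER_HOUR

# Keeping the socket open leaves only the data packet and its ACK per update.
hourly_persistent = (HEADER + PAYLOAD + HEADER) * CONNECTIONS_PER_HOUR

# UDP: 28 bytes of IP + UDP headers, no handshake and no ACKs.
UDP_HEADER = 28
hourly_udp = (UDP_HEADER + PAYLOAD) * CONNECTIONS_PER_HOUR

print(bytes_out, bytes_in, per_connection)               # 268 192 460
print(hourly_reconnect, hourly_persistent, hourly_udp)   # 331200 89280 40320
```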
I'm trying to understand the performance numbers I'm getting and how to determine the optimal number of threads.
See the bottom of this post for my results
I wrote an experimental multi-threaded web client in Perl which downloads a page, grabs the source for each image tag and downloads the image, discarding the data.
It uses a non-blocking connect with an initial per file timeout of 10 seconds which doubles after each timeout and retry. It also caches IP addresses so each thread only has to do a DNS lookup once.
The total amount of data downloaded is 2271122 bytes in 1316 files via 2.5Mbit connection from http://hubblesite.org/gallery/album/entire/npp/all/hires/true/ . The thumbnail images are hosted by a company which claims to specialize in low latency for high bandwidth applications.
Wall times are:
1 Thread takes 4:48 -- 0 timeouts
2 Threads takes 2:38 -- 0 timeouts
5 Threads takes 2:22 -- 20 timeouts
10 Threads take 2:27 -- 40 timeouts
50 Threads take 2:27 -- 170 timeouts
In the worst case ( 50 threads ) less than 2 seconds of CPU time are consumed by the client.
avg file size 1.7k
avg rtt 100 ms ( as measured by ping )
avg cli cpu/img 1 ms
The fastest average download speed is 5 threads at about 15 KB / sec overall.
The server actually does seem to have pretty low latency as it takes only 218 ms to get each image meaning it takes only 18 ms on average for the server to process each request:
0 cli sends syn
50 srv rcvs syn
50 srv sends syn + ack
100 cli conn established / cli sends get
150 srv recv's get
168 srv reads file, sends data , calls close
218 cli recv HTTP headers + complete file in 2 segments MSS == 1448
I can see that the per file average download speed is low because of the small file sizes and the relatively high cost per file of the connection setup.
What I don't understand is why I see virtually no improvement in performance beyond 2 threads. The server seems to be sufficiently fast, but already starts timing out connections at 5 threads.
The timeouts seem to start after about 900 - 1000 successful connections whether it's 5 or 50 threads, which I assume is probably some kind of throttling threshold on the server, but I would expect 10 threads to still be significantly faster than 2.
Am I missing something here?
EDIT-1
Just for comparison's sake I installed the DownThemAll Firefox extension and downloaded the images using it. I set it to 4 simultaneous connections with a 10 second timeout. DownThemAll took about 3 minutes to download all the files and write them to disk, and it also started experiencing timeouts after about 900 connections.
I'm going to run tcpdump to try and get a better picture what's going on at the tcp protocol level.
I also cleared Firefox's cache and hit reload. 40 Seconds to reload the page and all the images. That seemed way too fast - maybe Firefox kept them in a memory cache which wasn't cleared? So I opened Opera and it also took about 40 seconds. I assume they're so much faster because they must be using HTTP/1.1 pipelining?
And the Answer Is!??
So after a little more testing and writing code to reuse the sockets via pipelining I found out some interesting info.
When running at 5 threads, the non-pipelined version retrieves the first 1026 images in 77 seconds but takes a further 65 seconds to retrieve the remaining 290 images. This pretty much confirms MattH's theory that my client looks like a SYN FLOOD to the server, causing it to stop responding to my connection attempts for a short period of time. However, that is only part of the problem, since 77 seconds is still very slow for 5 threads to get 1026 images; even without the SYN FLOOD issue it would still take about 99 seconds to retrieve all the files. So based on a little research and some tcpdump captures, it seems the other part of the issue is latency and the connection setup overhead.
Here's where we get back to the issue of finding the "Sweet Spot" or the optimal number of threads. I modified the client to implement HTTP/1.1 Pipelining and found that the optimal number of threads in this case is between 15 and 20. For example:
1 Thread takes 2:37 -- 0 timeouts
2 Threads takes 1:22 -- 0 timeouts
5 Threads takes 0:34 -- 0 timeouts
10 Threads take 0:20 -- 0 timeouts
11 Threads take 0:19 -- 0 timeouts
15 Threads take 0:16 -- 0 timeouts
There are four factors which
affect this: latency / RTT, maximum end-to-end bandwidth, recv buffer size
and the size of the image files being downloaded. See this site for a
discussion on how receive buffer size and RTT latency affect available
bandwidth.
In addition to the above, average file size affects the maximum per connection
transfer rate. Every time you issue a GET request you create an empty gap in
your transfer pipe which is the size of the connection RTT. For example, if
your Maximum Possible Transfer Rate (recv buff size / RTT) is 2.5Mbit and
your RTT is 100ms, then every GET request incurs a minimum 32kB gap in your
pipe. For a large average image size of 320kB that amounts to a 10% overhead
per file, effectively reducing your available bandwidth to 2.25Mbit. However,
for a small average file size of 3.2kB the overhead jumps to 1000% and
available bandwidth is reduced to 232 kbit / second - about 29kB / second.
So to find the optimal number of threads:
Gap Size = MPTR * RTT
Optimal threads = MPTR / (MPTR / (Gap Size + AVG file size) * AVG file size)
For my above scenario this gives me an optimum thread count of 11 threads, which is extremely close to my real world results.
If the actual connection speed is slower than the theoretical MPTR then it
should be used in the calculation instead.
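As a worked check, the calculation can be read as MPTR divided by the effective per-connection rate, where that rate is MPTR scaled by file size / (gap size + file size). The sketch below plugs in the 3.2kB example from the text (2.5Mbit MPTR, 100 ms RTT). The function name is hypothetical, and this reading of the formula is an interpretation chosen because it reproduces the 232 kbit figure quoted above.

```python
def optimal_threads(mptr_bytes_per_s, rtt_s, avg_file_bytes):
    """Return (thread count, effective per-connection rate in bytes/s)."""
    gap = mptr_bytes_per_s * rtt_s   # bytes that could have been sent during one RTT
    per_connection_rate = mptr_bytes_per_s / (gap + avg_file_bytes) * avg_file_bytes
    return mptr_bytes_per_s / per_connection_rate, per_connection_rate


mptr = 2_500_000 / 8   # 2.5 Mbit/s expressed in bytes per second
threads, rate = optimal_threads(mptr, rtt_s=0.100, avg_file_bytes=3200)
print(round(threads, 1))          # 10.8, i.e. about 11 threads
print(round(rate * 8 / 1000))     # 232 (kbit/s per connection)
```

Algebraically the thread count simplifies to (gap size + file size) / file size, so smaller files or a longer RTT both push the optimum up.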
Please correct me if this summary is incorrect:
Your multi-threaded client will start a thread that connects to the server and issues just one HTTP GET then that thread closes.
When you say 1, 2, 5, 10, 50 threads, you're just referring to how many concurrent threads you allow, each thread itself only handles one request
Your client takes between 2 and 5 minutes to download over 1000 images
Firefox and Opera will download an equivalent data set in 40 seconds
I suggest that the server rate-limits HTTP connections, either in the webserver daemon itself, in a server-local firewall or, most likely, in a dedicated firewall.
You are actually abusing the web service by not re-using the HTTP connections for more than one request, and the timeouts you experience are because your SYN FLOOD is being clamped.
Firefox and Opera are probably using between 4 and 8 connections to download all of the files.
If you redesign your code to re-use the connections you should achieve similar performance.
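To illustrate connection reuse, here is a minimal sketch in Python (not the Perl client from the question) that fetches several files over one persistent HTTP/1.1 connection. It uses keep-alive with sequential requests, not pipelining. The local test server, the file names and the 1700-byte body are made up so the example is self-contained and does not touch a real site.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_ports = set()   # client source ports observed by the test server


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # lets the server keep connections open

    def do_GET(self):
        seen_ports.add(self.client_address[1])
        body = b"x" * 1700
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass   # keep the example quiet


server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One TCP connection, many requests: the handshake cost is paid once.
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
sizes = []
for i in range(5):
    conn.request("GET", f"/img{i}.jpg")
    sizes.append(len(conn.getresponse().read()))   # read fully before the next request
conn.close()

server.shutdown()
server.server_close()
print(sizes, len(seen_ports))
```

Each worker thread in a real client would hold one such connection and loop over its share of the URLs.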
A presentation by Mikhael Goikhman from a 2003 Perl conference includes a pair of examples of prime-number-finding scripts. One is threaded, and the other is not. Upon running the scripts (print lines commented out), I got an execution time of 0.011 seconds on the non-threaded one, and 2.343 (!) seconds on the threaded version. What accounts for the stunning difference in times?
I have some experience with threads in Perl and have noticed before that thread creation times can be particularly brutal, but this doesn't seem to be the bottleneck in Goikhman's example.
Jay P. is right:
~$ strace -c ./threads.pl
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.80    0.116007       10546        11           futex
  0.20    0.000229           6        36           mmap2
  0.00    0.000000           0        31           read
  0.00    0.000000           0        49        13 open
  0.00    0.000000           0        36           close
Compare that with:
~$ strace -c ./no-threads.pl
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.62    0.000261         261         1           execve
  9.38    0.000027           0       167           write
  0.00    0.000000           0        12           read
  0.00    0.000000           0        38        13 open
  0.00    0.000000           0        25           close
I'm a Python guy, not Perl, so I only have a vague idea of what the code is doing. However, always be careful when you see Queues. Python has a thread-safe Queue, and it looks like Perl does too. They're fantastic in that they take care of thread-safety for you, but they typically involve lots of expensive locking and unlocking of the queue, which is probably where all your time is going.
How many processors do you have? In general, any calculation intensive task will be slower when # of threads > # of processors. This is because it is expensive to switch between threads ("context switch"). Context switches involve stopping 1 thread, saving its context, then putting in another thread's context into the processor so it can run. And all for what? So thread A can calculate if 12321 is divisible by 7 instead of thread B?
If you have 2 procs, I would bet that a version with 2 threads might be the fastest, 4 procs -> use 4 threads, etc.
It's a bit of a pathological case. The real answer is: Before starting to use Perl ithreads, you need to know a bit about how things work. They're notoriously inefficient at some things (sharing data) and good at others (they're concurrent).
If the chunks of work that you let the sub-threads do were increased by a significant amount in comparison to the number of times you send data from one thread to another, things would look quite different.
Comparing with Python threads like Jay P: As he correctly states, Python threads are cooperative and run on one core only. Perl's ithreads are very different. They can run on a core each, but being able to do this is paid for with having essentially a separate interpreter per thread. That makes communication between threads similar to inter-process communication including the associated overhead.
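The point about chunk size versus hand-offs can be sketched by counting queue operations. This is an illustrative Python example, not a port of the Perl scripts from the presentation; `find_primes` and its parameters are hypothetical. It only shows how batching reduces the number of times data crosses between threads, and makes no timing claims.

```python
import queue
import threading


def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def find_primes(limit, batch_size, workers=2):
    """Return (sorted primes below limit, number of queue hand-offs)."""
    tasks, results = queue.Queue(), queue.Queue()

    def worker():
        while True:
            batch = tasks.get()
            if batch is None:   # sentinel: no more work
                break
            results.put([n for n in batch if is_prime(n)])

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    handoffs = 0
    for start in range(0, limit, batch_size):
        tasks.put(range(start, min(start + batch_size, limit)))
        handoffs += 1
    for _ in threads:
        tasks.put(None)         # one sentinel per worker
    for t in threads:
        t.join()
    primes = []
    while not results.empty():  # safe here: all workers have finished
        primes.extend(results.get())
    return sorted(primes), handoffs


per_item = find_primes(1000, batch_size=1)
batched = find_primes(1000, batch_size=250)
print(per_item[1], batched[1], per_item[0] == batched[0])   # 1000 4 True
```

Both runs find the same primes, but the batched run pays the locking cost of the queue 4 times instead of 1000.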