I have a requirement to use Locust to simulate 20,000 (and more) users in a 10-minute test window.
The locustfile is a task sequence of 9 API calls. I am trying to determine the ideal number of workers, and how many workers should be attached to each EC2 instance on AWS. My testing shows that with 20 workers on two EC2 instances, the CPU load is minimal. The master, however, suffers badly: a 4-CPU, 16 GB RAM machine as the master ends up thrashing to the point that the workers start printing messages like this:
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.util.exception_handler: Retry failed after 3 times.
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/ERROR/locust.runners: RPCError found when sending heartbeat: ZMQ sent failure
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.runners: Reset connection to master
The master seems memory-exhausted, as each locust master process has grown to 12 GB of virtual RAM. OK, so the EC2 instance has a problem. But if I need to test 20,000 users, is there a machine big enough on the planet to handle this? Or do I need to take a different approach, and if so, what is the recommended direction?
In my specific case, one of the steps is to download a file from CloudFront, which is randomly selected in one of the tasks. This means that the more open connections there are to CloudFront trying to download a file, the more congested the available network becomes.
Because the app client is actually a native mobile app, and there are a lot of factors affecting the download speed for each device, I decided to switch from a GET request to a HEAD request. This allows me to test the response time from CloudFront, where the distribution is protected by a Lambda@Edge function which authenticates the user using data from earlier in the test.
Doing this dramatically improved the load test results and doesn't artificially skew the other tests: with bandwidth or system resource exhaustion, every other test would be negatively impacted.
Using this approach I successfully executed a 10,000-user test in a ten-minute run. I used 4 EC2 t2.xlarge instances with 4 workers per instance. The 9 tasks in the test plan resulted in almost 750,000 URL calls.
The answer for the question in the title is: "It depends"
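For reference, here is a minimal sketch of what the HEAD-based CloudFront task might look like; the URL list, host, header name, and token handling below are placeholders, not the actual test plan:

```python
import random
from locust import HttpUser, SequentialTaskSet, task, between

# Placeholder list of CloudFront object paths; the real test selects a file at random too.
FILE_PATHS = ["/assets/a.bin", "/assets/b.bin", "/assets/c.bin"]

class DownloadFlow(SequentialTaskSet):
    @task
    def check_cloudfront_file(self):
        path = random.choice(FILE_PATHS)
        # HEAD instead of GET: exercises the Lambda@Edge auth and measures CloudFront
        # response time without pulling the body and saturating the load generators.
        self.client.head(path, name="cloudfront_head",
                         headers={"Authorization": self.user.token})

class MobileUser(HttpUser):
    tasks = [DownloadFlow]
    wait_time = between(1, 3)
    host = "https://dxxxxxxxxxxxx.cloudfront.net"  # placeholder distribution

    def on_start(self):
        # Placeholder for the authentication step that runs earlier in the real task sequence.
        self.token = "token-obtained-earlier-in-the-test"
```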
Your post is a little confusing. You say you have 10 master processes? Why?
This problem is most likely not related to the master at all, as it does not care about the size of the downloads (which seems to be the only difference between your test case and most other Locust tests).
There are some general tips that might help:
Switch to FastHttpUser (https://docs.locust.io/en/stable/increase-performance.html)
Monitor your network usage (if your load gens are already maxing out their bandwidth or CPU then your test is very unrealistic anyway, and adding more users just adds to the noise. In general, start low and work your way up)
Increase the number of loadgens
In general, the number of users is not an issue for locust, but the number of requests per second or the bandwidth might be.
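Regarding the first tip: switching to FastHttpUser is usually just a change of base class, something like the sketch below (the endpoint is a placeholder):

```python
from locust import task, between
from locust.contrib.fasthttp import FastHttpUser

class QuickUser(FastHttpUser):
    wait_time = between(1, 3)

    @task
    def index(self):
        # Same self.client call style as HttpUser, backed by a much faster HTTP client.
        self.client.get("/")
```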
As per my understanding, when we want to share content between the author and publish instances, we use binaryless replication. But what are the use cases where I should use binaryless replication?
I want to know the best practices for binaryless replication.
Binaryless replication, or a shared data store, works on the basis that binaries are not copied across data stores; only the metadata is replicated or transferred between the instances. The setup can be applied between authors and publishers. Alternatively, the data store can be shared between author instances in a cold standby setup. It has three major use cases:
When you are dealing with very large DAM assets (high-res images or videos), any replication involving binary copies over the network is very costly. With binaryless replication the data store is shared, so binaries are not copied and you save on internal network traffic. This saves time and cost for such setups.
When you have lots of publishers, binary copies can bottleneck the author's network. Binaryless replication reduces that transfer load, so publishers can be scaled out without a corresponding increase in network usage.
TarMK cold standby has a limit of 2GB binary sync transfer across primary and standby standalone data stores. Binaryless (or shared datastore) is the only workaround for this limit.
For very large datastores you also save time in backups and restores as there is only one store as opposed to 2 stores for author and publishers.
I was recently asked this question in an interview. Let's suppose I have 2000 servers. I want to transfer a 5 GB file to all these servers from a centralized server. Come up with an efficient algorithm.
My response:
I will use perl/python to scp the file over from the centralized server to the first server.
In parallel, I will also start sending the file to the other servers. Sending it one server at a time would be very inefficient, so doing it in parallel should speed things up.
Is there a better way to do this ?
Sure, you would use some sort of script, since you don't want to do that manually.
But instead of sending the file from one server to all the others, you would start by sending the file to k servers. As soon as these k servers have received the file (let's say at time t), they can start distributing the file too, so after approximately time 2t roughly k^2 servers have the file, instead of 2k with the original solution. After time 3t roughly k^3 servers have the file, and so on. You continue with that algorithm until every server has got its copy.
To make the whole process a bit faster still, you could also divide the file into chunks, so that a server can start redistributing it before it has received the whole file (you will end up with something like BitTorrent).
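Just to illustrate how fast that fan-out grows, here is a toy calculation (k = 3 is an arbitrary choice):

```python
def rounds_needed(total_servers, k):
    """Rounds of transfers until everyone has the file, if every holder serves k new servers per round."""
    have = 1          # only the central server has the file at the start
    rounds = 0
    while have < total_servers:
        have += have * k   # each current holder hands the file to k new servers
        rounds += 1
    return rounds

# With k = 3 uploads per holder, 2000 servers are covered in 6 rounds,
# i.e. roughly 6x the time of a single full-file transfer.
print(rounds_needed(2000, 3))
```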
Definitely a torrent-like approach is the best and proven strategy for spreading the load in this scenario. But I think when an interviewer asks such a hypothetical question, they are probably also looking for your assumptions and expecting counter-questions:
upload / download capacity of servers.
network locality, i.e. how many hops apart the different machines are.
can the file be compressed before sending
how to verify integrity (e.g. an MD5 hash)
Now my scheme remains the same torrent-like approach, thanks to @Misch. But if all servers are on the same network and have the same capacity, then:
Divide the file into 2000 parts; each server gets 5 GB / 2000 ≈ 2.5 MB (a file segment) to host, and the central server acts as a beacon to tell the other servers where the segments are.
Each server downloads these segments in random order from the other servers; downloading them sequentially would create a bottleneck on one machine.
Depending on the machine, we can cap the number of active upload/download threads, with each thread transferring a separate file segment. When a server is already serving its maximum number of hosts, it can reject a download request; the requesting host simply picks another random segment to download.
We can use a checksum for each individual file segment, and one for the whole file, to verify integrity.
This ensures that all servers are uploading/downloading close to their upstream/downstream bandwidth. But obviously, in the real world I could just set up a secured torrent and use that instead.
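A quick sketch of the checksum part; the segment size follows the 2000-way split above, and the manifest format is just illustrative:

```python
import hashlib

SEGMENT_SIZE = 5 * 1000**3 // 2000   # ~2.5 MB per segment, matching the 2000-way split

def build_manifest(path):
    """Hash every segment plus the whole file, for the beacon server to publish."""
    segment_hashes = []
    whole_file = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            segment = f.read(SEGMENT_SIZE)
            if not segment:
                break
            segment_hashes.append(hashlib.sha256(segment).hexdigest())
            whole_file.update(segment)
    return segment_hashes, whole_file.hexdigest()

# Each downloader re-hashes a segment after fetching it and compares it against the manifest entry;
# the whole-file hash is checked once at the end.
```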
If you split the file into tiny chunks, then each server can begin transferring the chunks it has received before the entire file has even been downloaded. This is basically the algorithm that BitTorrent uses, and it is MUCH (i.e. asymptotically) faster than having each server send the file only after it has received the whole thing.
In fact, with an infinitesimally small chunk size (i.e. the purely theoretical case), the time it takes to distribute a file of size m to n servers doesn't even depend on the value of n -- only on the size of the file being distributed (i.e. O(m)). Of course, in practice there are some overheads/details to consider (which d1val summarized nicely) that make it take slightly longer.
Conversely, if you have each server upload the file to another server only after it has received the whole file, then the running time is O(m log(n)) -- which is asymptotically larger than the BitTorrent approach.
Also, just to add: usually when an interviewer asks this kind of question, he/she is asking about the algorithm, not so much the implementation details.
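To put rough numbers on the asymptotics, here is a back-of-the-envelope comparison (the 1 Gbit/s per-link speed is an assumption, not from the question):

```python
import math

file_gb = 5
link_gbit_per_s = 1                                  # assumed per-server link speed
single_transfer_s = file_gb * 8 / link_gbit_per_s    # ~40 s to push the file over one link

n = 2000
# Whole-file relay: the number of holders roughly doubles each round -> ~log2(n) full transfers back to back.
relay_s = single_transfer_s * math.ceil(math.log2(n))   # ~440 s
# Chunked / pipelined (BitTorrent-like): roughly one full-transfer time in total,
# plus a small pipeline-fill overhead.
pipelined_s = single_transfer_s * 1.1                    # ~44 s

print(f"relay: {relay_s:.0f} s, pipelined: {pipelined_s:.0f} s")
```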
I was asked a similar kind of question, where the torrent way of doing things was not accepted.
The question was: "If Microsoft has to push a software update to the 2000 servers it has across the US, how would it do it?" So these servers are not capable of doing torrent-based file transfer.
My answer was:
From the main server, with its list of 2000 nodes, run a batching process; the size of each batch is determined by the network speed you have to these nodes.
First, select a sample of 100 nodes and run a speed test against them. The speed test gives an indication of the median speed available across these 100 nodes, which can act as a sample for the entire network.
So now you have a value X Mbps, the speed at which you can transfer to these nodes.
Look at the capacity of your own outgoing connection. If the central server has an upload capacity of Y Gbps,
then the batch size = your upload capacity (Y) / X (the speed found by the speed test).
Using this batch size, you work through the 2000 servers, transferring to each batch in parallel.
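As a sketch, the batch-size arithmetic is just this (the numbers are placeholders):

```python
import math

median_node_speed_mbps = 100    # X: median speed from the 100-node speed test
central_upload_gbps = 10        # Y: the central server's outgoing capacity
total_servers = 2000

# How many simultaneous transfers the central uplink can sustain at full speed per node.
batch_size = (central_upload_gbps * 1000) // median_node_speed_mbps   # 100 servers per batch
num_batches = math.ceil(total_servers / batch_size)                   # 20 batches

print(batch_size, num_batches)
```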
Any Inputs ?
I guess you could put the file on an NFS server and have your hosts mount that NFS share.
We need to read and count different types of messages and run some statistics on a 10 GB text file, e.g. a FIX engine log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl, but the language doesn't really matter.
I have found some interesting tips in Tim Bray's WideFinder project. However, we've found that using memory mapping is inherently limited by the 32-bit architecture.
We tried using multiple processes, which seems to work faster if we process the file in parallel using 4 processes on 4 CPUs. Adding multi-threading slows it down, maybe because of the cost of context switching. We tried changing the size of the thread pool, but that is still slower than the simple multi-process version.
The memory mapping part is not very stable; sometimes it takes 80 seconds and sometimes 7 seconds on a 2 GB file, maybe from page faults or something related to virtual memory usage. Anyway, mmap cannot scale beyond 4 GB on a 32-bit architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. Looked into MapReduce as well, but the problem is really I/O bound; the processing itself is sufficiently fast.
So we decided to try to optimize the basic I/O by tuning buffering size, type, etc.
Can anyone who is aware of an existing project where this problem was efficiently solved in any language/platform point to a useful link or suggest a direction?
Most of the time you will be I/O bound, not CPU bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you can prove that you can do more I/O than your single CPU can process, don't waste your time with anything more. Anyway, you should ask: why on Earth is this all in one huge file? Why on Earth don't they split it up in a reasonable way when they generate it? That would be a magnitude more worthwhile. Then you could put the pieces on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).
Measure, don't assume. Don't forget to flush the caches before each test. Remember that sequential I/O is a magnitude faster than random I/O.
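A minimal sketch of that single-threaded, single-pass approach (shown in Python for brevity; the Perl version is the analogous while-loop over the filehandle, and the tag-35 pattern assumes standard SOH-delimited FIX fields):

```python
import re
from collections import Counter

# FIX carries the message type in tag 35; fields are delimited by SOH (\x01).
MSG_TYPE = re.compile(rb"\x0135=([^\x01]+)\x01")

counts = Counter()
with open("fixengine.log", "rb") as f:   # placeholder file name
    for line in f:                        # buffered, sequential reads in one pass
        match = MSG_TYPE.search(line)
        if match:
            counts[match.group(1)] += 1

for msg_type, n in counts.most_common():
    print(msg_type.decode(), n)
```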
This all depends on what kind of preprocessing you can do, and when.
On some of the systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is that we don't need to process these files until hours after they're created, and at creation time we don't really have any other load on the machines.
Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing (well, it's done over Unix sockets, with a custom-made zcat). It trades CPU time for disk I/O time, and for our system that has been well worth it. There are of course a lot of variables that can make this a very poor design for a particular system.
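In script form that pattern is roughly the following (the file name and the per-line work are placeholders):

```python
import subprocess

line_count = 0
# Decompress on the fly and stream lines into the processing step:
# trades CPU time (gunzip) for disk I/O (the compressed file is 5-7x smaller).
with subprocess.Popen(["zcat", "fixengine.log.gz"], stdout=subprocess.PIPE) as proc:
    for line in proc.stdout:
        line_count += 1               # stand-in for the real per-message processing

print(line_count)
```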
Perhaps you've already read this forum thread, but if not:
http://www.perlmonks.org/?node_id=512221
It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
Oh, and is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be what makes your multi-threaded attempt not work.
Best of luck.
I wish I knew more about the content of your file, but knowing nothing other than that it is text, this sounds like an excellent MapReduce kind of problem.
PS: the fastest read of any file is a linear read. cat file > /dev/null should give the speed at which the file can be read.
Have you thought of streaming the file and filtering out any interesting results to a secondary file? (Repeat until you have a file of manageable size.)
Basically you need to divide and conquer. If you have a network of computers, copy the 10 GB file to as many client PCs as possible and have each client PC read a different offset of the file. For an added bonus, get EACH PC to use multithreading in addition to the distributed reading.
Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
Realize that manipulating a 10 GB file, transferring it across the (even if local) network, exploring complicated solutions, etc. all take time.
I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
Hmm, but what's wrong with the read() call in C? It usually has a 2 GB limit per call, so just call it 5 times in sequence. That should be fairly fast.
If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.
If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).
Since you said platform and language don't matter...
If you want stable performance that is as fast as the source medium allows, the only way I am aware of to do this on Windows is with overlapped, non-OS-buffered, aligned sequential reads. You can probably get to a few GB/s with two or three buffers; beyond that, at some point you need a ring buffer (one writer, one or more readers) to avoid any copying. The exact implementation depends on the driver/APIs.
If there is any memory copying going on in the thread handling the I/O (in kernel or user mode), then obviously the larger the buffer being copied, the more time is wasted on that rather than on the I/O itself. So the optimal buffer size depends on the firmware and driver; on Windows, good values to try are multiples of 32 KB for disk I/O.
Windows file buffering, memory mapping and all that stuff add overhead. They are only a win if you are reading the same data multiple times, accessing it randomly, or both. So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpy's. If you're using C#, there are also penalties for calling into the OS due to marshaling, so the interop code may need a bit of optimization unless you use C++/CLI.
Some people prefer throwing hardware at problems, but if you have more time than money, in some scenarios it's possible to optimize things to perform 100-1000x better on a single consumer-level computer than on 1000 enterprise-priced computers. The reason is that if the processing is also latency-sensitive, going beyond two cores probably just adds latency. This is why drivers can push gigabytes per second while enterprise software ends up stuck at megabytes per second by the time it's all done. Whatever reporting, business logic and such the enterprise software does can probably also be done at gigabytes per second on a two-core consumer CPU, if written as if you were back in the 80's writing a game. The most famous example I've heard of approaching an entire business logic in this manner is the LMAX forex exchange, which published some of its ring-buffer-based code, said to be inspired by network card drivers.
Forgetting all the theory, if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at the readfile source from winimage, unless you want to dig into the SDK/driver samples. It may need some source code fixes to calculate performance correctly at SSD speeds. Experiment with buffer sizes as well.
The switches /h (multi-threaded) and /o (overlapped, completion-port I/O), with an optimal buffer size (try 32, 64, 128 KB, etc.) and no Windows file buffering, in my experience give the best performance when reading from an SSD (cold data) while simultaneously processing (use /a for Adler checksumming, as otherwise it's too CPU-bound).
I seem to recall a project in which we were reading big files. Our implementation used multithreading: basically n worker threads started at incrementing offsets of the file (0, chunk_size, 2x chunk_size, 3x chunk_size ... (n-1)x chunk_size) and each read smaller chunks of information. I can't exactly recall our reasoning for this, as someone else was designing the whole thing; the workers weren't the only thing to it, but that's roughly how we did it.
Hope it helps
It's not stated in the problem whether the sequence matters or not. So divide the file into equal parts, say 1 GB each; since you are using multiple CPUs, multiple threads won't be a problem, so read each part using a separate thread. If you have more than 10 GB of RAM, all the contents can be held in memory while being read by multiple threads.
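A rough sketch of that splitting, assuming newline-delimited records (the file name and worker count are placeholders; note that in CPython, processes may parallelize the parsing better than threads because of the GIL, which matches what the question observed):

```python
import os
from concurrent.futures import ThreadPoolExecutor

PATH = "huge.log"        # placeholder file name
NUM_WORKERS = 4

def count_lines(path, start, end):
    """Count the records whose first byte lies in [start, end)."""
    count = 0
    with open(path, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()          # skip ahead to the first record owned by this chunk
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            count += 1
    return count

size = os.path.getsize(PATH)
bounds = [i * size // NUM_WORKERS for i in range(NUM_WORKERS + 1)]

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    futures = [pool.submit(count_lines, PATH, bounds[i], bounds[i + 1])
               for i in range(NUM_WORKERS)]
    print("total records:", sum(f.result() for f in futures))
```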
I am opening files using memory mapping. The files are apparently too big (6 GB on a 32-bit PC) to be mapped in one go. So I am thinking of mapping part of the file each time and adjusting the offset for the next mapping.
Is there an optimal number of bytes for each mapping or is there a way to determine such a figure?
Thanks.
There is no optimal size. With a 32-bit process, there is only 4 GB of address space in total, and usually only 2 GB is available to user-mode processes. This 2 GB is then fragmented by code and data from the exe and DLLs, heap allocations, thread stacks, and so on. Given this, you will probably not find more than 1 GB of contiguous space to map a file into.
The optimal number depends on your app, but I would be wary of mapping more than 512 MB into a 32-bit process. Even limiting yourself to 512 MB, you might run into some issues depending on your application. Alternatively, if you can go 64-bit, there should be no problem mapping multiple gigabytes of a file into memory; your address space is large enough that this shouldn't cause any issues.
You could use an API like VirtualQuery to find the largest contiguous space, but then you're effectively forcing out-of-memory errors to occur, since you are consuming large amounts of address space.
EDIT: I just realized my answer is Windows-specific, but you didn't say which platform you are discussing. I presume other platforms have similar limiting factors for memory-mapped files.
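In case it helps, here is a rough sketch of the windowed-mapping approach (in Python for brevity; the 256 MB window size is just an example, and process() is a placeholder for whatever you do with each view):

```python
import mmap
import os

PATH = "big_input.bin"            # placeholder file name
WINDOW = 256 * 1024 * 1024        # 256 MB windows, comfortably under the 32-bit limits above

def process(view):
    # Placeholder per-window processing: here it just reports how many bytes were mapped.
    return len(view)

size = os.path.getsize(PATH)
# Offsets passed to mmap must be multiples of the allocation granularity;
# a power-of-two window size keeps every offset aligned automatically.
assert WINDOW % mmap.ALLOCATIONGRANULARITY == 0

total = 0
with open(PATH, "rb") as f:
    offset = 0
    while offset < size:
        length = min(WINDOW, size - offset)
        with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=offset) as view:
            total += process(view)
        offset += length

print(total)   # with the placeholder process(), this equals the file size
```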
Does the file need to be memory mapped?
I've edited 8 GB video files on a 733 MHz PIII (not pleasant, but doable).