Efficient way to transfer files to 1000s of servers - file-transfer

I was recently asked this question in an interview. Let's suppose I have 2000 servers. I want to transfer a 5GB file to all these servers from a centralized server. Come up with an efficient algorithm.
My response:
I will use perl/python to scp the file over from the centralized server to the first server.
In parallel, I will also start sending the file to the other servers. Copying to one server at a time is very inefficient, so doing it in parallel should speed things up.
Is there a better way to do this?

Sure, you would use some sort of script, since you don't want to do that manually.
But instead of sending the file from one server to all the others, you would start by sending it to k servers. As soon as these k servers have received the file (let's say at time t), they can start distributing it too, so after approx. time 2*t, k^2 servers already have the file instead of 2*k in the original solution. After time 3*t, k^3 servers have the file, and so on. You continue with that algorithm until every server has its copy.
To make the whole process a bit faster still, you could also divide the file into chunks, so that a server can start redistributing it before it has received the whole file (you will end up with something like torrent).
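As a rough illustration of how quickly this fan-out covers 2000 servers, here is a minimal sketch. It assumes a slightly simplified model that is not spelled out in the answer: in every round, each machine that already holds the file (the central server included) sends it to `fan_out` machines that lack it, and every round takes the same time t. The function name and parameters are made up for the example.

```python
def rounds_to_reach(n_servers: int, fan_out: int) -> int:
    """Number of rounds until n_servers machines (besides the source) hold the file.

    Assumed model: each round, every holder sends the whole file to
    `fan_out` machines that do not have it yet.
    """
    if n_servers < 0 or fan_out < 1:
        raise ValueError("n_servers must be >= 0 and fan_out >= 1")
    holders = 1  # the central server
    rounds = 0
    while holders - 1 < n_servers:
        holders *= fan_out + 1  # every holder adds fan_out new holders
        rounds += 1
    return rounds


if __name__ == "__main__":
    for k in (1, 3, 10):
        print(f"fan-out {k}: {rounds_to_reach(2000, k)} rounds")
```

In practice a larger fan-out also splits each sender's upload bandwidth, so t itself grows with k; the sketch ignores that.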

Definitely "torrent" is the best and proven strategy for load-balancing in this scenario. But I think when an interviewer asks such a hypothetical question, she is probably also looking for your assumptions and expecting counter-questions, for example about:
upload / download capacity of the servers.
network locality, i.e. how many hops apart the different machines are.
whether the file can be archived (compressed) before it is sent.
how to verify integrity (e.g. an md5 hash).
Now my scheme remains the same "torrent" thanks to @Misch. But if all servers are on the same network and have the same capacity, then:
Divide the file into 2000 parts, so each server gets 5GB/2000 ~ 2.5 MB (one file segment) to host, and the central server acts as a beacon that tells the other servers where the segments are.
Each server downloads these segments in random order from the other servers; if everyone downloaded sequentially, it would cause a bottleneck on one machine at a time.
Depending on the machine, we can cap the number of active upload/download threads, with each thread uploading or downloading a separate file segment. When a server is already serving its maximum number of hosts, it can reject a download request. The requesting host would simply pick another random segment to download.
We can use a checksum for each individual file segment and for the whole file, to verify file integrity.
This ensures that all servers are uploading/downloading close to their upstream/downstream bandwidth. But obviously in the real world I can have a secured torrent and just use that instead.
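The segment-plus-checksum idea above can be sketched as follows. This is only an illustration: the function names and the manifest layout are invented for the example, and it uses SHA-256 from Python's hashlib rather than the md5 mentioned earlier.

```python
import hashlib


def build_manifest(data: bytes, n_segments: int) -> dict:
    """Split data into about n_segments equal parts and hash each part and the whole."""
    if n_segments < 1:
        raise ValueError("n_segments must be >= 1")
    # Ceiling division so that at most n_segments segments cover the data.
    seg_size = max(1, -(-len(data) // n_segments))
    segments = [data[i:i + seg_size] for i in range(0, len(data), seg_size)]
    return {
        "segment_size": seg_size,
        "segment_hashes": [hashlib.sha256(s).hexdigest() for s in segments],
        "file_hash": hashlib.sha256(data).hexdigest(),
    }


def verify_segment(manifest: dict, index: int, segment: bytes) -> bool:
    """Check one downloaded segment against the hash published in the manifest."""
    return hashlib.sha256(segment).hexdigest() == manifest["segment_hashes"][index]
```

The central "beacon" server would publish the manifest; each server checks every segment as it arrives and the whole-file hash once reassembly is complete.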

If you split the file into tiny chunks, then each server can begin transferring the chunks that it has received before the entire file has even been downloaded. This is basically the algorithm that bittorrent uses, and is MUCH (i.e. asymptotically) faster than having each server send the file only after it has received the whole thing.
In fact, with an infinitesimally small chunk size (i.e. the purely theoretical case), the time it takes to distribute a file of size m to n servers doesn't even depend on the value of n -- only on the size of the file being distributed (i.e. O(m)). Of course, in the practical case there are some overheads/details to consider (which d1val summarized nicely) which make it take slightly longer in practice.
Conversely, if you have each server upload the file to another server only after it has received the whole file, then the running time is O(m log(n)) -- which is asymptotically larger than the bittorrent approach.
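To put numbers on the two running times, here is a small cost model. It is a sketch under simplifying assumptions that are mine, not the answer's: every link has the same bandwidth, the store-and-forward case doubles the number of holders each round, and the chunked case is modelled as a simple chain in which each server forwards a chunk to the next as soon as it arrives (real bittorrent swarms are more elaborate).

```python
def store_and_forward_time(file_size: float, bandwidth: float, n_servers: int) -> float:
    """Each holder sends the whole file to one new server per round: O(m log n)."""
    # n.bit_length() equals ceil(log2(n + 1)) for n >= 0, without float rounding.
    rounds = n_servers.bit_length()
    return rounds * file_size / bandwidth


def pipelined_chain_time(file_size: float, bandwidth: float,
                         n_servers: int, chunk_size: float) -> float:
    """Chunks flow down a chain of servers; the last server finishes at (m + (n-1)c)/b."""
    return (file_size + (n_servers - 1) * chunk_size) / bandwidth


if __name__ == "__main__":
    # 5000 MB file, 100 MB/s links, 2000 servers (illustrative figures).
    print(store_and_forward_time(5000, 100, 2000))
    for chunk_mb in (1, 0.1, 0.01):
        print(chunk_mb, pipelined_chain_time(5000, 100, 2000, chunk_mb))
```

As the chunk size shrinks, the pipelined time approaches file_size / bandwidth, which is the O(m) limit described above.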
Also, just to add, usually when an interviewer asks this kind of question, he/she is asking about the algorithm, not so much the implementation details.

I was asked a similar kind of question, where the torrent way of doing things was not accepted.
The question was "If Microsoft has to push a software update to 2000 servers it has across the US, then how would it do it?" So these servers are not capable of doing torrent-based file transfer.
My Answer was:
From the main server, which has a list of the 2000 nodes, run a batching process. The size of each batch is determined by the network speed you have to these nodes.
So first select a sample of 100 nodes and do a speed test against them. The speed test gives the median speed available across these 100 nodes, which may serve as an estimate for the entire network.
So now you have a value of X Mbps as the speed at which you can transfer to each of these nodes.
Look at the capacity of your own outgoing link. If the central server has an upload capacity of Y Gbps,
then the batch size = your upload capacity (Y) / X (the speed found by the speed test), with both expressed in the same unit.
With this batch size, you transfer to the 2000 servers in parallel, one batch at a time.
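The batch-size calculation can be written down in a few lines. This is a sketch only; the function name and the sample figures in the usage note are made up, and it converts Gbps to Mbps with a factor of 1000.

```python
import math
import statistics


def plan_batches(sample_speeds_mbps, upload_capacity_gbps, n_servers):
    """Return (batch_size, n_batches) from sampled per-node speeds and the uplink capacity."""
    per_node_mbps = statistics.median(sample_speeds_mbps)  # X
    upload_mbps = upload_capacity_gbps * 1000              # Y, converted to Mbps
    # How many nodes can be fed at full speed at once; never less than one.
    batch_size = max(1, int(upload_mbps // per_node_mbps))
    n_batches = math.ceil(n_servers / batch_size)
    return batch_size, n_batches
```

For example, with a median of 100 Mbps per node and a 10 Gbps uplink, `plan_batches([80, 100, 120], 10, 2000)` gives batches of 100 servers and 20 batches in total.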
Any Inputs ?

I guess you could put the file on an NFS server and have your hosts mount that NFS partition.

Related

determine ideal number of workers and EC2 sizing for master

I have a requirement to use Locust to simulate 20,000 (and higher) users in a 10 minute test window.
The locustfile is a task sequence of 9 API calls. I am trying to determine the ideal number of workers, and how many workers should be attached to each EC2 instance on AWS. My testing shows that with 20 workers on two EC2 instances, the CPU load is minimal. The master, however, suffers big time: a 4 CPU, 16 GB RAM system as the master ends up thrashing to the point that the workers start printing messages like this:
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.util.exception_handler: Retry failed after 3 times.
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/ERROR/locust.runners: RPCError found when sending heartbeat: ZMQ sent failure
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.runners: Reset connection to master
The master seems memory exhausted, as each locust master process has grown to 12GB of virtual memory. OK, so the EC2 instance has a problem. But if I need to test 20,000 users, is there a machine big enough on the planet to handle this? Or do I need to take a different approach and, if so, what is the recommended direction?
In my specific case, one of the tasks downloads a randomly selected file from CloudFront. This means the more open connections to CloudFront trying to download a file, the more congested the available network becomes.
Because the app client is actually a native app on a mobile device, and there are a lot of factors affecting the download speed for each device, I decided to switch from a GET request to a HEAD request. This allows me to test the response time from CloudFront, where the distribution is protected by a Lambda@Edge function which authenticates the user using data from earlier in the test.
Doing this dramatically improved the load test results, and it doesn't artificially skew the other testing: with bandwidth or system resource exhaustion, every other test is negatively impacted.
Using this approach I successfully executed a 10,000 user test in a ten minute run-time. I used 4 EC2 t2.xlarge instances with 4 workers per instance. The 9 tasks in the test plan resulted in almost 750,000 URL calls.
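For sizing purposes it helps to turn those totals into a request rate. The arithmetic below is a quick sketch using the rounded figures from this post (about 750,000 calls, 10 minutes, 4 instances with 4 workers each); the function name is made up.

```python
def request_rate(total_requests: int, duration_s: float, n_workers: int):
    """Return (overall requests/s, requests/s per worker)."""
    overall = total_requests / duration_s
    return overall, overall / n_workers


if __name__ == "__main__":
    overall, per_worker = request_rate(750_000, 10 * 60, 4 * 4)
    print(f"{overall:.0f} req/s overall, {per_worker:.1f} req/s per worker")
```

That works out to roughly 1,250 requests per second in total, or about 78 per worker, which is a more useful number to scale from than the user count alone.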
The answer for the question in the title is: "It depends"
Your post is a little confusing. You say you have 10 master processes? Why?
This problem is most likely not related to the master at all, as it does not care about the size of the downloads (which seems to be the only difference between your test case and most other locust tests).
There are some general tips that might help:
Switch to FastHttpUser (https://docs.locust.io/en/stable/increase-performance.html)
Monitor your network usage (if your load gens are already maxing out their bandwidth or CPU then your test is very unrealistic anyway, and adding more users just adds to the noise). In general, start low and work your way up.
Increase the number of loadgens
In general, the number of users is not an issue for locust, but number of requests per second or bandwidth might be.

How many parallel processes?

I am running some code in parallel by using a forking module in Perl called Parallel::ForkManager. I am currently setting the maximum number of processes to 30:
my $pm = Parallel::ForkManager->new(30);
What would be an advisable maximum number of processes to create? I am doing this on a commercial grade Solaris server, but I still don't want to overload the system.
In downloading files, this really depends on
how many different hosts you're downloading from, and
how fast they will give you the requested files compared to your maximum bandwidth.
If you're downloading files from a single machine to a single machine on a local network, 2-3 is about max. If you're downloading files from 30 different servers on the internet, all of which are slow, but you have a fat pipe, then 30 might be reasonable.
There is no one universal right answer here. Unless you count "it depends."
The purpose of "downloading files" was mentioned, but only in comments a while ago, so I take the question as stated and also treat it more generally.
The only relevant measure is when you start reaching saturation in performance gains, with particular software on that system. The formal limits are huge and meaningless while rules of thumb are very general.
Let's imagine that you run 10 processes and the time to complete the job drops 10 times. Increase to 20 processes and the time drops 20 times -- but for 30 processes the gain is only a factor of 10. At this point we have loaded the system. Push further and the performance will degrade rapidly, and for everyone. At that point the server is overloaded, even though it allows, say, 1024 processes per user (and really ten or more times that for a server).
With a few processes per core the machine is engaged and I'd say that that is a good rule of thumb. However, it is too general. I doubt that you'd gain much in performance by going to that many processes, given the many other factors that affect it.
Accessing one web server: The server's capability is the gospel. They may have posted how many requests per second they are happy with. Or they may have a limit on the number of processes per user, say 10 or 20. If that means that many simultaneous downloads then that's your limit. But I'd be careful -- if the site is close and fast a request may complete in as little as 0.1 or 0.2 seconds. Then, with 10 processes you may be hitting the server 100 times a second. I do not recommend that. If there is no information I'd say keep it to a few requests per second. The performance and server load also depend on the content -- big downloads are different from pulling many skinny web pages. The I/O on your side may matter but I'd expect the server to set the limit. If you are going to use their service a lot, why not send an email and ask what they are OK with?
I/O, network (many servers) or disk: With network the performance depends on every piece of hardware in the path as well as on software. Nobody can tell without trying it out. The disk I/O is very complex. To add to the trouble, it is unclear whether it'd be your disks or the network that is the bottleneck. I'd expect clear performance gains up to a few tens of processes, and probably fewer.
CPU or memory bound: This may be easiest -- processing that can be broken up in parallel on 30 cores can enjoy close to a factor of 30 speedup (given no other bottlenecks). Going beyond the number of cores clearly leads to reduced performance gain. Concurrent (but not parallel) processing is far more complicated. If your code is memory intensive, that is completely different yet again.
Useful basic tools for assessing above components are iostat -xzn, netstat -I, and vmstat. But there is a bit of a curve to learning how to interpret their output and hopefully it doesn't come to that.
The conclusion is that you have to time it. Take your real application and time it running in one process. Do this 3 to 5 times and see the average (throw away obvious outliers). Then repeat with 5 processes, then with 10, etc. I'd expect that the trend will start slowing down far sooner than the 30 processors you mention. Once it gets to that the system is loaded and whoever works on it will notice. Very soon after that the performance will likely degrade rapidly. Proper benchmarking tools, like Benchmark, are far more sophisticated but this may well settle the issue. If you see strange or inconsistent behavior you may have to dig into details, starting with tools mentioned above.
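Once you have those timings, picking the cut-off can be automated. Below is a minimal sketch (in Python rather than Perl) that walks through measured wall-clock times and stops at the first process count that no longer brings a worthwhile gain. The function name, the 1.1 threshold and the timings in the usage note are hypothetical, not measurements.

```python
def pick_worker_count(timings: dict, min_gain: float = 1.1) -> int:
    """Return the largest process count that still improved the run time by min_gain.

    timings maps process count -> average wall-clock seconds for the same job.
    """
    counts = sorted(timings)
    best = counts[0]
    for n in counts[1:]:
        # Speedup of this step relative to the best configuration so far.
        if timings[best] / timings[n] < min_gain:
            break  # saturation reached: more processes no longer pay off
        best = n
    return best


if __name__ == "__main__":
    measured = {1: 100.0, 5: 22.0, 10: 12.0, 20: 11.0, 30: 14.0}  # made-up numbers
    print(pick_worker_count(measured))
```

With these example numbers the gain from 10 to 20 processes is under 10%, so the function settles on 10.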
What "overloaded" means is a bit unclear. I like to cap my use of resources well before other people are affected. But it may be possible to push it, in particular if you can run when it's quiet. I doubt that you'll keep having a worthy gain all the way to the number of available processors.
So there is no concern about "overloading" the server if you first time things. The performance limit will tell you when to stop. I'd say that your limit of 30 is very reasonable. Unless this is really about downloading files, in which case the web server is likely all that matters.
You should set the maximum number of processes to 60.

How to make APC cache based on distributed hash tables(like memcache)?

I've read an article about Distributed Hash Tables and it seems it's possible to implement something like memcache with APC. As you know, APC is much faster than memcache if we're fetching keys from a single server. So if we make APC distributed we have both performance and distribution. I need some thoughts to get started. Could someone who is familiar with hash tables explain how to do that? How to make APC like memcache?
If you know something about keyspace partitioning and overlay networks, that would be even better.
Although on the surface both pieces of software provide a comparable service, their underpinnings are entirely different, and that explains the dramatic difference in performance.
APC is basically a system that allows you to store objects (be it user objects or parsed opcode chunks) in shared memory. Shared memory, in all systems I know, is as fast as local RAM once you obtained a pointer to it.
So, in short, what APC has to do to write or read an object is:
request shm access and obtain a pointer to it
calculate object offset and size in the shm
memcpy that memory zone into a buffer or vice versa
done
Simple and, taking into account that memory bandwidth nowadays is tens of gigabytes per second, quick.
Due to its distributed nature in a memcache scenario more needs to be done:
client encodes and transmits request
server receives and decodes request
server calculates object offset and size in memcached's memory
server memcpy's that memory zone into a buffer or vice versa
server transmits buffer
client receives and decodes buffer
Now, if we want to distribute APC, the client and server will need to talk to each other. And all of a sudden we find ourselves in a scenario that, with the exception of a few less important details, is identical to the one used by memcache. And all the expensive operations will become necessary again, i.e. all the copying around, sending through the network stack included.
That also explains why, even with a memcache instance running on localhost, without horribly slow gigabit ethernet between the nodes, there is a considerable overhead in what needs to be done to make a distributed system work.
And that's why I'm convinced you're looking at the wrong suspect here: make APC distributed and it will be in the same performance/throughput category.
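The difference between the two step lists can be made concrete with a toy sketch. Nothing here is APC or memcached code: the in-process dictionary stands in for shared memory, JSON stands in for the wire protocol, and a plain function call stands in for the network round trip. All names are invented for the example.

```python
import json

STORE = {"user:1": {"name": "alice"}}  # stands in for the cache's memory


def local_get(key):
    """APC-style access: the value is read straight out of local memory."""
    return STORE[key]


def server_handle(raw_request: bytes) -> bytes:
    """Memcache-style server side: decode request, look up, encode response."""
    key = json.loads(raw_request.decode("utf-8"))
    value = STORE[key]
    return json.dumps(value).encode("utf-8")


def remote_get(key, transport=server_handle):
    """Memcache-style client side: encode, 'send', receive, decode."""
    raw_request = json.dumps(key).encode("utf-8")
    raw_response = transport(raw_request)  # a real system crosses the network stack here
    return json.loads(raw_response.decode("utf-8"))
```

Both calls return the same data, but `remote_get` pays for two encodes, two decodes and a transfer on every lookup, and a distributed APC would have to pay the same costs.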

Why does Perl make the system very slow when I make more than 4,000 database connections?

I was writing code to measure the speed of my database using a Perl script.
My intention was to make 4,000 database connections after each fork (which would act as 4,000 different clients) and sleep, and then issue the update command when I get the signal,
but the system itself becomes very slow and almost hangs while making the connections, and I couldn't even send the signal using my terminal.
I am using the DBI module. I have 4GB RAM in my system, and Postgres 8.3 is running on a different machine.
I'm not entirely clear on whether you're saying you wanted to a) Open 4,000 connections, fork, open 4,000 more connections, etc. or b) Fork 4,000 times and open one connection from each process, but 4,000 database connections or 4,000 processes is some pretty serious resource consumption either way. I'm not at all surprised that it's slowing your system to a crawl - I would expect that to be the end result regardless of the language used.
What are you actually attempting to achieve by creating all of these processes and/or connections? There's probably a better way to do it that won't be quite so resource-intensive.
I've seen pgpool in use on production systems where the number of postgres connections could not be limited to something reasonable. You may wish to look into using this yourself to mitigate against poor application design by your developers.
Essentially, pgpool acts as a proxy to postgres. It multiplexes queries on lots of connections to a much smaller (and manageable) number to the back-end database.
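The multiplexing idea can be illustrated with a generic bounded pool. To be clear, this is not pgpool's API or configuration: `FakeConnection` and `ConnectionPool` are hypothetical stand-ins with no database involved, written only to show many clients sharing a handful of connections.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class FakeConnection:
    """Hypothetical stand-in for a real database connection."""

    def __init__(self, ident: int):
        self.ident = ident
        self.queries = 0

    def execute(self, sql: str) -> str:
        self.queries += 1
        return f"conn{self.ident}: {sql}"


class ConnectionPool:
    """Hands out a fixed number of connections; callers wait when all are busy."""

    def __init__(self, size: int):
        self.connections = [FakeConnection(i) for i in range(size)]
        self._idle = queue.Queue()
        for conn in self.connections:
            self._idle.put(conn)

    def run(self, sql: str) -> str:
        conn = self._idle.get()  # blocks until a connection is free
        try:
            return conn.execute(sql)
        finally:
            self._idle.put(conn)  # always return the connection to the pool


def run_clients(n_clients: int, pool_size: int):
    """Simulate n_clients issuing one update each through pool_size connections."""
    pool = ConnectionPool(pool_size)
    statements = [f"UPDATE t SET x = {i}" for i in range(n_clients)]
    with ThreadPoolExecutor(max_workers=32) as executor:
        results = list(executor.map(pool.run, statements))
    return pool, results
```

Calling `run_clients(4000, 20)` pushes 4,000 statements through only 20 connections, which is the kind of reduction a pooling proxy gives the back-end database.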
That is, relatively speaking, a lot of connections to have at once, but not unheard of by any means. How much memory do you have on the database server? Each connection takes resources; if you don't have a database server set up to handle that volume of connections, it will be slow no matter what language you use to connect.
A simple analogy would be if you had a Toyota Prius (in the old days I would have said Ford Pinto) pulling a semi trailer with 80,000 lbs (typical legal weight in a lot of the states) of weight in it. It would burn that little Prius up in a heartbeat, like you are seeing. To do it right you need to buy yourself a big rig and hook it to that trailer to move that amount of weight.
Ignoring the wisdom of doing 4000 connection forks, you should work through your performance issues with something akin to Devel::NYTProf.
I would alternatively setup persistent workers in gearman and do my gearman client requests. Persistence and your scheduled forks on demand.

What is the fastest way to read 10 GB file from the disk?

We need to read and count different types of messages/run
some statistics on a 10 GB text file, e.g a FIX engine
log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl but
the language doesn't really matter.
I have found some interesting tips in Tim Bray's
WideFinder project. However, we've found that using memory mapping
is inherently limited by the 32 bit architecture.
We tried using multiple processes, which seems to work
faster if we process the file in parallel using 4 processes
on 4 CPUs. Adding multi-threading slows it down, maybe
because of the cost of context switching. We tried changing
the size of thread pool, but that is still slower than
simple multi-process version.
The memory mapping part is not very stable, sometimes it
takes 80 sec and sometimes 7 sec on a 2 GB file, maybe from
page faults or something related to virtual memory usage.
Anyway, Mmap cannot scale beyond 4 GB on a 32 bit
architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. Looked
into Map-Reduce as well, but the problem is really I/O
bound, the processing itself is sufficiently fast.
So we decided to try to optimize the basic I/O by tuning
buffering size, type, etc.
Can anyone who is aware of an existing project where this
problem was efficiently solved in any language/platform
point to a useful link or suggest a direction?
Most of the time you will be I/O bound, not CPU bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you can show that your I/O delivers more data than a single CPU can process, don't waste your time with anything more. Anyway, you should ask: Why on Earth is this in one huge file? Why on Earth don't they split it in a reasonable way when they generate it? That would be an order of magnitude more worthwhile. Then you can put the pieces on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).
Measure, don't assume. Don't forget to flush caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.
This all depends on what kind of preprocessing you can do and when.
On some of the systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is that we don't need to process these files until hours after they're created, and at creation time we don't really have any other load on the machines.
Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing (well, it's done over unix sockets with a custom-made zcat, though). It trades CPU time for disk I/O time, and for our system that has been well worth it. There are of course a lot of variables that can make this a very poor design for a particular system.
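The same "decompress while streaming" trade-off can be sketched without a shell pipeline, here in Python using the standard gzip module. The function name is made up, and counting lines is just a placeholder for whatever the real processing step does.

```python
import gzip


def count_lines_gz(path: str) -> int:
    """Stream a gzip-compressed text file line by line without unpacking it to disk."""
    n = 0
    # "rt" decompresses on the fly and yields text lines, like zcat piped into a reader.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for _line in fh:
            n += 1
    return n
```

Less data comes off the disk, at the price of CPU time spent decompressing, so whether it wins depends on which of the two is the scarcer resource on the machine.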
Perhaps you've already read this forum thread, but if not:
http://www.perlmonks.org/?node_id=512221
It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be what makes your multiple-threads attempt not work.
Best of luck.
I wish I knew more about the content of your file, but not knowing other than that it is text, this sounds like an excellent MapReduce kind of problem.
PS, the fastest read of any file is a linear read. cat file > /dev/null should be the speed that the file can be read.
Have you thought of streaming the file and filtering out any interesting results to a secondary file? (Repeat until you have a file of manageable size.)
Basically you need to "divide and conquer": if you have a network of computers, then copy the 10 GB file to as many client PCs as possible, and get each client PC to read from a different offset of the file. For added bonus, get EACH PC to implement multi-threading in addition to distributed reading.
Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
Realize that manipulating a 10 GB file, transferring it across the network (even a local one), exploring complicated solutions etc. all take time.
I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
hmmm, but what's wrong with the read() command in C? Usually has a 2GB limit,
so just call it 5 times in sequence. That should be fairly fast.
If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.
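A straightforward single-threaded scan with a large buffer might look like the sketch below (Python here, but the shape is the same in Perl). It assumes one FIX message per line, fields separated by either SOH (\x01) or "|", and uses tag 35 (MsgType) to classify messages; the function name and the 1 MiB buffer are illustrative choices, not tuned values.

```python
import re
from collections import Counter

# Tag 35 is MsgType in FIX; it must follow the line start or a field delimiter.
MSG_TYPE = re.compile(rb"(?:^|[\x01|])35=([^\x01|\r\n]+)")


def count_message_types(path: str, buffer_size: int = 1 << 20) -> Counter:
    """Scan the log once, sequentially, counting messages per MsgType."""
    counts = Counter()
    with open(path, "rb", buffering=buffer_size) as fh:
        for line in fh:  # lines are served from the large read buffer
            match = MSG_TYPE.search(line)
            if match:
                counts[match.group(1).decode("ascii", "replace")] += 1
    return counts
```

Reading in binary mode avoids decoding every line, and trying a few different `buffer_size` values is a cheap experiment once the baseline is measured.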
If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).
Since you said platform and language doesn't matter...
If you want a stable performance that is as fast as the source medium allows for, the only way I am aware that this can be done on Windows is by overlapped non-OS-buffered aligned sequential reads. You can probably get to some GB/s with two or three buffers, beyond that, at some point you need a ring buffer (one writer, 1+ readers) to avoid any copying. The exact implementation depends on the driver/APIs. If there's any memory copying going on the thread (both in kernel and usermode) dealing with the IO, obviously the larger buffer is to copy, the more time is wasted on that rather than doing the IO. So the optimal buffer size depends on the firmware and driver. On windows good values to try are multiples of 32 KB for disk IO. Windows file buffering, memory mapping and all that stuff adds overhead. Only good if doing either (or both) multiple reads of same data in random access manner. So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpy's. If using C# there's also penalties for calling into the OS due to marshaling, so the interop code may need bit of optimization unless you use C++/CLI.
Some people prefer throwing hardware at problems, but if you have more time than money, in some scenarios it's possible to optimize things to perform 100-1000x better on a single consumer-level computer than on 1000 enterprise-priced computers. The reason is that if the processing is also latency sensitive, going beyond two cores is probably adding latency. This is why drivers can push gigabytes/s while enterprise software ends up stuck at megabytes/s by the time it's all done. Whatever reporting, business logic and such the enterprise software does can probably also be done at gigabytes/s on a two-core consumer CPU, if written like you were back in the 80's writing a game. The most famous example I've heard of approaching their entire business logic in this manner is the LMAX forex exchange, which published some of their ring buffer based code, which was said to be inspired by network card drivers.
Forgetting all the theory, if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at readfile source from winimage, unless you want to dig into sdk/driver samples. It may need some source code fixes to calculate perf correctly at SSD speeds. Experiment with buffer sizes also.
The switches /h multi-threaded and /o overlapped (completion port) IO with optimal buffer size (try 32,64,128 KB etc) using no windows file buffering in my experience give best perf when reading from SSD (cold data) while simultaneously processing (use the /a for Adler processing as otherwise it's too CPU-bound).
I seem to recall a project in which we were reading big files. Our implementation used multithreading: basically n worker threads started at incrementing offsets of the file (0, chunk_size, 2 x chunk_size, 3 x chunk_size ... (n-1) x chunk_size) and each read smaller chunks of information. I can't exactly recall our reasoning for this, as someone else was designing the whole thing, and the workers weren't the only part of it, but that's roughly how we did it.
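The offset scheme can be sketched like this. It is a reconstruction of the general idea only, not that project's code: the function names are invented, and the ranges are cut on byte boundaries, so a real line-oriented reader would still need to handle lines that straddle two chunks.

```python
def chunk_ranges(file_size: int, n_workers: int):
    """Return (start, end) byte ranges: 0, chunk_size, 2 * chunk_size, ..."""
    if file_size <= 0 or n_workers < 1:
        return []
    chunk_size = -(-file_size // n_workers)  # ceiling division
    return [(start, min(start + chunk_size, file_size))
            for start in range(0, file_size, chunk_size)]


def read_range(path: str, start: int, end: int) -> bytes:
    """What one worker does: seek to its offset and read only its own range."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)
```

Each worker gets one range from `chunk_ranges(os.path.getsize(path), n)`; whether this beats a single sequential reader depends on the storage, as other answers here point out.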
Hope it helps
It's not stated in the problem whether the sequence really matters or not. So,
divide the file into equal parts, say 1GB each. Since you are using multiple CPUs, multiple threads won't be a problem, so read each part using a separate thread, and use a machine with more than 10 GB of RAM, so that all the contents read by the threads can be held in RAM.