How can I handle multiple sockets within a Perl daemon with large memory usage? - perl

I have created a client-server program in Perl using IO::Socket::INET. I access the server through a CGI-based site. My server program runs as a daemon and accepts multiple simultaneous connections. The server process consumes about 100 MB of memory (9 large hashes, among other data structures). I want these hashes to stay resident in memory and be shared, so that I don't have to recreate them for every connection. Building the hashes takes 10-15 seconds.
Whenever a new connection is accepted through the socket, I fork a new process to handle it. Since the parent process is huge, every fork tries to allocate that much memory for the child; with limited RAM, spawning a child takes a long time, which increases the response time. Often it hangs even for a single connection.
The parent process creates the 9 large hashes. Each child needs read-only access to one or more of them; children never update the hashes. I want something like copy-on-write, so that the whole 100 MB, or all of the parent's global variables, can be shared with every child, or some other mechanism such as threads. I expect the server to get at least 100 requests per second, and it should be able to process them all in parallel. On average, a child will exit within 2 seconds.
I am using Cygwin on Windows XP with only 1 GB of RAM. I haven't found any way around this. Can you suggest something? How can I share the variables, and also create 100 child processes per second, manage them and synchronize them?
Thanks.

Instead of forking, there are two other approaches for handling concurrent connections: threads, or polling.
In the thread approach, a new thread is created for each connection and handles that socket's I/O. A thread runs in the same virtual address space as the creating process and can access all of its data. Make sure you use locks properly to synchronize write access to that data.
An even more efficient approach is polling via select(). Here a single process/thread handles all the sockets. This works under the assumption that most of the work is I/O, and that the time spent waiting for one socket's I/O to finish is spent servicing the other sockets.
Go research further on those two options and decide which one suits you best.
See for example: http://www.perlfect.com/articles/select.shtml
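As a rough illustration, a minimal select()-based echo server using IO::Select might look like this (a sketch only; the port and line-based protocol are placeholder assumptions, and a production server would also use non-blocking sysread):

use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Placeholder port; adjust for your environment.
my $server = IO::Socket::INET->new(
    LocalPort => 8080,
    Listen    => 10,
    Reuse     => 1,
) or die "listen: $!";

my $select = IO::Select->new($server);

while (1) {
    # Blocks until at least one watched socket is readable.
    for my $sock ($select->can_read) {
        if ($sock == $server) {
            # New connection: watch it alongside the others.
            my $client = $server->accept;
            $select->add($client) if $client;
        }
        elsif (defined(my $line = <$sock>)) {
            # The big hashes live in this single process, so lookups
            # here need no forking, no copying and no locking.
            print $sock "echo: $line";
        }
        else {
            # Client closed the connection.
            $select->remove($sock);
            close $sock;
        }
    }
}

Because everything runs in one process, the 100 MB of hashes exists exactly once, and every connection reads it directly.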

If you have that much data, I wonder why you don't simply use a database?

This architecture is unsuitable for Cygwin. Forking on real unix systems is cheap, but on fake unix systems like Cygwin it's terribly expensive, because all data has to be copied (real unices use copy-on-write). Using threads changes the memory usage pattern (higher base usage, but smaller increase per thread), but odds are it will still be inefficient.
I would advise you to use a single-process approach with polling, and maybe non-blocking I/O too.

Related

A program resistant to power/hardware/OS failures

I need to write a program that performs a parallel search of a large space of possible states, with new areas being discovered (and their exploration started) along the way, and exploration of some areas being cut short when intermediate results obtained elsewhere rule out finding anything useful in them. The search is performed by multiple threads running in heavy cooperation with each other to avoid recalculating intermediate data.
A complex internal state (including call stacks of several threads and state synchronization primitives they use) has to be maintained and updated during the whole process, and there is no apparent way to split the computation into isolated chunks that can be executed sequentially, each saving and passing a small intermediate result to the next. Also, there is no way to split the computation into independent parallel threads not communicating with each other, without imposing a prohibitive overhead due to recalculation of large amount of intermediate data.
Because of the large search domain, the program would possibly run for months before producing a final result. Hence, there is a significant risk of power, hardware or OS failure during the program execution that can lead to a complete loss of all work that has been done to the moment. In such a case the program will need to restart all its computations from scratch.
I need a solution that can prevent a complete data loss in such cases. I thought of an execution engine/platform that continuously saves the current state of the process to a failure-resistant storage like a redundant disk array or database. But I understand that this approach can significantly slow down the process, even to a degree when there would be no benefit compared to an expected computation time including restarts due to possible failures.
In fact, I do not need an ideal solution that continuously saves the program state, and I can easily bear a loss of hours or maybe even days of work. A possible heavyweight solution that comes to my mind is to run the program inside a virtual machine, saving its snapshots from time to time, and restoring the machine after a possible host failure from a recent snapshot. This approach can also help to recover the program state after a random or preventable guest OS failure.
Is there a similar, but more lightweight solution limited to preserving a state of a single process? Or could you suggest any other approaches that can solve my problem?
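For concreteness, the kind of periodic state saving under discussion might look like this in Perl (a minimal, hypothetical sketch; it assumes the state can be reduced to a plain serializable hash, which, as noted above, is not straightforward for call stacks and synchronization primitives):

use strict;
use warnings;
use Storable qw(store retrieve);

# Hypothetical sketch: assumes the search state can be reduced to a
# plain hash (real call stacks and mutexes cannot be serialized).
my $state_file = 'search_state.bin';

# Resume from the last checkpoint if one exists.
my $state = -e $state_file
    ? retrieve($state_file)
    : { frontier => [0], visited => {} };

while (@{ $state->{frontier} }) {
    my $node = shift @{ $state->{frontier} };
    next if $state->{visited}{$node}++;
    # ... explore $node, push newly discovered nodes onto the frontier ...

    # Checkpoint every 10_000 nodes: write to a temporary file first,
    # then rename, so a crash mid-write cannot corrupt the snapshot.
    if (keys(%{ $state->{visited} }) % 10_000 == 0) {
        store($state, "$state_file.tmp");
        rename "$state_file.tmp", $state_file;
    }
}

The temp-file-plus-rename step matters: it makes each snapshot atomic, so a failure during a write leaves the previous checkpoint intact.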
You may want to look at Erlang, which allows large numbers of lightweight processes to run at relatively low cost. Because each one is cheap, redundancy can be used to achieve increased reliability.
For the problem you present, a triple-redundancy scheme may be the way to go, where periodic checks for synchronization across the three (or more) systems would determine by vote who has failed.

Run-time memory of a Perl script

I have a Perl script that gets killed by an automated job whenever a high-priority process arrives, because my script runs ~300 parallel jobs for downloading data and consumes a lot of memory. I want to figure out how much memory it uses at run time so that I can request more memory before scheduling the script, or, if some tool can show me which parts of my code take the most memory, I can optimize the code instead.
Regarding OP's comment on the question, if you want to minimize memory use, definitely collect and append the data one row/line at a time. If you collect all of it into a variable at once, that means you need to have all of it in memory at once.
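In Perl terms, that means streaming rather than slurping; a small sketch (file names invented for illustration):

use strict;
use warnings;

# Stream line by line: only the current line is ever held in memory,
# instead of accumulating the whole download in a Perl variable.
open my $in,  '<',  'chunk.dat'    or die "open: $!";
open my $out, '>>', 'combined.dat' or die "open: $!";
print {$out} $_ while <$in>;
close $out or die "close: $!";
close $in;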
Regarding the question itself, you may want to look into whether it's possible to have the Perl code just run once (rather than running 300 separate instances) and then fork to create your individual worker processes. When you fork, the child processes will share memory with the parent much more efficiently than is possible for unrelated processes, so you will, e.g., only need to have one copy of the Perl binary in memory rather than 300 copies.
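A rough sketch of that build-once-then-fork layout (worker count and data are stand-ins):

use strict;
use warnings;

# Build the big data once, in the parent, before any forking.
my %big_table = map { $_ => $_ ** 2 } 1 .. 100_000;    # stand-in data

my @pids;
for my $worker (1 .. 4) {        # a few workers instead of 300 scripts
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: on real unix systems these pages are shared
        # copy-on-write, so %big_table is not physically duplicated
        # as long as the child only reads it.
        my $sum = 0;
        $sum += $big_table{$_} for 1 .. 100;
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;         # reap the workers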

Perl threads faster than sequential processing?

Just wanted to ask whether it's true that parallel processing is faster than sequential processing.
I've always thought that parallel processing is faster, so I ran an experiment.
I benchmarked my scripts and found that, after repeatedly running a subroutine like
sub add {
    my @data;
    for (my $x = 0; $x <= 200000; $x++) {
        $data[$x] = $x / ($x + 2);
    }
}
threading seems to be slower by about 0.5 CPU seconds on average. Is this normal, or is it really true that sequential processing is faster?
Whether parallel vs. sequential processing is better is highly task-dependent and you've already done the right thing: You benchmarked both and determined for your task (the one you benchmarked, not necessarily the one you actually want to do) which one is faster.
As a general rule, on a single processor, sequential processing tends to be better for tasks which are CPU-bound, because if you have two tasks each needing five seconds of CPU time to complete, then you'll need ten seconds of CPU time regardless of whether you do them sequentially or in parallel. Setting up multiple threads/processes will, therefore, provide no benefit, but it will create additional task-switching overhead while also preventing you from having any results until all results are available.
CPU-bound tasks on a multi-processor system tend to do better when run in parallel, provided that they can run independently of each other. If not, or if you're using a language/threading model/IPC model/etc. which forces all tasks to run on the same processor, then see "on a single processor" above.
Parallel processing is generally better for tasks which are I/O-bound, regardless of the number of processors available, because CPUs are fast and I/O is slow, so working in parallel allows one task to process its data while the other is waiting for I/O operations to complete. (This is why make -j2 tends to be significantly faster than a plain make, even on single-processor machines.)
But, again, these are all generalities and all have cases where they'll be incorrect. Only benchmarking will reveal the truth with certainty.
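For instance, such a comparison could be set up with the Benchmark module (a sketch that assumes a threads-enabled perl and reuses the add() routine from the question):

use strict;
use warnings;
use threads;
use Benchmark qw(timethese);

sub add {
    my @data;
    for (my $x = 0; $x <= 200000; $x++) {
        $data[$x] = $x / ($x + 2);
    }
}

# Same total work either way: two runs of add(), executed one after
# the other versus in two concurrent threads.
timethese(10, {
    sequential => sub { add(); add() },
    threaded   => sub {
        my @workers = map { threads->create(\&add) } 1 .. 2;
        $_->join for @workers;
    },
});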
Perl threads suck, badly. You are better off in nearly every case forking several processes.
When you create a new thread in perl, it does the following:
Make a copy - yes, a real copy - of every single perl data structure in scope, including those belonging to modules you didn't write
Start up what is almost a new, independent instance of perl in a new OS thread
If you then want to share anything (given that it has just copied everything), you have to use the share function from the threads::shared module. This is incredibly clunky: it replaces your variable with some tie() magic that wraps much-too-fine-grained locking around it to prevent concurrent access. Accessing a shared variable then incurs a massive amount of implicit locking and is incredibly slow.
So in short, perl threads:
Take a long time to start
Waste loads of memory
Cannot share data efficiently anyway.
You are much better off with fork(), which does not copy every variable (the kernel does copy-on-write) unless you're on Windows.
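For reference, the sharing mechanism described above looks roughly like this (a minimal sketch):

use strict;
use warnings;
use threads;
use threads::shared;

# Every read or write of %cache now goes through the implicit,
# tie()-style locking layer described above.
my %cache : shared;

my @workers = map {
    threads->create(sub {
        lock(%cache);                 # explicit lock for a compound update
        $cache{ threads->tid } = 1;
    });
} 1 .. 4;
$_->join for @workers;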
There's no reason to assume that in a single CPU core system, parallel processing will be faster.
Consider this example (a PNG diagram in the original answer):
The red and blue lines at the top represent two tasks running sequentially on a single core.
The alternating red and blue segments at the bottom represent the same two tasks running interleaved in parallel on that core. The core performs the same total work either way, so interleaving gains nothing and adds switching overhead.

Reading Multiple Files in Multiple Threads using C#, Slow!

I have an Intel Core 2 Duo CPU, and I was reading 3 files from my C: drive and showing some matching values from the files in an EditBox on screen. The whole process took 2 minutes. Then I tried processing each file in a separate thread, and the whole process took 2 minutes 30 seconds, i.e. 30 seconds more than the single-threaded version!
I was expecting it to be the other way around! I can see both graphs in the CPU usage history. Can someone please explain to me what is going on?
here is my code snippet.
foreach (FileInfo file in FileList)
{
    // One thread per file: all of them end up competing for the same disk.
    Thread t = new Thread(new ParameterizedThreadStart(ProcessFileData));
    t.Start(file.FullName);
}
where ProcessFileData is the method that processes each file.
Thanks!
The root of the problem is that the files are on the same drive and, unlike your dual core processor, your hard drive can only do one thing at a time.
If you read two files simultaneously, the disk heads will jump from one file to the other and back again. Given that your hard drive can read each file in roughly 40 seconds, it now has the additional overhead of moving its disk head between the three separate files many times during the read.
The fastest way to read multiple files from a single hard drive is to do it all in one thread and read them one after another. This way, the head only moves once per file read (at the very beginning) and not multiple times per read.
To optimize this, either change your logic (do you really need to read the entire contents of all three files?), or buy a faster hard drive, put the three files on three separate drives and keep the threading, or use a RAID array.
If you read from disk using multiple threads, then the disk heads will bounce around from one part of the disk to another as each thread reads from a different part of the drive. That can reduce throughput significantly, as you've seen.
For that reason, it's actually often a better idea to have all disk accesses go through a single thread, to help minimize disk seeks.
If your task is I/O bound and if it needs to run often, you might look at a tool like "contig" to make sure the layout of your files on disk is optimized / contiguous.
If your processing is mostly I/O-bound rather than CPU-bound, it makes sense that it takes the same time or even longer.
How do you compare those files? Think about where the bottleneck of your application is: I/O, CPU, memory...
Multithreading is only interesting for CPU-bound processing, i.e. complex calculations, comparison of data in memory, sorting, etc.
Since your process is I/O-bound, you should let the OS do the threading for you. Look at FileStream.BeginRead() for an example of how to queue up your reads. Your EndRead() callback can then kick off the read of the next block of data, pointing back to itself to handle each subsequent completed block.
Also, by creating additional threads you give the OS more threads to manage. And if a different CPU happens to be picked to handle a completed read, you've lost all the CPU cache warmth from where your thread originated.
As you've found, you can't "speed up" an application just by adding threads.

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put: most definitely. Read on for some discussion.
Tuning a service is an art form, or rather requires benchmarking (and the space of configurations you would need to benchmark is huge). I believe it depends on factors such as the following (this list is not exhaustive):
how much time an item picked up from the ready queue takes to process;
how many worker threads there are;
how many producers there are, and how often they produce;
what kind of wait primitives you use: spin-locks or kernel waits (the latter being slower).
So if items are produced often, the number of threads is large, and per-item processing time is low, the data structure could be locked for long stretches, causing thrashing.
Other factors include the data structure used and how long it stays locked. For example, if you use a linked list to manage such a queue, the add and remove operations take constant time; a priority queue (a heap) takes more operations on average (logarithmic in the queue size) when items are added.
If your system is for business processing, you could take this question out of the picture by just using:
A process-based architecture: spawn multiple producer/consumer processes and use the file system for communication (see the sketch after this list), or
A non-preemptive, cooperative threading language or runtime such as Stackless Python, Lua or Erlang.
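A minimal sketch of the file-system-as-queue idea from the first option (directory name and naming scheme are invented): producers drop items into a spool directory and consumers claim them with an atomic rename, so no in-memory locks are shared:

use strict;
use warnings;
use File::Temp qw(tempfile);

# Invented layout: a spool directory acts as the shared queue.
my $spool = 'spool';
mkdir $spool unless -d $spool;

# Producer: write the item to a temp file, then rename it into
# place so it appears to consumers only when fully written.
sub produce {
    my ($item) = @_;
    my ($fh, $tmp) = tempfile(DIR => $spool, SUFFIX => '.tmp');
    print {$fh} $item;
    close $fh or die "close: $!";
    (my $ready = $tmp) =~ s/\.tmp\z/.job/;
    rename $tmp, $ready or die "rename: $!";
}

# Consumer: claim an item by renaming it; whoever wins the rename
# owns the item, so no locking primitives are needed.
sub consume {
    for my $job (glob "$spool/*.job") {
        my $claimed = "$job.$$";
        next unless rename $job, $claimed;
        open my $fh, '<', $claimed or next;
        my $item = do { local $/; <$fh> };
        close $fh;
        unlink $claimed;
        # ... process $item ...
    }
}

Since rename() is atomic within a file system, a consumer either owns an item completely or skips it; two consumers can never process the same file.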
Also note: synchronization primitives cause inter-processor cache-coherence traffic, which is not good and should therefore be used sparingly.
The discussion could go on to fill a Ph.D. dissertation :D
A per-CPU ready queue is a natural choice for the data structure. This is because most operating systems try to keep a process on the same CPU, for many reasons you can google for. What does that imply? If a thread is ready and another CPU is idling, the OS will not quickly migrate the thread to that CPU; load balancing only kicks in over the long run.
Had the situation been different, that is, if keeping thread-CPU affinity were not a design goal and thread migration were frequent, then keeping separate per-CPU run queues would be costly.