I have developed a data-extraction script in Perl using the Parallel::ForkManager module.
It works well when I run the script on a 4-core CPU, but when I run it on a different machine with only 2 cores, CPU usage hits 100%.
My problem is to reduce that CPU usage and run the script smoothly.
I already reduced the max_Child_process value to 10; previously it was 40.
Restriction: my script must parse all pages and store the data in the database within 25 seconds.
Currently it does this in 22-24 seconds but uses the full CPU. Can anybody give me some idea of how to reduce my CPU usage?
Use a usleep when waiting for child processes:
use POSIX ":sys_wait_h";       # provides WNOHANG
use Time::HiRes qw(usleep);    # provides sub-second sleep
until (waitpid(-1, WNOHANG)) {
    usleep(100);               # sleep 100 microseconds instead of busy-spinning
}
This dramatically reduces CPU usage in parallel programs, because the parent yields the CPU between checks instead of spinning in a tight wait loop.
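If you are driving the children through Parallel::ForkManager itself, the same idea can be expressed with its run_on_wait hook, which makes the parent sleep between checks for a free child slot. A minimal sketch, in which the page list and the per-page work are hypothetical placeholders:

use strict;
use warnings;
use Parallel::ForkManager;

my @pages = map { "http://example.com/page/$_" } 1 .. 100;   # hypothetical work list

my $pm = Parallel::ForkManager->new(10);   # cap at 10 concurrent children

# While the parent waits for a free slot, call this (empty) hook and
# sleep 0.1 s between calls instead of polling waitpid in a tight loop.
$pm->run_on_wait(sub {}, 0.1);

for my $page (@pages) {
    $pm->start and next;   # parent: move on to the next page
    # child: fetch and parse $page, then store the results in the database
    $pm->finish;           # child exits
}
$pm->wait_all_children;

With the 10-child cap from the question, this keeps the parent nearly idle while the children do the actual work.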
On a multi-core machine, what is the best metric to understand whether the CPU is loaded or not?
I have a web application that sends a POST request to an Apache CGI server. The CGI server loops over the POST data and launches a Perl process for each item in the loop. Since requests from clients all end up hitting a single endpoint, I am concerned that I may end up creating more processes than my server can handle. Hence I want to understand which system metric I should check before launching a new process from the loop.
Note: I have a 20 core machine.
The reason the answer isn't easy to find is that it depends on the nature of your processes and on which system constraint is your limiting factor.
For CPU-intensive work, the metric to look at is load average - a measure of processes in a runnable state. Very roughly, if the load average equals the number of cores, you're running your CPUs at maximum.
However, it's increasingly the case that CPU is not the limiting factor - you may have a finite amount of memory, and memory-hungry processes will consume it. 'Spare' memory is used for caching, so filling it all up actually starts to slow things down (because you have a smaller cache). Overspilling the available memory will cause either swapping or the OOM killer.
But as you mention Apache and the web, chances are pretty good that your network pipe is the limiting factor - controlling bandwidth from the local host is actually surprisingly hard.
And then there's disk I/O, which may also be a factor - though I think that's unlikely for a web server, because your outbound network will usually be a tighter limit.
It all depends on what your processes are doing - whether they're lightweight 'helpers' that are mostly idle, or heavyweight 'grinders' that each introduce noticeable load.
So the best answer I can give is a very rough estimate - if your processes are CPU-intensive, cap them at 2 per core. If they are memory-intensive, aim to consume about 50% of your system RAM. If they are I/O-intensive, aim to consume about 50% of your I/O capacity (either network or disk).
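As an illustration of the load-average check, here is a minimal Perl sketch. It assumes a Linux host (where /proc/loadavg exists), and the core count and 2-per-core cap are just the rough figures from above:

use strict;
use warnings;

my $cores    = 20;            # the machine from the question
my $max_load = 2 * $cores;    # rough cap: 2 runnable processes per core

# /proc/loadavg begins with the 1-, 5- and 15-minute load averages.
open my $fh, '<', '/proc/loadavg' or die "Cannot read /proc/loadavg: $!";
my ($load1) = split ' ', scalar <$fh>;
close $fh;

if ($load1 < $max_load) {
    # OK to launch another worker process here
}
else {
    # defer, queue, or reject the request instead of forking
}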
I am running into serious memory issues with my kdb process. Here is the architecture in brief.
The process runs in slave mode (4 slaves). It initially loads a ton of data from the database into memory (the total size of all loaded variables, calculated from -22!, is approx 11G). Initially this matches .Q.w[] and is close to the Unix process memory usage. The data set grows very little during incremental operations. However, after a long operation, although the kdb internal memory stats (.Q.w[]) show the expected memory usage (both used and heap) of ~13G, the process is consuming close to 25G on the system (Unix /proc, top), eventually running out of physical memory.
Now, when I run garbage collection manually (.Q.gc[]), it frees up memory and brings the Unix process usage close to the heap number displayed by .Q.w[].
I am running Q version 2.7 with the -g 1 option to run garbage collection in immediate mode.
Why is the Unix process usage so significantly different from the kdb internal statistics - where is the difference coming from? Why is the "-g 1" option not working? When I run a simple example it works fine, but in this case it seems to leak a lot of memory.
I tried version 2.6, which is supposed to have automated garbage collection. Surprisingly, there is still a huge difference between the used and heap numbers from .Q.w[] when running version 2.6, both in single-threaded (each) and multi-threaded (peach) modes. Any ideas?
I am not sure of the concrete answer, but this is my deduction based on the following information (and some practical experiments) from the wiki:
http://code.kx.com/q/ref/control/#peach
It says:
Memory Usage
Each slave thread has its own heap, a minimum of 64MB.
Since kdb 2.7 2011.09.21, .Q.gc[] in the main thread executes gc in the slave threads too.
Automatic garbage collection within each thread (triggered by a wsfull, or by hitting the artificial heap limit as specified with -w on the command line) is only executed for that particular thread, not across all threads.
Symbols are internalized from a single memory area common to all threads.
My observations:
Thread-specific memory:
.Q.w[] only shows the stats of the main thread, not the sum across all threads (i.e. not the total process memory). This can be tested by starting q with 2 slave threads: per point 1, the total heap should be at least 128 MB, but .Q.w[] still shows 64 MB.
That's why, in your case, the memory stats were close to the Unix stats at the start: all the data was in the main thread and nothing was in the other threads. After some operations, other threads may have taken memory (used/garbage) which is not shown by .Q.w[].
Garbage collector call:
As mentioned on the wiki, calling the garbage collector on the main thread runs GC on all the slave threads too. That is likely what collected the garbage memory from those threads and reduced the total memory usage, which was reflected in the reduced Unix memory stats.
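A quick way to see this behaviour yourself is the following sketch in q (assuming a 2.7+ build started with slave threads):

/ start q with 2 slave threads: q -s 2
.Q.w[]    / heap/used reflect the main thread only, not the slave threads
.Q.gc[]   / since 2.7 2011.09.21, this also runs GC in the slave threads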
I'm running the Astropy tests in parallel using the python setup.py test --parallel N option on my MacBook (4 real cores, solid-state disk), which uses pytest-xdist to run the ~8000 tests in parallel.
I tried different N in the 1 to 10 range, but in all cases I only get speed-ups of roughly 2, whereas I expected speed-ups in the 3 to 4 range (because running the tests should be CPU-limited).
Why are the speedups low, and how can I get good speedups (using multiple cores on one computer)?
Update
I tried the ramdisk suggestion from @Iguananaut:
diskutil erasevolume HFS+ 'ramdisk' $(hdiutil attach -nomount ram://8388608)
mkdir /Volumes/ramdisk/tmp
time python setup.py test -a '--basetemp=/Volumes/ramdisk/tmp' --parallel 8
The speedup is now ~2.2, compared to ~2.0 with the SSD.
Since I have four physical cores, I expected something in the range of 3 to 4.
Maybe the overhead of running the tests in parallel is very large for some reason.
I would suspect the SSD is the limiting factor there. Many of the tests are CPU-bound, but just as many make heavy use of the disk - temp files and the like. Those could perhaps be made even slower by running in parallel. Beyond that it's hard to say much, since it depends on the particulars of your environment. I get a significant speedup running the tests on six cores. Not quite 6x, but it does make a difference.
One thing you might try is making a ramdisk to use as your temp directory. You can do this in OSX with diskutil; you can Google how to do that if you're not sure. Then you should be able to run ./setup.py test -a '--basetemp=path/to/ramdisk'. I haven't actually tried that with the Astropy tests and am not sure how it will work. But if it does work, it will at least help rule out I/O as the bottleneck.
That said, I'm being intentionally wishy-washy as to how much it might help. Even with a ramdisk, your RAM's speed becomes the bottleneck for I/O-bound tests. No matter how many CPUs you have, all the CPU-bound tests could finish instantly and the I/O-bound tests won't be made any faster, so you would still have to wait just as long (or almost as long) for them to finish. This is just Amdahl's law: if, say, half of the total test time is serial or I/O-bound, the best speedup you can hope for on 4 cores is 1 / (0.5 + 0.5/4) = 1.6x. With multiprocessing there's also additional overhead in message passing between the processes - exactly how this is performed depends on a lot of factors, but it's most likely through some shared memory. Anyone reading this also has no way of knowing what other processes are running on your machine that could be contending for the same resources. Even if your system monitor doesn't show anything making heavy use of the CPU, that doesn't mean there aren't processes doing other things that add to some bottleneck.
TL;DR: I wouldn't make much of not getting a speedup directly proportional to the number of cores you throw at it, especially on something like a laptop.
Could anyone point me to what the overhead of running a matlabpool is?
I started a matlabpool:
matlabpool open 132procs 100
Starting matlabpool using the '132procs' configuration ... connected to 100 labs.
And followed the CPU usage on the nodes with:
pdsh -A ps aux | grep dmlworker
When I launch the matlabpool, it starts with ~35% CPU usage on average, and when the pool is not being used it slowly (in 5-7 minutes) goes down to ~2% on average.
Is this normal? What is the typical overhead? Does that change if the matlabpool job is launched as a "batch" job?
This is normal. ps aux reports the average CPU utilization since the process was started, not over a rolling window. This means that, although the workers initialize relatively quickly and then become idle, it will take longer for this to reflect in CPU%. This is different to the Linux top command, for example, which will reflect the utilization since the last screen update in %CPU.
As for typical overhead, this depends on a number of factors: clearly the number of workers; the rate and data size of the jobs submitted (as well as maintaining the worker processes, there is some overhead in marshalling input and output, which is not part of the "useful computation"); whether the Matlab pool is local or attached to a job manager; and the Matlab version and OS.
From experience, as a rough guide on a modern *nix server, I would think an idle worker should not be consuming more than 20% of a single core (e.g. <~1% total CPU utilization on a 16-core box) after initialization, unless there is a configuration issue. I would not expect this to be influenced by the kind of jobs you are submitting (whether using "createJob", "batch", or "parfor", for example): the workers and the communication mechanisms underneath are essentially the same.
Does using the computer for something else while benchmarking (with the Benchmark module) have an influence on the benchmark results?
Yes, it does. The running perl process is subject to the general process-management rules of your OS: the OS process scheduler distributes CPU time amongst all running processes.
There is a way you can influence this distribution - the nice command. It can be used to set a process's priority value, so that the scheduler gives the process more CPU time.
The lower the nice value, the higher the priority and the more CPU time the process will get.
For example, the command nice -n -20 ./benchmark.pl will get almost all of the CPU time (note that negative nice values normally require root privileges).
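For concreteness, here is a minimal sketch of such a benchmark script using the Benchmark module; the two string-building variants compared are just hypothetical examples:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Run each variant for at least 3 CPU-seconds and print a comparison table.
cmpthese(-3, {
    concat => sub { my $s = ''; $s .= 'x' for 1 .. 1_000; },
    join   => sub { my $s = join '', ('x') x 1_000; },
});

Running it as nice -n -20 ./benchmark.pl (as root) should give noticeably more stable timings on a busy machine than running it at the default priority.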