What is the Overhead of matlabpool? - matlab

Could anyone point to me what is the overhead of running a matlabpool ?
I started a matlabpool :
matlabpool open 132procs 100
Starting matlabpool using the '132procs' configuration ... connected to 100 labs.
And followed cpu usage on the nodes as :
pdsh -A ps aux |grep dmlworker
When I launch the matlabpool, it starts with ~35% cpu usage on average and when the pool
is not being used it slowly (in 5-7 minutes) goes down to ~2% on average.
Is this normal ? What is the typical overhead ? Does that change if matlabpooljob is launched as a "batch" job ?

This is normal. ps aux reports the average CPU utilization since the process was started, not over a rolling window. This means that, although the workers initialize relatively quickly and then become idle, it will take longer for this to reflect in CPU%. This is different to the Linux top command, for example, which will reflect the utilization since the last screen update in %CPU.
As for typical overhead, this depends on a number of factors: clearly the number of workers, the rate and data size of jobs submitted (as well as in maintaining the worker processes, there is some overhead in marshalling input and output, which is not part of "useful computation"), whether the Matlab pool is local or attached to a job manager, and the Matlab version and O/S.
From experience, as a rough guide on a modern *nix server, I would think an idle worker should be not be consuming more than 20% of a single core (e.g. <~1% total CPU utilization on a 16-core box) after initilization, unless there is a configuration issue. I should not expect this to be influenced by what kind of jobs you are submitting (whether using "createJob" or "batch" or "parfor" for example): the workers and communication mechanisms underneath are essentially the same.

Related

Throughput vs latency in computer architecture

I've come across articles on "through-put vs latency" in contexts like networking e.g. https://homepage.cs.uri.edu/~thenry/resources/unix_art/ch12s04.html But in the context of computer architecture / operating systems, I'm not able to understand why would there be a trade-off between latency (response time of a program) and through-put (how many programs we're able to complete in a unit of time, say per hour). Is this solely due to the fact that we can choose to parallelize processing of multiple programs / requests leading to overheads like context switches & sharing of caches which make the start-to-end response time per process to be worse? Or am I missing something here?
In terms of single instructions in a superscalar pipelined out-of-order exec CPU, throughput vs. latency is very important because the CPU is trying to extract parallelism from an instruction stream that has to be executed as if in serial program order. See Assembly - How to score a CPU instruction by latency and throughput and the bottom of my answer on latency vs throughput in intel intrinsics for example.
In terms of OS decisions that affect throughput vs. latency on a much longer timescale than a few clock cycles, that's a totally separate question.
One of the major factors there is choosing how to use the available physical RAM, and whether to page out (to a swap file) infrequently used code / data to make more room to cache disk files. (e.g. Linux's vm.swappiness is widely considered a key tunable in terms of setting it differently between servers and desktops. https://unix.stackexchange.com/questions/88693/why-is-swappiness-set-to-60-by-default).
If you alt-tab to a window when many pages of that process have been paged out, it will take some time before the process can redraw its window. (Multiple hard page faults, can be quite slow especially if paging on a rotational disk, not SSD.) So to optimize for latency, you want the kernel to not aggressively swap out pages from running processes, even if they've been idle for a few hours. Those pages, if they'd been free, could have improved throughput for other processes by acting as buffers / cache.
A related factor is I/O scheduling: trying to group IO requests together to minimize HD seek times (for higher throughput and lower average latency), but sometimes at the expense of delaying a few requests for a longer time (higher worst-case latency). Linux for example has many to choose from, including deadline, Completely Fair Queuing (CFQ), and the original elevator (just grouping requests by locality without consideration of fairness or latency). https://wiki.archlinux.org/title/improving_performance#Input/output_schedulers
CPU scheduling is also a factor: a context-switch hurts throughput, as it takes time itself and caches will likely be cold for the new task on this CPU. You also have to run the kernel's schedule() function to decide which task to run next, so that takes away some time from real work.
To minimize latency (for example between a socket message being sent to a process and it waking up when its poll or select system call returns), you want a short timeslice, like Linux HZ=1000. (Timer interrupts every 1 ms to run the scheduler). And you want to be able to pre-empt even the kernel itself, instead of waiting until the kernel is ready to return to the old user-space to consider the possibility of running a different user-space task.
But neither of these helps throughput, and in fact hurt (assuming the workload has enough parallelism to not bottleneck on latency). So HZ=100 was the default for "server" Linux builds, vs. 1000 on "desktop" builds tuned for interactive use. (Modern Linux can be "tickless", not using a fixed timer interrupt on every core at all, instead deciding when to schedule the next interrupt on a case by case basis.)
Real-time kernels take this even further, spending more time on finer-grained locking and stuff like that to enable pausing work and coming back to it later to minimize interrupt latency and other latencies between it being time to do something and actually starting to do that thing. (There are real-time patches for Linux, and there are also totally separate kernels built from the ground up for real-time operation.)
If you have an embedded system controlling a motor or something, you absolutely need hard real-time latency guarantees that it will never take longer than say 1 millisecond from an interrupt pin being asserted to the interrupt handler starting to run.
(Designing the system to make these guarantees possible often comes at the cost of throughput. e.g. obviously you have to pin some memory to make it not swappable, if we're talking about user-space, making it unavailable for cache even if it goes untouched for days.)

PowerShell based Process specific resource monitoring

I am using PowerShell for some benchmarking of Autodesk Revit, and I would like to add some resource monitoring. I know I can get historical CPU utilization and RAM utilization in general, but I would like to be able to poll Process specific CPU and RAM utilization every 5 seconds or so.
In addition, I would love to be able to poll how many cores a process is currently using, the clock speed of those specific cores, and the frame rate of the screen that process is currently displayed on.
Are those things even accessible via PowerShell/.NET? Or is that low level stuff I just can't get to with PS?

Multi core machine - cpu load metric

In a multi core machine what is the best metric to understand whether cpu is loaded or not ?
I have a web application that sends a post request to apache CGI server. CGI server loops over the post data and launches perl process for each of the item in the loop. Since requests from clients ends up hitting a single endpoint, I am concerned if I end up creating lots of processes which my server can't handle. Hence I wanted to understand what system metric should I check before launching a new process from loop.
Note: I have a 20 core machine.
The reason the answer isn't easy to find, is that it depends on the nature of your processes, and which system constraint is your limiting factor.
For CPU intensive work, then the metric to look at is load average - load average is a measure of processes in a runnable state - very roughly if LA is the same as number of cores, then you're running your CPUs at maximum.
However, it's increasingly the case that CPU is not the limiting factor - you may have a finite amount of memory, and memory hungry processes will consume it. 'spare' memory is used for caching, so filling the whole lot up actually starts to slow things down (because you have a smaller cache). Over spilling the available will either cause swapping or OOMkiller.
But as you mention apache and web, then chances are pretty good that your network pipe is a limiting factor - controlling bandwidth from the local host is actually surprisingly hard.
And then there's disk IO - which may also be a factor - I think that's unlikely for a web server, because your outbound network will usually be a tighter limit.
It all depends what your processes are doing - if they're lightweight 'helpers' that are mostly idle, or heavyweight 'grinders' that all introduce noticeable load.
So the best answer I can give is a very vague estimate - if your processes are CPU intensive, cap them at 2 per core. If your processes are memory, aim to consume about 50% of your system RAM. If your processes are IO intensive, aim to consume about 50% of your IO (either network or disk).

Weird cgroups behaviour ( cpu shares

I have a cpu hungry process A which is taking too much of cpu load(100%) which causes process B to take not enough cycles...B is related to web response...so when i did a benchmark of web response with both the processes without cgroups, the result was 5 seconds...now when i create two groups and give both the processes equal amount of cpu.shares the time taken increases to 15 seconds.
i am getting good results with a high share ratio of cpu to the process wich has to be given more priority...but really curious about this weird behaviour at default values...
Why would the response time increase with the default share values of 1024 to both the groups, shouldn't it be the same as without cgroups ???
Now when i put both the processes in the same group, the response again goes back to 5 seconds...
Is it something related to the scheduler...
[ If you have cpuacct cgroup mounted with cpu, you can look at the usage numbers to check if both cgroups are getting equal shares. ]
My guess is that your setup has some processes running outside cgroups. When you move some of the processes under cgroups, the ones still outside will get higher cpu shares (in total) than two cgroups. (Each top-level process gets an equivalent of 1024 shares). To achieve what you want, all processes should be under some cgroup.

Does the use of the computer while benchmarking have an influence on the benchmark results?

Does using the computer for something else while benchmarking ( with the Benchmark module ) have an influence on the benchmark results?
Yes, it does. This running perl process complies with general process management rules your OS does. OS process scheduler will distribute CPU time amongst all running processes.
There is a way you can influence this distribution - nice command. It can be used to set process priority value, so the scheduler can give such process more CPU time.
The lesser nice priority value, the more CPU time the process will get.
For exmaple command nice -n -20 ./benchmark.pl will get almost all CPU time