Plesk: health monitoring thresholds per core or total?

Plesk has built-in health monitoring that lets you configure alarm thresholds for automatic notifications. Most of these thresholds are percentage-based, flagging a notification if memory or CPU usage gets too high.
I'm having trouble determining how these percentages are measured. Memory is easy (we're dealing with a fixed figure), but CPU usage is more complicated on multi-processor servers.
CPU Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1600.000
BogoMIPS: 6784.52
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Now, am I right in thinking that if a single core hits 90%, this triggers the alarm in Health Monitoring?
Most of my thresholds are set at 80% = yellow and 90% = red.
It's pretty much always on red, and I believe this is because the server is multi-core while the health tool is working from a single core.
If I run top and press Shift+I, I get the overall CPU percentage, and it's nowhere near the total percentage the health monitor is showing me.
I don't know if I've picked up false information or been misguided, but maybe someone can help steer me in the right direction, or at least shine a little light on it.
:)
Thanks
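(As a cross-check, not how Plesk itself measures: the sketch below samples /proc/stat twice and prints both the aggregate and the per-core utilization, similar to what top shows. The one-second sample interval is an arbitrary choice.)

# Cross-check sketch: sample /proc/stat twice and print overall and per-CPU usage.
import time

def read_cpu_times():
    """Return {cpu_label: (busy_ticks, total_ticks)} from /proc/stat."""
    stats = {}
    with open("/proc/stat") as f:
        for line in f:
            if not line.startswith("cpu"):
                break
            label, *fields = line.split()
            fields = list(map(int, fields))
            idle = fields[3] + fields[4]          # idle + iowait
            total = sum(fields)
            stats[label] = (total - idle, total)
    return stats

before = read_cpu_times()
time.sleep(1)
after = read_cpu_times()
for label in before:
    busy = after[label][0] - before[label][0]
    total = after[label][1] - before[label][1]
    print(f"{label:6s} {100.0 * busy / total:5.1f}%")   # 'cpu' = all cores combined, 'cpuN' = one core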

After a lot of posts, troubleshooting and pestering, I finally found the answer.
In short: yes, Plesk Health Monitoring does not really account for anything beyond a single core.
So in my case, with 6 cores, when one core flicks up to 80% for a second, the alarm is triggered.
But if you work on an average across all the cores, overall CPU doesn't even hit 12%.
I asked this over in the official Plesk forum and failed to get a response.
I also asked several server companies and Plesk partners; one did respond and said it's a known issue that causes a lot of headaches.
You can increase the load-average window from 1 minute to 15 minutes. This reduces the alerts a lot, or at least it has in my case:
1-minute window, CPU hitting 80% = alarm.
15-minute window, CPU hits 80% for 30 seconds and the 15-minute average works out to around 20% in my case = no alarm. :)
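(To illustrate the effect of the averaging window, here is a minimal sketch; the idle level, spike length and threshold are made-up numbers, not measurements from this server.)

# Toy arithmetic: the same spike looks very different through different averaging windows.
THRESHOLD = 80.0

def window_average(samples_pct, window_s):
    """Mean CPU % over the last `window_s` one-second samples."""
    recent = samples_pct[-window_s:]
    return sum(recent) / len(recent)

# 15 minutes of 1-second samples: mostly idle, then a 60-second burst at 80%
samples = [5.0] * (14 * 60) + [80.0] * 60

for window_s in (60, 15 * 60):
    avg = window_average(samples, window_s)
    verdict = "alarm" if avg >= THRESHOLD else "no alarm"
    print(f"{window_s // 60}-minute average: {avg:.1f}% -> {verdict}")
# 1-minute average: 80.0% -> alarm
# 15-minute average: 10.0% -> no alarm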

Related

How are CPU resource units (millicore/millicpu) calculated under the hood?

Let's take this processor as an example: a CPU with 2 cores and 4 threads (2 threads per core).
From what I've read, such a CPU has 2 physical cores but can process 4 threads simultaneously through hyper-threading. In reality, one physical core can only truly run one thread at a time; with hyper-threading, the CPU exploits idle stages in the pipeline to process another thread.
Now, enter Kubernetes with Prometheus and Grafana and their CPU resource unit, the millicore/millicpu: a core is virtually sliced into 1000 millicores.
Taking hyper-threading into account, I can't understand how those millicores are calculated under the hood.
How can a process use, for example, 100 millicores (a tenth of a core)? How is this technically possible?
PS: I accidentally found a really descriptive explanation here: Multi threading with Millicores in Kubernetes
This gets very complicated. Kubernetes doesn't actually manage this itself; it just provides a layer on top of the underlying container runtime (Docker, containerd, etc.). When you configure a container to use 100 millicores, Kubernetes hands that down to the container runtime and the runtime deals with it. Once you go down to this level you have to start looking at the Linux kernel and how it does CPU scheduling and rate limiting with cgroups, which becomes incredibly interesting and complicated.
In a nutshell: the Linux CFS bandwidth control is the thing that manages how much CPU a process (container) can use. By setting the quota and period parameters for the scheduler, you control how much CPU is used, by controlling how long a process can run before being paused and how often it gets to run. As you correctly identify, you can't use only a tenth of a core, but you can use only a tenth of the time, and by doing that you only use a tenth of the core over time.
For example:
If I set the quota to 250ms and the period to 250ms, that tells the kernel this cgroup can use 250ms of CPU cycle time every 250ms, which means it can use 100% of a CPU.
If I set the quota to 500ms and keep the period at 250ms, that tells the kernel this cgroup can use 500ms of CPU cycle time every 250ms, which means it can use 200% of a CPU (2 cores).
If I set the quota to 125ms and keep the period at 250ms, that tells the kernel this cgroup can use 125ms of CPU cycle time every 250ms, which means it can use 50% of a CPU.
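To make that concrete, here is a minimal sketch that caps a process at roughly 100 millicores by writing the CFS quota and period directly. It assumes cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu and root privileges; on cgroup v2 the equivalent knob is the single cpu.max file. The "demo" cgroup name is made up for illustration, and the 100ms period matches the usual container-runtime default.

# Sketch: limit the current process to ~10% of one core (about "100m" in Kubernetes terms).
import os

CGROUP = "/sys/fs/cgroup/cpu/demo"     # hypothetical cgroup (cgroup v1 layout assumed)
PERIOD_US = 100_000                    # 100 ms scheduling period
QUOTA_US = 10_000                      # 10 ms of CPU time allowed per period = 0.1 CPU

os.makedirs(CGROUP, exist_ok=True)     # creating the directory creates the cgroup

with open(os.path.join(CGROUP, "cpu.cfs_period_us"), "w") as f:
    f.write(str(PERIOD_US))
with open(os.path.join(CGROUP, "cpu.cfs_quota_us"), "w") as f:
    f.write(str(QUOTA_US))

# Move this process into the cgroup; from now on it can run for at most 10 ms out of
# every 100 ms, i.e. a tenth of a core averaged over time, even though each burst
# still runs on a full core at full speed.
with open(os.path.join(CGROUP, "tasks"), "w") as f:
    f.write(str(os.getpid()))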
This is a very brief explanation. Here is some further reading:
https://blog.krybot.com/a?ID=00750-cfae57ed-c7dd-45a2-9dfa-09d42b7bd2d7
https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html

Using a cluster of Raspberry Pi 4 as a cluster for number crunching?

So I am currently developing an algorithm in MATLAB that is computationally expensive but parallel-processing friendly. Given that, I have been using the parallel processing library, but I am still falling short of my computation-time goals.
I am currently running my algorithm on an Intel i7 8086k CPU (6 cores, 12 logical, @ 4.00 GHz, turbo 5 GHz).
Here are my questions:
If I were to purchase, let's say, 10 Raspberry Pi 4 SBCs (4 cores @ 1.5 GHz), could I use my main desktop as the host and the Pis as the clients? (Let's assume I migrate my algorithm to C++ and run it on Ubuntu for now.)
1a. If I were to go through with the build in question 1, would there be a significant upgrade in computation for the ~$500 spent?
1b. If I am not able to use my desktop as the host (I believe this shouldn't be an issue), how many Raspberry Pis would I need to equal my current CPU, or how many would I need to make it advantageous to work on a Pi cluster vs. my computer?
Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Thanks for your help; any other advice and recommendations are welcome.
Does your algorithm bottleneck on raw FMA / FLOPS throughput? If so, then a cluster of weak ARM cores is more trouble than it's worth. I'd expect a used Zen 2 machine, or maybe Haswell or Broadwell, could be good if you can find one cheaply. (You'd have to look at core counts, clocks, and FLOPS/$, and whether the problem would still not be memory-bottlenecked on an older system with less memory bandwidth.)
If you bottleneck instead on cache misses from memory bandwidth or latency (e.g. a cache-unfriendly data layout), there might be something to gain from having more, weaker CPUs, each with their own memory controller and cache, even if those caches are smaller than your Intel's.
Does Matlab use your GPU at all (e.g. via OpenCL)? Your current CPU's peak double-precision (FP64) throughput from the IA cores is about 384 GFLOPS, while its integrated GPU is capable of about 115.2 GFLOPS. For single precision it's 460.8 GFLOPS from the GPU vs. 768 GFLOPS from your x86 cores. Again, these are theoretical max throughput figures, running 2x 256-bit SIMD FMA instructions per clock cycle per core on the CPU.
Upgrading to a beefy GPU could be vastly more effective than a cluster of RPi4. e.g. https://en.wikipedia.org/wiki/FLOPS#Hardware_costs shows that cost per single-precision GFLOP in 2017 was about 5 cents, adding big GPUs to a cheapo CPU. Or 79 cents per double-precision GFLOP.
If your problem is GPU-friendly but Matlab hasn't been using your GPU, look into that. Maybe Matlab has options, or you could use OpenCL from C++.
will there be a significant upgrade in computation for the ~$500 spent?
RPi4 model B has a Broadcom BCM2711 SoC. The CPU is Cortex-A72.
Their cache hierarchy is 32 KB data + 48 KB instruction L1 cache per core, and 1 MB of shared L2 cache. That's weaker than your 4 GHz i7 with 32k L1d + 256k L2 private per core and a shared 12 MiB L3 cache. But faster cores waste more cycles in the same absolute time waiting for a cache miss, and the ARM chips run their DRAM at a competitive DDR4-2400.
RPi CPUs are not FP powerhouses. There's a large gap in the raw numbers, but with enough of them the throughput does add up.
https://en.wikipedia.org/wiki/FLOPS#FLOPs_per_cycle_for_various_processors shows that Cortex-A72 has peak FPU throughput of 2 double FLOPS per core per cycle, vs. 16 for Intel since Haswell, AMD since Zen2.
Dropping to single precision float improves x86 by a factor of 2, but A72 by a factor of 4. Apparently their SIMD units have lower throughput for FP64 instructions, as well as half the work per SIMD vector. (Some other ARM cores aren't extra slow for double, just the expected 2:1, like Cortex-A57 and A76.)
But all this is peak FLOPS throughput; coming close to that in real code is only achieved with well-tuned code with good computational intensity (lots of work each time the data is loaded into cache, and/or into registers). e.g. a dense matrix multiply is the classic example: O(n^3) FPU work over O(n^2) data, in a way that makes cache-blocking possible. Or Prime95 is another example.
Still, a rough back of the envelope calculation, being generous and assuming sustained non-turbo clocks for the Coffee Lake. (All 6 cores busy running 2x 256-bit FMA instructions per clock makes a lot of heat. That's literally what Prime95 does, so expect that level of power consumption if your code is that efficient.)
6 cores * 4 GHz * 2x 256-bit FMA instructions per clock * 4 doubles per vector * 2 FLOPs per FMA = 384 GFLOP/sec peak FP64 on the CFL.
4 cores * 1.5 GHz * 2 DP FLOPs per clock = 12 GFLOP/sec per RPi.
With 5x RPi systems, that's 60 GFLOPS added to your existing 384 GFLOP/sec.
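The same back-of-the-envelope arithmetic as a tiny script, using the core counts, clock speeds and FLOPs-per-cycle figures assumed above (theoretical peaks, not what real code will reach):

def peak_gflops(cores, ghz, flops_per_cycle_per_core):
    """Theoretical peak = cores * clock (GHz) * FLOPs issued per core per cycle."""
    return cores * ghz * flops_per_cycle_per_core

# Coffee Lake i7-8086K: 2x 256-bit FMA/clock = 8 doubles * 2 FLOPs = 16 FP64 FLOP/cycle/core
cfl = peak_gflops(cores=6, ghz=4.0, flops_per_cycle_per_core=16)   # 384.0 GFLOP/s FP64
# Cortex-A72 (RPi 4): ~2 FP64 FLOPs per core per cycle (see the FLOPs-per-cycle table above)
rpi = peak_gflops(cores=4, ghz=1.5, flops_per_cycle_per_core=2)    # 12.0 GFLOP/s FP64 per board

print(f"CFL peak: {cfl:.0f} GFLOP/s FP64; one RPi 4: {rpi:.0f}; five RPi 4s: {5 * rpi:.0f}")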
Doesn't sound worth the trouble of managing 5 RPi systems for a small fraction of your existing total FP throughput. But again, if your problem has the right kind of parallelism, a GPU can run it much more efficiently: 60 GFLOPS for $500 is not a good deal compared to roughly $50 per 60 double-precision GFLOPS from a high-end (in 2017) video card.
The GPU in an RPi might have some compute capability, but it's almost certainly not worth it compared to slapping a $500 discrete GPU into your existing machine if your code is GPU-friendly.
Or your problem might not scale with theoretical max FLOPS, but instead perhaps with cache bandwidth or some other factor.
Is it possible to run Windows on the host computer and linux on the clients(Pis) so that I continue using MATLAB?
Zero clue; I'm only considering theoretical best case for efficient machine code running on these CPUs.

What is the maximum memory per worker that MATLAB can address?

Short version: Is there a maximum amount of RAM / worker, that MATLAB can address?
Long version: My wife uses MATLAB's parallel processing capabilities in data-heavy spatial analyses (I don't really understand it, I just know how to build computers that make her work quicker) and I would like to build a new computer so she can radically reduce her process times.
I am leaning toward something in the 10-16 core range, since prices on such processors seem to be dropping with each generation, and I would like to use 128 GB of RAM, because why not, if you can stomach the cost and see some meaningful time savings.
The number of cores I shoot for will depend on the maximum amount of RAM that MATLAB can address for each worker (if such a limit exists). The computer I built for similar work in 2013 has 4 physical cores (Intel i7-3770k) and 32 GB RAM (which she maxed out), and whatever I build next, I would like to have at least the same memory/core. With 128 GB of RAM a given, 10 cores would be 12.8 GB/core, 12 cores would be ~10.5 GB/core and 16 cores would be 8 GB/core. I am inclined to maximize cores rather than memory, but since she doesn't know what will benefit her processes the most, I would like to know how realistic those three options are. As for your next question, she has an nVidia GPU capable of parallel processing, but she believes her processes would not benefit from its CUDA cores.
Thank you for your insights. Many, many Google searches did not yield an answer.

Amazon RDS: strange CPU Credit Balance metrics

We have an RDS PostgreSQL instance of type db.t2.small, and we're seeing something strange with the CPU credit balance metrics.
CPU credit usage is not growing, but the balance has dropped to zero. Does anybody know what the problem could be? (The RDS instance is working fine without any issues.)
I am seeing the same behavior with my t2.micro free-tier RDS instance. My hypothesis right now is that the maintenance window is when the instance gets rebooted or hot-swapped, resulting in a new instance with the default baseline number of credits. That makes Saturday night more appealing than Sunday night as a window, so that credits have re-accumulated by the next business day.
From the documentation, it looks like CPU credits expire 24 hours after being earned.
CPUCreditUsage
[T2 instances] The number of CPU credits consumed by the instance. One CPU credit equals one vCPU running at 100% utilization for one minute or an equivalent combination of vCPUs, utilization, and time (for example, one vCPU running at 50% utilization for two minutes or two vCPUs running at 25% utilization for two minutes).
CPU credit metrics are available only at a 5 minute frequency. If you specify a period greater than five minutes, use the Sum statistic instead of the Average statistic.
Units: Count

CPUCreditBalance
[T2 instances] The number of CPU credits available for the instance to burst beyond its base CPU utilization. Credits are stored in the credit balance after they are earned and removed from the credit balance after they expire. Credits expire 24 hours after they are earned.
CPU credit metrics are available only at a 5 minute frequency.
Units: Count
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html
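As a rough illustration of how those two metrics interact, here is a toy simulation of a single-vCPU burstable instance. The earn rate of 12 credits per hour is an assumption for db.t2.small (check the current AWS documentation for your instance class); the rest follows the definitions above, with credit expiry approximated as a cap equal to 24 hours of earnings.

# Illustrative sketch only: how a burstable instance's credit balance drifts over time.
EARN_PER_HOUR = 12.0                      # assumed earn rate for db.t2.small (verify in AWS docs)
MAX_BALANCE = 24 * EARN_PER_HOUR          # credits expire 24 hours after being earned

def step_5min(balance, avg_cpu_pct, vcpus=1):
    """One 5-minute CloudWatch period: earn at the baseline rate, spend 1 credit
    per vCPU-minute of CPU used at 100% (per the CPUCreditUsage definition)."""
    earned = EARN_PER_HOUR / 12.0                 # 5 minutes = 1/12 of an hour
    used = avg_cpu_pct / 100.0 * vcpus * 5.0      # vCPU-minutes burned in this period
    new_balance = min(MAX_BALANCE, max(0.0, balance + earned - used))
    return new_balance, used

balance = 0.0
for _ in range(12 * 24):                          # simulate a quiet day at 5% average CPU
    balance, _ = step_5min(balance, avg_cpu_pct=5.0)
print(round(balance, 1))                          # 216.0: climbs toward (but below) the 288-credit cap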

same program is much slower on a supposedly better machine

When running the same application on two different machines, I see one is much slower but it ought to be the faster of the two. This is a compute bound application with a thread pool. The threads do not communicate with each other nor externally. The application reads from disk at the beginning (for a fraction of a second) and writes to disk at the end (for a fraction of a second).
The program repeatedly runs a simulation on a deterministically changing set of inputs. Since the inputs are identical, the outputs can be compared, and they are in fact identical; the only difference is the elapsed time. There is one object that I recall is "shared" in the sense that all threads read from it, but my recollection is that it is strictly read-only. The threaded work is homogeneous.
Dual machine: 2 core / 4 thread machine, 2.53 GHz, 3MB cache, 8GB RAM, passmark.com benchmark is approximately 2100, my application's thread pool size set to 4, JVM memory high water mark was 2.8 GB, elapsed time is 47 minutes
Quad machine: 4 core / 8 thread machine, 2.2 GHz to 3.1 GHz, 6MB cache, 8GB RAM, passmark.com benchmark is approximately 6000, my application's thread pool size set to 8, JVM memory high water mark was 2.8GB, elapsed time 164 minutes
Another comparison:
Dual machine: thread pool size set to 2, elapsed time 98 minutes* (could be less; see the footnote below).
Quad machine: thread pool size set to 2, elapsed time 167 minutes
*Probably should be less than 98 minutes since I was also playing an audio file. This means the anomaly is worse than this result makes it appear.
The jvisualvm profiles seem similar, but due to what appear to be profiler glitches I haven't gotten much use out of them. I'm looking for suggestions on where to look.
Both machines run Ubuntu 14.04.3 and Java 8.
The answer is: collect more data and draw conclusions from it. When comparing these two systems some conclusions can be drawn, but they might not extend to the chipsets or the processors in general.
Reviewing the data in the original posting and the additional measurements, it appears that for small data sets not only does the quad system's hyper-threading fail to significantly improve throughput, but even going beyond 2 threads on a 4-core device does not improve throughput per unit of time, at least with these particular homogeneous workloads. For large data sets it appears that hyper-threading reduces throughput per unit of time; note the 2933-second result compared to an average of 1883 seconds (the mean of 2032 and 1734).
The dual core hyperthreading is amazingly good, scaling well across the thread pool size dimension. The dual core also scaled well across the data set size dimension.
All measurements are elapsed times. Other means can be inferred, for example 2032 and 1734 can be averaged.
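A minimal, self-contained way to collect that kind of data on each machine is to sweep the pool size over a homogeneous compute-bound task and record the elapsed time. The sketch below is not the original Java program; the burn workload is made up, and Python is used here just to show the measurement loop.

# Scaling probe: elapsed time vs. worker count for a homogeneous compute-bound task.
import time
from concurrent.futures import ProcessPoolExecutor

def burn(n=2_000_000):
    """Compute-bound stand-in task (no I/O, no shared state)."""
    s = 0
    for i in range(n):
        s += i * i
    return s

if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        t0 = time.perf_counter()
        with ProcessPoolExecutor(max_workers=workers) as pool:
            list(pool.map(burn, [2_000_000] * 32))       # 32 identical work items
        print(f"{workers} workers: {time.perf_counter() - t0:.2f}s")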