Understand CPU utilisation with image preprocessing applications

I'm trying to understand how to compute the CPU utilisation for audio and video use cases.
In real time audio applications, this is what I typically do:
If an application takes 4 ms to process 28 ms of audio data, I say that the CPU utilisation is 14.29% (4/28).
How should this be done for applications like resize/crop? Let's say I'm resizing an image from 162*122 to 128*128 at 1 FPS, and it takes 11 ms. What would the CPU utilisation be?

CPU utilization is quite complicated, and strongly depends on stuff like:
The CPU itself
The algorithms utilized for the task
Other tasks running on the CPU alongside yours
CPU utilization is also strongly related to the process scheduling of your PC, and hence to the operating system used. Most operating systems expose some kind of API for CPU utilization diagnostics, but such APIs are highly platform-dependent.
But how do CPU utilization calculations work anyway?
The simplest way to calculate CPU utilization is to take a period of, for example, 1 second, observe how long the CPU was busy during it (executing processes rather than idling), and divide that by the length of the period. For example, if the CPU did useful calculations for 10 milliseconds while you were observing for 500 ms, the CPU utilization is 2%.
Answering your question / TL;DR
You can apply this principle in your program. For the case you provided (processing video), this can be done in more or less the same way: measure how long it takes to process one frame, and divide that by the length of a frame (1 / FPS). In your example, 11 ms of processing per 1000 ms frame period gives 1.1%. Of course, this can be done over a longer period of time to get a more accurate reading: track how much time it takes to process, for example, 2 seconds of video, and divide that by 2. Then you'll have your CPU utilization.
NOTE: if you aren't able to process a frame in time, for example your video is 10 FPS (0.1 s per frame) and processing one frame takes 0.5 s, then your CPU utilization will seemingly be 500%. Obviously you can't utilize more than 100% of your CPU, so you should just cap the CPU utilization at 100%.
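The per-frame calculation described above can be sketched in Python as follows. This is only an illustration: the names `frame_period_ms` and `frame_utilisation` are made up here, the audio and resize numbers are taken from the question, and the overrun case assumes a frame that takes 500 ms to process at 10 FPS.

```python
def frame_period_ms(fps):
    """Length of one frame in milliseconds (1 / FPS)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def frame_utilisation(processing_ms, period_ms):
    """Return (utilisation in percent, capped at 100, and an overrun flag)."""
    if period_ms <= 0:
        raise ValueError("period_ms must be positive")
    raw = 100.0 * processing_ms / period_ms
    return min(raw, 100.0), raw > 100.0


# 4 ms of work per 28 ms of audio (from the question)
audio, _ = frame_utilisation(4, 28)
# 11 ms to resize one image at 1 FPS (from the question)
resize, _ = frame_utilisation(11, frame_period_ms(1))
# Assumed overrun case: 500 ms of work per frame at 10 FPS
late, overrun = frame_utilisation(500, frame_period_ms(10))

print(f"audio:  {audio:.2f}%")    # audio:  14.29%
print(f"resize: {resize:.2f}%")   # resize: 1.10%
print(f"late:   {late:.2f}% (overrun={overrun})")  # late:   100.00% (overrun=True)
```

Returning the overrun flag alongside the capped value keeps the information that frames are being missed, which a bare 100% would hide.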

Related

Is there a way to set worker weight?

I have two machines for load testing. One machine has worse CPU performance, and it reaches high CPU usage as the number of users keeps increasing, while the other machine still has low CPU usage. Locust complains:
[2022-07-28 11:22:15,529] PF1YW96X-MUO/WARNING/root: CPU usage above 90%! This may constrain your throughput and may even give inconsistent response time measurements! See https://docs.locust.io/en/stable/running-locust-distributed.html for how to distribute the load over multiple CPU cores or machines
[2022-07-28 11:25:06,766] PF1YW96X-MUO/WARNING/locust.runners: CPU usage was too high at some point during the test! See https://docs.locust.io/en/stable/running-distributed.html for how to distribute the load over multiple CPU cores or machines
I want to set a lower weight for the machine that has worse CPU performance. Is there a way to do that?

You can run fewer worker processes on the weak machine. If necessary you could run more than one process per core on the strong machine, just to make it take more Users.

How are CPU resource units (millicore/millicpu) calculated under the hood?

Let's take this processor as an example: a CPU with 2 cores and 4 threads (2 threads per core).
From what I've read, such a CPU has 2 physical cores but can process 4 threads simultaneously through hyper-threading. In reality, though, one physical core can only truly run one thread at a time; with hyper-threading, the CPU exploits the idle stages in the pipeline to process another thread.
Now, here is Kubernetes with Prometheus and Grafana and their CPU resource unit of measurement: millicore/millicpu. So, they virtually slice a core into 1000 millicores.
Taking hyper-threading into account, I can't understand how they calculate those millicores under the hood.
How can a process, for example, use 100 millicores (a 10th of a core)? How is this technically possible?
PS: I accidentally found a really descriptive explanation here: Multi threading with Millicores in Kubernetes

This gets very complicated. Kubernetes doesn't actually manage this itself; it just provides a layer on top of the underlying container runtime (Docker, containerd, etc.). When you configure a container to use 100 millicores, Kubernetes hands that down to the container runtime, and the runtime deals with it. Once you go to this level, you have to start looking at the Linux kernel and how it does CPU scheduling and rate limiting with cgroups, which becomes incredibly interesting and complicated.
In a nutshell, though: Linux CFS Bandwidth Control is the thing that manages how much CPU a process (container) can use. By setting the quota and period parameters for the scheduler, you control how much CPU is used by controlling how long a process can run before being paused and how often it runs. As you correctly identify, you can't use only a 10th of a core at any given instant. But you can use the core for a 10th of the time, and by doing that you use only a 10th of the core over time.
For example:
If I set the quota to 250ms and the period to 250ms, that tells the kernel that this cgroup can use 250ms of CPU time every 250ms, which means it can use 100% of a CPU.
If I set the quota to 500ms and keep the period at 250ms, that tells the kernel that this cgroup can use 500ms of CPU time every 250ms, which means it can use 200% of a CPU (2 cores).
If I set the quota to 125ms and keep the period at 250ms, that tells the kernel that this cgroup can use 125ms of CPU time every 250ms, which means it can use 50% of a CPU.
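The arithmetic behind these three examples can be written out as a small Python sketch. The helper names `cpu_share` and `millicores_to_quota_us` are hypothetical, and the 100 ms default period used for the millicore conversion is an assumption based on the kernel's documented default for CFS bandwidth control; check your own runtime before relying on it.

```python
def cpu_share(quota_us, period_us):
    """CPUs' worth of run time the cgroup may use per period (1.0 = one full CPU)."""
    if period_us <= 0:
        raise ValueError("period_us must be positive")
    return quota_us / period_us


def millicores_to_quota_us(millicores, period_us=100_000):
    """Quota in microseconds for a millicore limit (assumes a 100 ms period by default)."""
    return millicores * period_us // 1000


# The three quota/period pairs from the examples above, in microseconds.
period_us = 250_000
for quota_us in (250_000, 500_000, 125_000):
    share = cpu_share(quota_us, period_us)
    print(f"quota={quota_us // 1000}ms period={period_us // 1000}ms -> {share:.0%} of a CPU")

# A 100 millicore limit: 10 ms of run time in every 100 ms period.
print(millicores_to_quota_us(100))  # 10000
```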
This is a very brief explanation. Here is some further reading:
https://blog.krybot.com/a?ID=00750-cfae57ed-c7dd-45a2-9dfa-09d42b7bd2d7
https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html

Set cpu requests in K8s for fluctuating load

I have a service deployed in Kubernetes and I am trying to optimize the requested cpu resources.
For now, I have deployed 10 instances and set spec.containers[].resources.limits.cpu to 0.1, based on the "average" use. However, it became obvious that this average is rather useless in practice because under constant load, the usage increases significantly (to 0.3-0.4 as far as I can tell).
What happens consequently, when multiple instances are deployed on the same node, is that this node is heavily overloaded; pods are no longer responsive, are killed and restarted etc.
What is the best practice to find a good value? My current best guess is to increase the requested cpu to 0.3 or 0.4; I'm looking at Grafana visualizations and see that the pods on the heavily loaded node(s) converge there under continuous load.
However, how can I know if they would use more load if they could before becoming unresponsive as the node is overloaded?
I'm actually trying to understand how to approach this in general. I would expect an "ideal" service (presuming it is CPU-focused) to use close to 0.0 when there is no load, and close to 1.0 when requests are constantly coming in. With that assumption, should I set the cpu.requests to 1.0, taking a perspective where actual constant usage is assumed?
I have read some Kubernetes best practice guides, but none of them seem to address how to set the actual value for cpu requests in practice in more depth than "find an average".

Basically come up with a number that is your lower acceptable bound for how much the process runs. Setting a request of 100m means that you are okay with a lower limit of your process running 0.1 seconds for every 1 second of wall time (roughly). Normally that should be some kind of average utilization, usually something like a P99 or P95 value over several days or weeks. Personally I usually look at a chart of P99, P80, and P50 (median) over 30 days and use that to decide on a value.
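As a rough illustration of picking a request from usage percentiles, here is a minimal Python sketch using the nearest-rank method. The `percentile` helper and the sample values in `usage_millicores` are made up for this example; in practice the samples would come from your monitoring system over days or weeks.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical CPU usage samples for one pod, in millicores.
usage_millicores = [80, 90, 95, 100, 110, 120, 150, 200, 350, 400]

for p in (50, 80, 99):
    print(f"P{p}: {percentile(usage_millicores, p)}m")
# P50: 110m
# P80: 200m
# P99: 400m
```

With so few samples P99 is simply the maximum; over a 30-day window the percentiles separate more meaningfully.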
Limits are a different beast: they set your CPU timeslice quota. This subsystem in Linux has some persistent bugs, so unless you've specifically vetted your kernel as correct, I don't recommend using it for anything but the most hostile of programs.

In a nutshell: Main goal is to understand how much traffic a pod can handle and how much resource it consumes to do so.
CPU limits are hard to understand and can be harmful; you might want to avoid them. See the static policy documentation and the relevant GitHub issue.
To dimension your CPU requests, you will first want to understand how much a pod can consume during high load. In order to do this you can:
disable all kinds of autoscaling (HPA, vertical pod autoscaler, ...)
set the number of replicas to one
lift the CPU limits
request the highest amount of CPU you can on a node (usually 3.2 on 4-CPU nodes)
send as much traffic as you can to the application (you can achieve simple load test scenarios with Locust, for example)
You will eventually end up with a ratio clients-or-requests-per-sec/cpu-consumed. You can suppose the relation is linear (this might not be true if your workload complexity is O(n^2) with n the number of clients connected, but this is not the nominal case).
You can then choose the pod resource requests based on the ratio you measured. For example, if you consume 1.2 CPU for 1000 requests per second, that is about 833 requests per second per CPU, so you can give each pod 1 CPU and expect it to handle roughly 800 requests per second.
Once you know how much a pod can consume under its maximal load, you can start setting up CPU-based autoscaling. 70% is a good first target that can be refined if you encounter issues like latency or pods not autoscaling fast enough. This will keep your nodes from running out of CPU if the load increases.
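The sizing arithmetic can be sketched in Python as below, assuming the relation between load and CPU is linear as stated above. The function names are hypothetical, the 5000 requests per second target is an invented figure, and treating the 70% autoscaling target as the usable fraction of each pod's capacity is a simplification rather than a description of how the autoscaler works internally.

```python
import math


def rps_per_cpu(measured_rps, measured_cpu):
    """Requests per second handled per CPU consumed, from a load test."""
    return measured_rps / measured_cpu


def pod_capacity_rps(ratio, pod_cpu_request):
    """Requests per second one pod can handle with the given CPU request."""
    return ratio * pod_cpu_request


def replicas_needed(target_rps, capacity_per_pod, target_utilisation=0.7):
    """Replicas needed so that each pod stays at or below the target utilisation."""
    return math.ceil(target_rps / (capacity_per_pod * target_utilisation))


ratio = rps_per_cpu(1000, 1.2)            # measurement from the example above
capacity = pod_capacity_rps(ratio, 1.0)   # one pod with a 1 CPU request
print(f"{capacity:.0f} requests/s per pod")  # 833 requests/s per pod

# Hypothetical target load of 5000 requests/s at a 70% utilisation target.
print(replicas_needed(5000, capacity))  # 9
```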
There are a few gotchas; for example, single-threaded applications are not able to consume more than one CPU. Thus if you give such an application 1.5 CPU, it can saturate its single core, but you won't be able to see that in the metrics, because it will look as if it could still consume another 0.5 CPU.

About CPU operation and I/O processing

My question is why do we want to have CPU's operation overlap with that of the I/O processing. I have been thinking about optimization and such but yet to arrive at a conclusion.
If anyone is able to answer this question, it will be great. :D

I/O is generally very slow compared to the operating frequency of the CPU.
Suppose you have a 1GHz CPU that's capable of executing one instruction every clock cycle. That means the CPU is able to execute one instruction every nanosecond.
Now let's assume you want to fetch some data from your hard drive. Disk operations often take place on the millisecond scale, and we'll assume your drives are fast enough to fetch the data in only 1ms.
If the CPU just sits around and waits for the disk to fetch the data, it will waste 1 million nanoseconds doing nothing, whereas it could be executing 1 million instructions for another task. When a program does a lot of I/O, those wasted cycles stack up and become noticeable if you let the CPU wait and do nothing. This is why it's a good idea to overlap computation with I/O, so CPU cycles aren't wasted.
This is also why your computer becomes super unresponsive when your main memory is full and the CPU has to page frequently to the disk. Your CPU cannot perform any useful task until the data it needs has been retrieved from the disk into main memory, so it must sit around and wait for the I/O to complete.
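Both points can be illustrated with a short Python sketch. Everything in it is a stand-in: `slow_read` simulates a disk read with `time.sleep`, `compute` is arbitrary CPU work, and the names are invented for this example. The first helper reproduces the 1 GHz / 1 ms arithmetic, and the second part starts the simulated read in a background thread so the computation runs while the read is pending.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wasted_cycles(clock_hz, wait_ns):
    """Clock cycles that pass while the CPU waits wait_ns nanoseconds."""
    return clock_hz * wait_ns // 1_000_000_000


def slow_read():
    """Stand-in for a disk read: blocks without using the CPU."""
    time.sleep(0.05)
    return b"data"


def compute():
    """Stand-in for useful CPU work that does not depend on the read."""
    return sum(i * i for i in range(100_000))


def run_overlapped():
    """Start the read in the background and compute while it is in flight."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(slow_read)
        result = compute()        # runs while the read is still pending
        data = pending.result()   # blocks only if the read has not finished yet
    return data, result


print(wasted_cycles(1_000_000_000, 1_000_000))  # 1000000
data, result = run_overlapped()
print(data, result)
```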

Xcode Instruments CPU time

If I run an application with the performance test, the "CPU monitor" shows me some information like process ID/name or CPU time. But in which unit of time does it measure?
For example: if I get 05.04, what does that mean?
Best Regards

Plagiarized from http://en.wikipedia.org/wiki/CPU_time -
CPU time (or CPU usage, process time) is the amount of time for which a central processing unit (CPU) was used for processing instructions of a computer program, as opposed to, for example, waiting for input/output (I/O) operations.
The CPU time is often measured in clock ticks or seconds. CPU time is also mentioned as percentage of the CPU's capacity at any given time on multi-tasking environment. That helps in figuring out how a CPU’s computational power is being shared among multiple computer programs.
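To make the distinction between CPU time and elapsed time concrete, here is a small Python sketch. It says nothing about how Instruments itself measures; it only uses the standard library's `time.process_time` (CPU time of the current process, in seconds) and `time.perf_counter` (wall-clock time, in seconds). The `measure` helper is a name invented for this example.

```python
import time


def measure(work):
    """Run work() and return (CPU seconds used by this process, wall-clock seconds elapsed)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall


# Waiting: wall-clock time passes, but almost no CPU time is consumed.
cpu_wait, wall_wait = measure(lambda: time.sleep(0.05))

# Computing: CPU time and wall-clock time advance together.
cpu_busy, wall_busy = measure(lambda: sum(i * i for i in range(200_000)))

print(f"waiting:   cpu={cpu_wait:.4f}s wall={wall_wait:.4f}s")
print(f"computing: cpu={cpu_busy:.4f}s wall={wall_busy:.4f}s")
```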
