What does "cpu_time" represent exactly in libvirt? - virtualization

I can pull the following CPU values from libvirt:
virsh domstats vm1 --cpu-total
Domain: 'vm1'
cpu.time=6173016809079111
cpu.user=26714880000000
cpu.system=248540680000000
virsh cpu-stats vm1 --total
Total:
cpu_time 6173017.263233824 seconds
user_time 26714.890000000 seconds
system_time 248540.700000000 seconds
What does the cpu_time figure represent here exactly?
I'm looking to calculate CPU utilization as a percentage using this data.
Thanks

This was a surprisingly difficult question to answer! After poring over the kernel code for a good while I've figured out what's going on here, and it's quite nice to finally understand it.
Normally for a process on Linux, the overall CPU usage is simply the sum of the time spent in userspace and the time spent in kernel space, so naively one would have expected user_time + system_time to equal cpu_time. What I've discovered is that Linux tracks time spent by vCPU threads executing guest code separately from both userspace and kernelspace time.
Thus cpu_time == user_time + system_time + guest_time
So you can think of system_time + user_time as the overhead of QEMU / KVM on the host side, and cpu_time - (user_time + system_time) as the actual amount of time the guest OS was running its CPUs.
To calculate CPU usage, you probably just want to record cpu_time every N seconds and calculate the delta between two samples, e.g. usage % = 100 * (cpu_time_2 - cpu_time_1) / N. Note that cpu_time accumulates across all vCPUs, so for a multi-vCPU guest you may also want to divide by the number of vCPUs to keep the figure in the 0-100% range.
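For illustration, here is a minimal C sketch of that delta calculation (not code from libvirt itself). It assumes you already have two cpu.time samples in nanoseconds, as reported by virsh domstats, taken a known number of seconds apart; the sample values, the cpu_percent helper and the vCPU normalisation are all just part of this example.

#include <stdio.h>

/* Percentage of host CPU used by the guest between two cpu.time samples.
 * t1_ns, t2_ns: cumulative cpu.time in nanoseconds; interval_s: seconds between
 * the samples; ncpus: number of vCPUs to normalise against (use 1 if you are
 * happy with values above 100% on multi-vCPU guests). */
static double cpu_percent(unsigned long long t1_ns, unsigned long long t2_ns,
                          double interval_s, int ncpus)
{
    double used_s = (double)(t2_ns - t1_ns) / 1e9;   /* nanoseconds -> seconds */
    return 100.0 * used_s / (interval_s * ncpus);
}

int main(void)
{
    /* Illustrative numbers only: 2 seconds of CPU burned over a
     * 10-second sampling interval on a 1-vCPU guest = 20%. */
    printf("%.1f%%\n", cpu_percent(6173016809079111ULL,
                                   6173018809079111ULL, 10.0, 1));
    return 0;
}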

As per master pulled 2018-07-10 from https://github.com/libvirt/libvirt/ and as far as QEMU/KVM is concerned, it comes down to:
cpu.time = cpuacct.usage cgroup metric
cpu.{user,system} = cpuacct.stat cgroup metrics
One problem you may encounter is that guest load = total (cpu.time) load - system load - user load sometimes yields negative values (?!?). Example for a running QEMU/KVM guest (values in seconds), with the Debian 9 stock kernel (4.9):
time                   system     user      total
2018-07-10T13:19:20Z   62308.67   9278.59   107968.33
2018-07-10T13:20:20Z   62316.08   9279.73   107970.73
delta                  7.41       1.14      2.40       (2.40 < 7.41 + 1.14 ?!?)
A kernel bug? (At least one other person has experienced something similar: https://lkml.org/lkml/2017/11/1/101)
One thing is certain: cpuacct.usage and cpuacct.stat use different logic to gather their metrics; this might explain the discrepancy (?).

Unfortunately, the above answers are NOT correct where the cpuacct controller is concerned for a KVM guest:
cpu_time == user_time + system_time + guest_time (<-- wrong)
If you run a CPU-intensive benchmark in the VM and compare it with an I/O- or network-intensive benchmark, you'll see that "guest time" does not match up in the formula.
Guest time (according to /proc/<pid>/stat) represents ONLY the time the vCPUs spent running guest code (i.e. while not exiting or handling I/O).
The cpuacct controller's top-level parent directory for each KVM/libvirt guest accounts for both the "emulator" and "vcpuX" sub-directories in their totality, including vhost kernel threads and non-vCPU pthreads running inside the QEMU main process, not just guest time or user/system time.
That makes the above formula wrong. The correct formula would be:
guest_time = sum(vcpuX => cpuacct.usage) - sum(vcpuX => cpuacct.stat user + cpuacct.stat system)
You cannot simply use the top-level parent files to calculate guest time. That would be totally inaccurate under any I/O bound workload.
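A rough C sketch of that per-vCPU calculation follows. It assumes cgroup v1 with the cpuacct controller and a libvirt-style layout in which each guest cgroup contains vcpu0, vcpu1, ... sub-directories; the base path below is an assumption and varies with the libvirt/systemd version. cpuacct.usage is reported in nanoseconds, while cpuacct.stat reports user/system in USER_HZ ticks, hence the conversion.

#include <stdio.h>
#include <unistd.h>

/* Read a single integer from a file, or return -1 if it cannot be read. */
static long long read_ll(const char *path)
{
    long long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%lld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    /* Assumed/hypothetical cgroup path for the guest; adjust for your system. */
    const char *base = "/sys/fs/cgroup/cpu,cpuacct/machine.slice/"
                       "machine-qemu\\x2d1\\x2dvm1.scope";
    double tick_ns = 1e9 / sysconf(_SC_CLK_TCK);   /* one USER_HZ tick in ns */
    double guest_ns = 0;
    char path[512];

    for (int i = 0; ; i++) {
        snprintf(path, sizeof(path), "%s/vcpu%d/cpuacct.usage", base, i);
        long long usage = read_ll(path);
        if (usage < 0)
            break;                                 /* no more vcpuN directories */

        long long user = 0, sys = 0;
        snprintf(path, sizeof(path), "%s/vcpu%d/cpuacct.stat", base, i);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "user %lld system %lld", &user, &sys) != 2)
                user = sys = 0;
            fclose(f);
        }

        /* Guest time for this vCPU = total accounted time minus the host-side
         * user and system time charged to the same cgroup. */
        guest_ns += (double)usage - (double)(user + sys) * tick_ns;
    }

    printf("guest time: %.2f s\n", guest_ns / 1e9);
    return 0;
}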

Related

HW IO and CPU low jitter application

I have a hardware I/O task (write/read serial message) that has a strict jitter requirement of less than 200 microseconds. I need to be able to isolate both CPU core(s) and the hardware interrupt.
I have tried 2 things that have helped but not gotten me all the way there.
Using <termios.h> to configure the tty device. Specifically setting VMIN=packet_size and VTIME=0
Isolcpus kernel argument in /etc/default/grub and running with taskset
I am still seeing upwards of 5 ms (5000 us) of jitter on my serial reads. I tested this same application with pseudo serial devices (created by socat) to eliminate the HW variable but am still seeing high jitter.
My test application right now just opens a serial connection, configures it, then does a while loop of writes/reads.
Could use advice on how to bring jitter down to 200 us or less. I am considering moving to a dual boot RTOS/Linux with shared memory, but would rather solve on one OS.
Real Application description:
Receive message from USB serial
Get PTP (precision time protocol) time within 200 us of receiving the first bit
Write packet received along with timestamp to shared memory buffer shared with a python application: <timestamp, packet>.
Loop.
Also on another isolated HW/core:
Read some <timestamp, packet> from a shared memory buffer
Poll PTP time until <timestamp>
Transmit <packet> within 200 us of <timestamp> over USB serial
Loop
To reduce process latency/jitter:
Isolate some cores
/etc/default/grub... isolcpus=0
Never interrupt RT tasks
set /proc/sys/kernel/sched_rt_runtime_us to -1
Run high priority on isolated core
schedtool -a 0 -F -p 99 -n -20 -e $CMD
To reduce serial latency/jitter
File descriptor options O_SYNC
Ioctl ASYNC_LOW_LATENCY
Termios VMIN = message size and VTIME = 0
Use tcdrain after issuing write commands
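For reference, here is a minimal C sketch (not the poster's code) combining the serial-side settings listed above: O_SYNC on open, ASYNC_LOW_LATENCY via the TIOCGSERIAL/TIOCSSERIAL ioctls, VMIN set to the message size with VTIME = 0, and tcdrain() after the write. The device path and packet size are placeholders.

#include <fcntl.h>
#include <termios.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/serial.h>

#define PACKET_SIZE 32              /* placeholder message size */

static int open_low_latency(const char *dev)
{
    int fd = open(dev, O_RDWR | O_NOCTTY | O_SYNC);
    if (fd < 0)
        return -1;

    /* Ask the driver to bypass its internal latency timer. */
    struct serial_struct ser;
    if (ioctl(fd, TIOCGSERIAL, &ser) == 0) {
        ser.flags |= ASYNC_LOW_LATENCY;
        ioctl(fd, TIOCSSERIAL, &ser);
    }

    /* Raw mode; read() blocks until a full packet is buffered (VMIN), no timer (VTIME). */
    struct termios tio;
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);
    tio.c_cc[VMIN]  = PACKET_SIZE;
    tio.c_cc[VTIME] = 0;
    tcsetattr(fd, TCSANOW, &tio);
    return fd;
}

int main(void)
{
    unsigned char buf[PACKET_SIZE] = { 0 };
    int fd = open_low_latency("/dev/ttyUSB0");   /* placeholder device */
    if (fd < 0)
        return 1;

    write(fd, buf, sizeof(buf));
    tcdrain(fd);                     /* wait until the bytes have actually left */
    read(fd, buf, sizeof(buf));      /* returns once PACKET_SIZE bytes arrived */

    close(fd);
    return 0;
}

On USB-serial adapters the ASYNC_LOW_LATENCY flag typically shrinks the driver's latency timer (for example from around 16 ms to 1 ms on FTDI parts), which is often the single biggest win for read jitter.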

Can a process ask for x amount of time but take y amount instead?

If I am running a set of processes and they all want these burst times: 3, 5, 2 respectively, with the total expected time of execution being 10 time units.
Is it possible for one of the processes to take up more than what it asked for? For example, even though it asked for 3 it took 11 instead, because it was waiting on the user to enter some input, so the total execution time turns out to be 18.
This is all under a non-preemptive CPU scheduler.
The reality is that software has no idea how long anything will take. My CPU runs at a different "nominal speed" to your CPU, both our CPUs keep changing their speed for power management reasons, and the speed of software executed by both our CPUs is affected by things like what other CPUs are doing (especially for SMT/hyper-threading) and what other devices happen to be doing at the time (their effect on caches, shared RAM bandwidth, etc). Software also can't predict the future (e.g. guess when an IRQ will occur, take some time and upset the cache contents; guess when a read from memory will take 10 times longer because there was a single-bit error that ECC needed to correct; guess when the CPU will get hot and reduce its speed to avoid melting; etc). It is possible to record things like "start time, burst time and end time" as they happen (to generate historical data that can be analysed later), but typically these things are only seen in fabricated academic exercises that have nothing to do with reality.
Note: I'm not saying fabricated academic exercises are bad - they're a useful tool to help learn basic theory before moving on to more advanced (and more realistic) theory.
Instead, for a non-preemptive scheduler, tasks don't try to tell the scheduler how much time they think they might take - the task can't know this information and the scheduler can't do anything with it (e.g. a non-preemptive scheduler can't preempt a task when it takes longer than it guessed it might take). For a non-preemptive scheduler, a task simply runs until it calls a kernel function that waits for something (e.g. read() that waits for data from disk or network, sleep() that waits for time to pass, etc). When that happens, the kernel function that was called ends up telling the scheduler that the task is waiting and doesn't need the CPU, and the scheduler finds a different task to run that can use the CPU. If the task never calls a kernel function that waits for something, then the task runs "forever".
Of course "the task runs forever" can be bad (not just for malicious code that deliberately hogs all CPU time as a denial of service attack, but also for normal tasks that have bugs), which is why (almost?) nobody uses non-preemptive schedulers. For example; if one (lower priority) task is doing a lot of heavy processing (e.g. spending hours generating a photo-realistic picture using ray tracing techniques) and another (higher priority) task stops waiting (e.g. because it was waiting for the user to press a key and the user did press a key) then you want the higher priority task to preempt the lower priority task "immediately" (e.g. because most users don't like it when it takes hours for software to respond to their actions).
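To make the "runs until it waits" idea concrete, here is a tiny, purely illustrative C sketch of a non-preemptive run loop; the task names and structure are invented for this example and are not taken from any real kernel. The point is that the loop only regains control when a task's function returns, i.e. when the task voluntarily blocks, yields or finishes.

#include <stdio.h>
#include <stdbool.h>

#define NTASKS 2

/* Each task runs until it blocks or yields (returns). Returning true means
 * "I will want the CPU again later"; returning false means "I'm done". */
typedef bool (*task_fn)(void);

static bool compute_task(void)
{
    static int bursts = 3;
    printf("compute: running a burst of work\n");
    return --bursts > 0;                  /* finishes after a few bursts */
}

static bool io_task(void)
{
    printf("io: issued a read, now waiting for data\n");
    return false;                         /* "blocks" and never runs again here */
}

int main(void)
{
    task_fn tasks[NTASKS] = { compute_task, io_task };
    bool runnable[NTASKS] = { true, true };
    int remaining = NTASKS;

    /* Non-preemptive loop: the scheduler only regains control when a task's
     * function returns. A task that never returned would run "forever" and
     * starve everything else - exactly the problem described above. */
    while (remaining > 0) {
        for (int i = 0; i < NTASKS; i++) {
            if (!runnable[i])
                continue;
            if (!tasks[i]()) {            /* task finished / blocked for good */
                runnable[i] = false;
                remaining--;
            }
        }
    }
    return 0;
}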

Sensu Scheduler Oddness

I run < 24 checks on my systems. Servers are not regularly heavily loaded. Load averages keep well under 1 during normal operation.
I have noticed a recurring issue where the check-cpu check would start reporting high load averages on systems where there was no organic cause for high load. Further investigation showed the high-load reports were actually due to the check-cpu script running in parallel with other checks. Outside of check execution, CPU load was fine.
I upgraded from sensu 0.20 to 0.23 and continued to observe the same issue.
We found that restarting the sensu-server and sensu-client services would resolve the problem for a period of time (approximately 24 hours) and then it would return.
We theorized at this point that there must be some sort of time delay in the dispatch/execution of the checks on the host which causes this overlap to eventually occur.
All checks are set to run at an interval of 30 or 60.
I decided to set the interval of the check-cpu check to 83, and the issue has not occurred since, presumably because the check-cpu check no longer coincides with any other checks and therefore doesn't see high CPU load during that brief moment.
Is this some sort of inherent scheduling issue with sensu? Is it supposed to know how to dispatch checks with adequate spacing, or is this something that should be controlled by the interval parameter?
Thanks!
I have noticed that the checks drift in execution time, i.e. they do not run exactly every 30 seconds but every 30.001 s or something like that. I guess the drift might be different for different checks, so eventually you will run into the situation where the checks sync up and all run at the same time, causing the problem. Running more checks at regular intervals (30 s, 60 s, etc.) will make this happen more often. If you want this changed you have to report it to Sensu directly. I think they might fix it eventually, since they presumably want the system to be scalable.

Scheduling policies in Linux Kernel

Can there be more than two scheduling policies working at the same time in the Linux kernel?
Can FIFO and Round Robin be working on the same machine?
Yes, Linux supports no less than 4 different scheduling policies for tasks: SCHED_BATCH, SCHED_OTHER (the default "fair" policy, implemented by CFS), SCHED_FIFO and SCHED_RR.
Regardless of scheduling policy, all tasks also have a static priority (which is 0 for batch and fair tasks and 1-99 for the real-time policies FIFO and RR). Tasks are first and foremost picked by priority - the highest priority wins.
However, when several tasks with the same priority are available for running, that is where the scheduling policy kicks in: a fair task will only run for its allotted weighted share of the CPU time relative to other fair tasks (with the weight coming from a soft priority called the task's nice level), an RR task will run for a fixed time slice before yielding to another task of the same priority (higher-priority tasks always win), and a FIFO task will run until it blocks or yields, disregarding other tasks with the same priority.
Please note that what I wrote above is accurate but not complete: it does not take into account advanced CPU reservation features, but it gives the gist of how the different scheduling policies interact with each other.
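As a quick illustration of how a program selects one of these policies, here is a small C sketch using sched_setscheduler() to switch the calling process to SCHED_FIFO at static priority 50 (the priority value is arbitrary for this example; the call needs root or CAP_SYS_NICE).

#include <stdio.h>
#include <sched.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 50 };   /* 1-99 for FIFO/RR */

    /* Switch the calling process (pid 0 = self) to the real-time FIFO policy. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("now running under SCHED_FIFO, static priority %d\n", sp.sched_priority);
    return 0;
}

The same thing can be done from the shell with chrt, e.g. chrt -f 50 ./myprog.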
Yes. Modern operating systems apply different scheduling policies at different stages: for example, newly arriving processes are often queued FIFO, while round robin is generally applied once processes are competing for the CPU.

What is the size of the ready queue in Linux?

Yesterday I learned in my Advanced Operating Systems class that there is a limit on the number of processes that can be placed in the ready queue. I would like to know that number for different operating systems, and also what happens when that number is exceeded. Meaning: what if more than that number of processes are created?
I tried to see what happens by running a small program:
#include <unistd.h>

int main(void)
{
    while (1)
        fork();   /* every iteration spawns another copy of this process */
    return 0;
}
The system immediately hung. Can anyone explain why my system hung?
Some systems place no limit and will simply keep appending to the run queue as needed. There are options to restrict the maximum number of processes, but by default there are no restrictions (on some systems). On Linux you can change the ulimit for the number of processes per user; if you set it to something like 500 or less, you will see that this program will not hang the system - it will simply run and use up CPU cycles doing constant context switches.
By the way, what you're doing there is called a fork bomb, and it is a simple exploit used to cause a denial-of-service attack on a computer that does not limit the number of processes per user.
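For completeness, here is a small C sketch of how such a per-user process limit can be inspected and lowered programmatically with getrlimit()/setrlimit() on RLIMIT_NPROC (the same limit that ulimit -u adjusts); the value 500 is just an example. Once the limit is reached, fork() fails with EAGAIN instead of exhausting the system.

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* Read the current per-user process limit (what "ulimit -u" reports). */
    if (getrlimit(RLIMIT_NPROC, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("RLIMIT_NPROC: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    /* Lower the soft limit to 500 processes; after this, a runaway fork loop
     * started by this process (and its children) hits EAGAIN instead of
     * hanging the machine. */
    rl.rlim_cur = 500;
    if (setrlimit(RLIMIT_NPROC, &rl) != 0)
        perror("setrlimit");
    return 0;
}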