Real-time process missing deadline with SCHED_RR

I have the following configuration on an ARMv7 embedded OMAP system:
sched_rt_period_us = 1000000 (1 sec)
sched_rt_runtime_us = 950000 (0.95 sec)
I have 4 real-time processes running with SCHED_RR at priority 1, and sched_rr_get_interval() returned 93750000 nanoseconds, i.e. 0.09375 sec, on this system.
I have added a new process with SCHED_RR, priority 1, and the same default RR interval of 0.09375 sec.
According to this configuration:
Every second, each of the 5 RT processes should execute twice (0.09375 * 10 = 0.9375 sec), and the rest of the 1-second interval is left for non-RT tasks, i.e. 1.0 - 0.9375 = 0.0625 sec.
But from what I observe, the 5th, newly added task misses its timeline: it only executes sporadically and produces output every 1 second or at indeterminate intervals. Please help me make this new process deterministic so that it executes twice per second as per the configuration above.
I tried a static priority of 2 and also tried SCHED_FIFO, but got the same results.
Or is there anything I am missing in these calculations?
I am using:
Linux xxxx 2.6.33 #2 PREEMPT Tue Aug 14 16:13:05 CEST 2012 armv7l GNU/Linux

Are you sure the scheduler isn't failing simply because it cannot honor the scheduling requests? I mean, could that fifth task be missing its deadline because the system is too heavily loaded?
As far as I know, sched_setscheduler() has no way to signal that the system load is too heavy. To know whether the system can actually meet the request, you need a different scheduling algorithm, such as EDF (earliest deadline first); you may want to look at its implementation for Linux.
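As a first sanity check, you can read the throttling knobs and the RR quantum you quoted directly from a small script. This is only a sketch (Python 3 os wrappers on Linux; the PID and priority are placeholders, not taken from your code), but it lets you confirm what the kernel is actually granting:

import os

def read_sysctl(name):
    # The RT throttling knobs from the question live under /proc/sys/kernel/.
    with open("/proc/sys/kernel/" + name) as f:
        return int(f.read())

period_us = read_sysctl("sched_rt_period_us")    # 1000000 in the question
runtime_us = read_sysctl("sched_rt_runtime_us")  # 950000 in the question
print("RT tasks may use %.0f%% of each period" % (100.0 * runtime_us / period_us))

pid = os.getpid()  # placeholder: use the PID of one of your RT processes

# Set SCHED_RR with priority 1 (needs root); mirrors the setup described above.
os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(1))

# The round-robin quantum the kernel actually grants (0.09375 s in the question).
print("RR quantum: %.6f s" % os.sched_rr_get_interval(pid))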


How does JMeter start sending requests to the server?

If the configuration is Threads: 100, Ramp-up: 1, and Loop Count: 1, how will JMeter start sending requests to the server?
Will requests be sent at 1 req/sec, or will all requests be sent to the server at once?
JMeter will send requests as fast as it can, to wit:
It will start all threads (virtual users) you define in Thread Group within the ramp-up period (in your case - 100 threads in 1 second)
Each thread (virtual user) will start executing the Samplers present in the Thread Group from top to bottom (or according to the Logic Controllers)
When there are no more samplers to execute or loops to iterate the thread will be shut down
When there are no more active threads left - JMeter test will end.
With regard to requests per second: it mostly depends on your application's response time (see the sketch after this list), i.e.
if you have 100 virtual users and response time is 1 second - you will get 100 requests/second
if you have 100 virtual users and response time is 2 seconds - you will get 50 requests/second
if you have 100 virtual users and response time is 500 milliseconds - you will get 200 requests/second
etc.
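To make that arithmetic explicit, here is a tiny sketch (plain Python; the thread count and response times are just the example figures above, not measured values):

# Steady-state throughput for a fixed number of always-busy virtual users.
def expected_throughput(threads, response_time_s):
    return threads / response_time_s

for rt_s in (1.0, 2.0, 0.5):
    print("100 users, %.1f s response time -> %.0f requests/second"
          % (rt_s, expected_throughput(100, rt_s)))
# -> 100, 50 and 200 requests/second respectively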
I would recommend increasing (and decreasing) the load gradually; this way you will be able to correlate the increasing load with throughput, response time, number of errors, etc. Releasing all threads at once will not tell you the full story (unless you're doing a form of spike testing; in that case consider using the Synchronizing Timer).
Setting JMeter's ramp-up period to 1 means starting all 100 threads within 1 second.
These are not recommended settings, as described below:
The ramp-up period tells JMeter how long to take to "ramp-up" to the full number of threads chosen. If 10 threads are used, and the ramp-up period is 100 seconds, then JMeter will take 100 seconds to get all 10 threads up and running. Each thread will start 10 (100/10) seconds after the previous thread was begun. If there are 30 threads and a ramp-up period of 120 seconds, then each successive thread will be delayed by 4 seconds.
Ramp-up needs to be long enough to avoid too large a work-load at the start of a test, and short enough that the last threads start running before the first ones finish (unless one wants that to happen).
Start with Ramp-up = number of threads and adjust up or down as needed.
See also: Can i set ramp up period 0 in JMeter?
Bear in mind that with a low ramp-up and many threads, you may be limited by local resources, so your results may be a measurement of client capability rather than server capability.
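To see how the ramp-up value spreads thread starts out, a quick sketch (plain Python, using the numbers from the quoted manual text):

# Each of N threads starts ramp_up / N seconds after the previous one.
def start_offsets(threads, ramp_up_s):
    return [i * ramp_up_s / threads for i in range(threads)]

print(start_offsets(10, 100)[:3])   # [0.0, 10.0, 20.0] -> one thread every 10 s
print(120 / 30)                     # 30 threads over 120 s -> one every 4 s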

What does "cpu_time" represent exactly in libvirt?

I can pull the following CPU values from libvirt:
virsh domstats vm1 --cpu-total
Domain: 'vm1'
  cpu.time=6173016809079111
  cpu.user=26714880000000
  cpu.system=248540680000000

virsh cpu-stats vm1 --total
Total:
  cpu_time      6173017.263233824 seconds
  user_time       26714.890000000 seconds
  system_time    248540.700000000 seconds
What does the cpu_time figure represent here exactly?
I'm looking to calculate CPU utilization as a percentage using this data.
Thanks
This was a surprisingly difficult question to answer! After poring over the kernel code for a good while I've figured out what's going on here, and it's quite nice to understand it.
Normally for a process on Linux, the overall CPU usage is simply the sum of the time spent in userspace and the time spent on kernel space. So naively one would have expected user_time + system_time to equal cpu_time. What I've discovered is that Linux tracks time spent by vCPU threads executing guest code separately from either userspace or kernelspace time.
Thus cpu_time == user_time + system_time + guest_time
So you can think of system_time + user_time as giving the overhead of QEMU / KVM on the host side. And cpu_time - (user_time + system_time) as giving the actual amount of time the guest OS was running its CPUs.
To calculate CPU usage, you probably just want to record cpu_time every N seconds and calculate the delta between two samples, e.g. usage % = 100 * (cpu_time_2 - cpu_time_1) / N, with cpu_time in seconds (note that cpu.time from domstats is reported in nanoseconds).
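A minimal sketch of that sampling approach (assuming the libvirt Python bindings, a qemu:///system connection and a domain named 'vm1' as in the question; the result can exceed 100% for multi-vCPU guests, so divide by the host CPU count if you want a per-host percentage):

import time
import libvirt  # libvirt Python bindings

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("vm1")

N = 5  # sampling interval in seconds

# getCPUStats(True) returns aggregate stats; cpu_time is in nanoseconds.
t1 = dom.getCPUStats(True)[0]["cpu_time"]
time.sleep(N)
t2 = dom.getCPUStats(True)[0]["cpu_time"]

print("CPU usage: %.1f%%" % (100.0 * (t2 - t1) / (N * 1e9)))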
As per master pulled 2018-07-10 from https://github.com/libvirt/libvirt/ and as far as QEMU/KVM is concerned, it comes down to:
cpu.time = cpuacct.usage cgroup metric
cpu.{user,system} = cpuacct.stat cgroup metrics
One problem you may encounter is that guest load = time load - system load - user load sometimes yields negative values (?!?). Example for a running QEMU/KVM guest (values are in seconds), with the Debian 9 stock kernel (4.9):
time                   system     user      total
2018-07-10T13:19:20Z   62308.67   9278.59   107968.33
2018-07-10T13:20:20Z   62316.08   9279.73   107970.73
delta                      7.41      1.14        2.40   (2.40 < 7.41 + 1.14 ?!?)
A kernel bug? (At least one other person has experienced something similar: https://lkml.org/lkml/2017/11/1/101)
One thing is certain: cpuacct.usage and cpuacct.stat use different logic to gather their metrics; this might explain the discrepancy (?).
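If you want to compare the two cgroup sources yourself, here is a rough sketch (the cgroup v1 mount point and per-guest path are assumptions; find the real path with cat /proc/<qemu-pid>/cgroup):

import os

base = "/sys/fs/cgroup/cpuacct/machine.slice/<guest>.scope"  # placeholder path

def read(name):
    with open(os.path.join(base, name)) as f:
        return f.read()

cpu_time_s = int(read("cpuacct.usage")) / 1e9        # nanoseconds -> cpu.time
stat = dict(line.split() for line in read("cpuacct.stat").splitlines())
tick = os.sysconf("SC_CLK_TCK")                      # usually 100 ticks/second
user_s = int(stat["user"]) / tick                    # -> cpu.user
system_s = int(stat["system"]) / tick                # -> cpu.system
print(cpu_time_s, user_s, system_s)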
Unfortunately, the above answers are NOT correct with regard to the CPUACCT controller for a KVM guest:
cpu_time == user_time + system_time + guest_time (<-- wrong)
If you run a CPU-intensive benchmark compared to an I/O or network-intensive benchmark in the VM, you'll see that "guest time" does not match up in the formula.
Guest time (according to /proc/<pid>/stat) represents ONLY the time the vCPUs spend running guest code (while not exiting or handling I/O).
The CPUACCT controller's top-level parent directory for each KVM/libvirt guest includes both the time spent in the "emulator" and "vcpuX" sub-directories in their totality, including vhost kernel threads and non-vCPU pthreads running inside the QEMU main process, not just guest time or user/system time.
That makes the above formula wrong. The correct formula would be:
guest_time = sum over all vcpuX children of cpu.time (cpuacct.usage) minus sum over all vcpuX children of (cpuacct.stat user + cpuacct.stat system)
You cannot simply use the top-level parent files to calculate guest time. That would be totally inaccurate under any I/O bound workload.
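In code, the corrected formula looks roughly like this sketch (the per-guest cgroup layout with one "vcpuN" sub-directory per vCPU is an assumption; adjust "base" to your host's actual path):

import glob, os

base = "/sys/fs/cgroup/cpuacct/machine.slice/<guest>.scope"  # placeholder path
tick = os.sysconf("SC_CLK_TCK")

guest_time = 0.0
for vcpu in glob.glob(os.path.join(base, "vcpu*")):
    with open(os.path.join(vcpu, "cpuacct.usage")) as f:
        total_s = int(f.read()) / 1e9                    # all time on this vCPU
    with open(os.path.join(vcpu, "cpuacct.stat")) as f:
        stat = dict(line.split() for line in f)
    host_s = (int(stat["user"]) + int(stat["system"])) / tick
    guest_time += total_s - host_s                       # time spent running guest code

print("guest_time: %.2f s" % guest_time)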

Is this an intelligent use case for OptaPlanner?

I'm trying to clean up an enterprise BI system that currently uses a prioritized FIFO scheduling algorithm (so a priority 4 report from Tuesday will be executed before priority 4 reports from Thursday and before priority 3 reports from Monday). Additional details:
The queue is never empty, jobs are always being added
Jobs range in execution time from under a minute to upwards of 24 hours
There are 40-some-odd identical app servers used to execute jobs
I think I could get OptaPlanner up and running for this scenario, with hard rules around priority and some soft rules around average time in the queue. I'm new to scheduling optimization, so I guess my question is: what should I be looking for in this situation to decide whether OptaPlanner is going to help me or not?
The problem looks like a form of bin packing (and possibly job shop scheduling), which are NP-complete, so OptaPlanner will do better than a FIFO algorithm.
But is it really NP-complete? If all of these conditions are met, it might not be:
All 40 servers are identical. So running a priority report on server A instead of server B won't deliver a report faster.
All 40 servers are identical. So total duration (for a specific input set) is a constant.
Total makespan doesn't matter. So given 20 small jobs of 1 hour, 1 big job of 20 hours and 2 machines, it's fine if all the small jobs are done after 10 hours and only then the big job starts, giving a total makespan of 30 hours. There's no desire to reduce the makespan to 20 hours.
"The average time in the queue" is debatable: do you care about how long the jobs are in the queue until they are started, or until they are finished? If the total duration is a constant, this can be handled by merely FIFO'ing the small jobs first or last (while still respecting priority, of course).
There are no dependencies between jobs.
If all these conditions are met, OptaPlanner won't be able to do better than a correctly written greedy algorithm (one that schedules the highest-priority job that is the smallest/largest first, as sketched below). If any of these conditions isn't met (for example you buy 10 new servers which are faster), then OptaPlanner can do better. You just have to evaluate if it's worth spending 1 thread to figure that out.
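For reference, here is roughly what such a greedy baseline looks like (a sketch in Python, not OptaPlanner code; the "priority 4 beats priority 3" convention comes from the question, everything else is illustrative):

import heapq

def greedy_schedule(jobs, n_servers=40):
    # jobs: (priority, duration_hours, job_id); higher priority number = more urgent.
    servers = [0.0] * n_servers          # time at which each server becomes free
    heapq.heapify(servers)
    schedule = []
    # Highest priority first; within a priority, shortest job first.
    for prio, duration, job_id in sorted(jobs, key=lambda j: (-j[0], j[1])):
        start = heapq.heappop(servers)   # earliest-free server
        schedule.append((job_id, start, start + duration))
        heapq.heappush(servers, start + duration)
    return schedule

jobs = [(4, 0.5, "tue-report"), (4, 24, "thu-report"), (3, 2, "mon-report")]
for job_id, start, end in greedy_schedule(jobs, n_servers=2):
    print("%s runs from t=%.1fh to t=%.1fh" % (job_id, start, end))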
If you use OptaPlanner, definitely take a look at real-time scheduling and daemon mode, to replan as new reports enter the system.

Why is the speed of a variable-length pipeline determined by the slowest stage, and what is the total execution time of a program?

I am new to pipelining and I need some help regarding the statement that
the speed of a pipeline is determined by the speed of its slowest stage.
Moreover, if I am given a 5-stage pipeline whose stages take 5 ns, 10 ns, 8 ns, 7 ns and 7 ns respectively, it is said that each instruction effectively takes 10 ns.
Can I get a clear explanation for this?
Also, let my program have 3 instructions I1, I2, I3, and take the clock cycle duration to be 1 ns,
so that the above stages take 5, 10, 8, 7, 7 clock cycles respectively.
Now, according to theory, a snapshot of the pipeline would be:
But that gives me a total time of: number of clock cycles * clock cycle duration = 62 * 1 = 62 ns.
According to theory, however, the total time should be: (slowest stage) * (number of instructions) = 10 * 3 = 30 ns.
I have an idea of why the slowest stage is important (each pipeline stage needs to wait for it, hence one instruction is completed every 10 clock cycles), but the result is inconsistent when I calculate it using clock cycles. Why this inconsistency? What am I missing?
Assume a car manufacturing process that uses a two-stage pipeline. Say it takes 1 day to manufacture an engine and 2 days to manufacture the rest, and you can do both stages in parallel. What is your car output rate? It should be one car per 2 days: although you can finish an engine in 1 day, you have to wait another day for the rest of the car.
In your case, although the other stages finish their work in less time, you have to wait 10 ns before the pipeline can advance.
Staging allows you to work on "parts" of several operations at once.
I'll create a smaller example here, dropping the last 2 stages of your example: 5, 10, 8 ns
Let's take two operations going through these stages:

            stage 1    stage 2     stage 3
op 1        0-5 ns     5-15 ns     15-23 ns
op 2        5-10 ns    15-25 ns    25-33 ns

The first operation starts at time 0. As soon as it moves on to stage 2, the second operation can start its first stage. However, since the stages take different amounts of time, the longest one determines the runtime: op 1's third stage can only start after its second stage has completed, i.e. after 15 ns, and the same is true for the second stage of the second operation.
I am not sure about the source of your confusion. If one unit of your pipeline takes longer, the units behind it cannot push the pipeline ahead until that unit is finished, even though they themselves have finished their work. As DPG said, try to look at it through the car manufacturing line example; it is one of the most common ones used to explain a pipeline. Even if the units AHEAD of the slowest unit finish quicker, it still doesn't matter, because they have to wait until the slower unit finishes its work. So yes, your pipeline is executing 3 instructions for a total execution time of 30 ns.
Thank you all for your answers
I think I have got it clear by now.
This is what I think the answer is:
Question 1: Why is pipeline execution dependent on the slowest stage?
Clearly, from the diagram, each stage has to wait for the slowest stage to complete.
So the time after which each instruction completes is bounded by that wait
(in my example, one instruction completes every 10 ns).
Question 2: What is the total execution time of the program?
I wanted to know how long the particular program containing 3 instructions will take to execute,
NOT how long it will take for 3 instructions to execute, which is obviously 30 ns, owing to the fact
that an instruction gets completed every 10 ns.
Now suppose I1 is fetched into the pipeline while 4 other instructions are already executing in it.
Those 4 instructions are completed in 40 ns.
After that, I1, I2, I3 execute in order, taking 30 ns (assuming no pipeline stalls).
This gives a total of 40 + 30 = 70 ns.
In fact, for an n-instruction program on a k-stage pipeline,
I think the total time is (n + k - 1) * C * T,
where C = number of clock cycles in the slowest stage
and T = clock cycle time.
Please review my understanding, so I know whether I am thinking about anything incorrectly
and so that I can accept my own answer!
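To check the formula numerically, a small sketch (plain Python, using the stage timings from the question):

# Total time of an n-instruction program on a k-stage pipeline where every
# stage advances at the pace of the slowest stage.
def pipeline_time_ns(n_instructions, stage_cycles, clk_ns=1.0):
    k = len(stage_cycles)
    slowest = max(stage_cycles)                 # C, in clock cycles
    return (n_instructions + k - 1) * slowest * clk_ns

stages = [5, 10, 8, 7, 7]                       # cycles per stage, 1 ns clock
print(pipeline_time_ns(3, stages))              # (3 + 5 - 1) * 10 * 1 = 70 ns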

How to set sun grid engine scheduling policy to satisfy this?

We use Sun Grid Engine (actually Open Grid Scheduler) as our DRMS. Suppose we have 3 users: uA, uB, uC.
uA submits 100000 jobs, then uB submits 10 jobs, then uC submits 1 job. With the default scheduling policy, Grid Engine will run uA's 100000 jobs, then uB's 10, then uC's 1 job, so uB and uC have to wait a long time.
We hope the scheduler can select jobs to run like this:
first, select 1 of uA's jobs, 1 of uB's jobs, and 1 of uC's jobs
then, select 19 of uA's jobs and 19 of uB's jobs
then, select uA's remaining jobs
How to set the policy to fit this?
I have done this by setting up a share tree policy with a single user = default. You also need to set a rapid half-life decay on prioritization (I used 1 hour), and place 0 priority weight on job waiting time (put 100% on the share tree policy). I did this by poking around in qmon and experimenting with different values.
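For intuition about what that policy produces, here is a tiny sketch (plain Python, not SGE configuration): with equal shares and a short half-life, the dispatcher effectively round-robins over each user's pending jobs instead of draining one user's backlog first, which yields an interleaving like the one you describe.

from collections import deque

def fair_order(pending):
    # pending: {user: [job, ...]}; cycle over users, taking one job each per round.
    queues = {user: deque(jobs) for user, jobs in pending.items()}
    order = []
    while any(queues.values()):
        for q in queues.values():
            if q:
                order.append(q.popleft())
    return order

pending = {
    "uA": ["uA-job%d" % i for i in range(5)],   # imagine 100000 of these
    "uB": ["uB-job%d" % i for i in range(3)],   # and 10 of these
    "uC": ["uC-job0"],
}
print(fair_order(pending))
# ['uA-job0', 'uB-job0', 'uC-job0', 'uA-job1', 'uB-job1', 'uA-job2', ...]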