Round Robin scheduling and IO - operating-system

I'm unsure how Round Robin scheduling works with I/O Operations. I've learned that CPU bound processes are favoured by Round Robin scheduling, but what happens if a process finishes its time slice early?
Say we neglect the dispatching process itself and a process finishes its time slice early: will the scheduler schedule another process if it's CPU bound, or will the current process start its I/O operation and, since that isn't CPU bound, immediately switch to another (CPU bound) process afterwards? And if CPU bound processes are favoured, will the scheduler schedule ALL CPU bound processes until they are finished and only afterwards schedule the I/O processes?
Please help me understand.

There are two distinct schedulers: the CPU (process/thread ...) scheduler, and the I/O scheduler(s).
CPU schedulers typically employ some hybrid algorithms, because they certainly do regularly encounter both pre-emption and processes which voluntarily give up part of their time-slice. They must service higher-priority work quickly, while not "starving" anyone. (A study of the current Linux scheduler is most interesting. There have been several.)
CPU schedulers identify processes as being either "primarily 'I/O-bound'" or "primarily 'CPU-bound'" at this particular time, knowing that their characteristics can and do change. If your process repeatedly consumes full time slices, it is seen as CPU-bound.
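To make the time-slice behaviour concrete, here is a toy round-robin simulation in Python. It is only a sketch under stated assumptions, not how any real kernel works: one CPU, no dispatch overhead, and every name and number (`round_robin`, the burst lengths, the 10 ms quantum) is made up for illustration. A process that blocks for I/O before its quantum expires simply gives up the CPU, and the next ready process is dispatched immediately.

```python
from collections import deque
import heapq

def round_robin(procs, quantum):
    """Simulate round-robin on one CPU (toy model, hypothetical names).

    procs maps a name to (cpu_burst, io_time, bursts): the process runs
    `bursts` CPU bursts of `cpu_burst` ms, blocking for `io_time` ms
    between bursts. Returns a list of (start, end, name) CPU slices.
    """
    ready = deque(procs)                       # FIFO ready queue of names
    left_in_burst = {n: p[0] for n, p in procs.items()}
    bursts_left = {n: p[2] for n, p in procs.items()}
    blocked = []                               # min-heap of (wake_time, name)
    now, timeline = 0, []

    while ready or blocked:
        # Processes whose I/O has completed rejoin the back of the queue.
        while blocked and blocked[0][0] <= now:
            ready.append(heapq.heappop(blocked)[1])
        if not ready:                          # CPU idles until the next wake-up
            now = blocked[0][0]
            continue
        name = ready.popleft()
        run = min(quantum, left_in_burst[name])  # may use less than a full slice
        timeline.append((now, now + run, name))
        now += run
        left_in_burst[name] -= run
        if left_in_burst[name] > 0:
            ready.append(name)                 # quantum expired: pre-empted
        else:
            bursts_left[name] -= 1
            if bursts_left[name] > 0:          # burst done: block for I/O
                left_in_burst[name] = procs[name][0]
                heapq.heappush(blocked, (now + procs[name][1], name))
    return timeline
```

Calling `round_robin({"cpu": (100, 0, 1), "io": (2, 20, 3)}, 10)` yields slices that begin `(0, 10, 'cpu'), (10, 12, 'io'), (12, 22, 'cpu')`: the I/O-bound process uses only 2 ms of its 10 ms turn and the CPU-bound one is dispatched right away. That is the sense in which plain round-robin "favours" CPU-bound work: it gets a full quantum per turn, the I/O-bound process does not. Nobody waits for all CPU-bound processes to finish.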
I/O schedulers seek to order and re-order the I/O request queues for maximum efficiency, for instance to keep the read/write head of a physical disk drive moving efficiently in a single direction. (The two components of disk-drive delay are "seek time" and "rotational latency," with "seek time" being by far the worse of the two. Per contra, solid-state drives have very different timing.) I/O schedulers also have to be aware of the channels (disk interface cards, cabling, etc.) that provide access to each device: they can't simply watch what any one drive is doing. As with the CPU scheduler, requests must be efficiently handled but never "starved." Linux's I/O schedulers are also readily available for your study.
"Pure round-robin," as a scheduling discipline, simply means that all requests have equal priority and will be serviced sequentially in the order that they were originally submitted. Very pretty birds though they are, you rarely encounter Pure Robins in real life.

Related

Throughput vs latency in computer architecture

I've come across articles on "through-put vs latency" in contexts like networking e.g. https://homepage.cs.uri.edu/~thenry/resources/unix_art/ch12s04.html But in the context of computer architecture / operating systems, I'm not able to understand why would there be a trade-off between latency (response time of a program) and through-put (how many programs we're able to complete in a unit of time, say per hour). Is this solely due to the fact that we can choose to parallelize processing of multiple programs / requests leading to overheads like context switches & sharing of caches which make the start-to-end response time per process to be worse? Or am I missing something here?
In terms of single instructions in a superscalar pipelined out-of-order exec CPU, throughput vs. latency is very important because the CPU is trying to extract parallelism from an instruction stream that has to be executed as if in serial program order. See Assembly - How to score a CPU instruction by latency and throughput and the bottom of my answer on latency vs throughput in intel intrinsics for example.
In terms of OS decisions that affect throughput vs. latency on a much longer timescale than a few clock cycles, that's a totally separate question.
One of the major factors there is choosing how to use the available physical RAM, and whether to page out (to a swap file) infrequently used code / data to make more room to cache disk files. (e.g. Linux's vm.swappiness is widely considered a key tunable in terms of setting it differently between servers and desktops. https://unix.stackexchange.com/questions/88693/why-is-swappiness-set-to-60-by-default).
If you alt-tab to a window when many pages of that process have been paged out, it will take some time before the process can redraw its window. (Multiple hard page faults can be quite slow, especially if paging on a rotational disk rather than an SSD.) So to optimize for latency, you want the kernel to not aggressively swap out pages from running processes, even if they've been idle for a few hours. Those pages, if they'd been free, could have improved throughput for other processes by acting as buffers / cache.
A related factor is I/O scheduling: trying to group IO requests together to minimize HD seek times (for higher throughput and lower average latency), but sometimes at the expense of delaying a few requests for a longer time (higher worst-case latency). Linux for example has many to choose from, including deadline, Completely Fair Queuing (CFQ), and the original elevator (just grouping requests by locality without consideration of fairness or latency). https://wiki.archlinux.org/title/improving_performance#Input/output_schedulers
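A rough sketch of that trade-off, in Python: compare servicing requests in submission order with a simple one-sweep-up, one-sweep-down elevator. The function names, track numbers, and head position are all hypothetical, and this is not the algorithm of any particular Linux scheduler.

```python
def elevator_order(head, requests):
    """Serve everything at or above the head on the way up, then the rest
    on the way back down (a simple elevator; an illustrative assumption)."""
    up = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    return up + down

def total_seek(head, order):
    """Total head movement, in tracks, when serving `order` starting at `head`."""
    distance = 0
    for track in order:
        distance += abs(track - head)
        head = track
    return distance

requests = [10, 95, 20, 90, 30]   # submission order (made-up track numbers)
head = 50

fifo = list(requests)
swept = elevator_order(head, requests)

print(total_seek(head, fifo))     # 330 tracks of head movement
print(total_seek(head, swept))    # 130 tracks of head movement
print(swept.index(10))            # 4: the first request submitted is served last
```

Total head movement drops from 330 to 130 tracks (better throughput), but the request for track 10, which was submitted first, is now served last (worse worst-case latency).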
CPU scheduling is also a factor: a context-switch hurts throughput, as it takes time itself and caches will likely be cold for the new task on this CPU. You also have to run the kernel's schedule() function to decide which task to run next, so that takes away some time from real work.
To minimize latency (for example between a socket message being sent to a process and it waking up when its poll or select system call returns), you want a short timeslice, like Linux HZ=1000. (Timer interrupts every 1 ms to run the scheduler). And you want to be able to pre-empt even the kernel itself, instead of waiting until the kernel is ready to return to the old user-space to consider the possibility of running a different user-space task.
But neither of these helps throughput; in fact both hurt it (assuming the workload has enough parallelism to not bottleneck on latency). So HZ=100 was the default for "server" Linux builds, vs. 1000 on "desktop" builds tuned for interactive use. (Modern Linux can be "tickless", not using a fixed timer interrupt on every core at all, instead deciding when to schedule the next interrupt on a case by case basis.)
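A back-of-the-envelope model of that trade-off, in Python. It assumes, pessimistically, that every timeslice ends in a context switch with a fixed cost; the 10 µs switch cost (standing in for the switch itself plus cold-cache effects) and the five runnable tasks are made-up numbers, and the function names are hypothetical.

```python
def useful_fraction(timeslice_us, switch_cost_us):
    """Fraction of CPU time spent on real work if every slice ends in a switch."""
    return timeslice_us / (timeslice_us + switch_cost_us)

def worst_case_wait_us(n_runnable, timeslice_us, switch_cost_us):
    """Time a task at the back of a round-robin queue waits for the CPU."""
    return (n_runnable - 1) * (timeslice_us + switch_cost_us)

SWITCH_COST_US = 10   # assumed cost per context switch, not a measurement

for label, slice_us in (("HZ=1000", 1_000), ("HZ=100", 10_000)):
    frac = useful_fraction(slice_us, SWITCH_COST_US)
    wait = worst_case_wait_us(5, slice_us, SWITCH_COST_US)
    print(f"{label}: {frac:.3%} useful work, worst-case wait {wait} us")
```

Under these assumptions the longer slice wastes about ten times less CPU time on switching, while the shorter slice cuts the worst-case wait from roughly 40 ms to roughly 4 ms.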
Real-time kernels take this even further, spending more time on finer-grained locking and stuff like that to enable pausing work and coming back to it later to minimize interrupt latency and other latencies between it being time to do something and actually starting to do that thing. (There are real-time patches for Linux, and there are also totally separate kernels built from the ground up for real-time operation.)
If you have an embedded system controlling a motor or something, you absolutely need hard real-time latency guarantees that it will never take longer than say 1 millisecond from an interrupt pin being asserted to the interrupt handler starting to run.
(Designing the system to make these guarantees possible often comes at the cost of throughput. e.g. obviously you have to pin some memory to make it not swappable, if we're talking about user-space, making it unavailable for cache even if it goes untouched for days.)

Will a process experience starvation?

Consider an operating system with a non-preemptive SJF scheduler. If it is given a workload of say 10 processes, and each process performs a CPU burst which ranges from 10ms to 20ms followed by a 500ms I/O burst, will any of the processes experience starvation?
Working through this, I know that the shortest process is scheduled first and that whichever process is running will run to completion. What I don't understand is how to determine, from this information, whether any process will be postponed indefinitely because it is never allocated the CPU. How can I tell, given the workload and the type of scheduler?
Consider an operating system with a non-preemptive SJF scheduler. If it is given a workload of say 10 processes, and each process performs a CPU burst which ranges from 10ms to 20ms followed by a 500ms I/O burst, will any of the processes experience starvation?
If you define "starvation" as "perpetually not getting any CPU time"; then, with a "shortest job first" algorithm:
a) longer jobs will starve when shorter jobs are created faster than they complete (regardless of how many CPUs there are, they literally can't keep up because new shorter jobs are being created too often).
b1) if the number of tasks that take an infinite amount of time exceeds the number of CPUs and none of the tasks block (e.g. wait for IO), then one or more processes will be starved of CPU time (unless you augment SJF with some form of time sharing to avoid starvation among "always equal length" jobs).
b2) if the number of tasks that take an infinite amount of time exceeds the number of CPUs and some of the tasks do block (e.g. wait for IO), then whether starvation happens or not depends on "sum of time each process is not blocked".
If an SJF scheduler is given a workload of 10 processes and none of them are "infinite length", and no additional new processes are ever created; then all 10 tasks must complete sooner or later and none of the tasks will be perpetually waiting for a CPU.
Of course this doesn't mean some tasks won't have to wait (temporarily, briefly) for a CPU.
Note: for real systems, typically there are lots of infinite length tasks that do block (e.g. for both Windows and Linux, often there's over 100 processes running as services/daemons and for GUI); nobody knows how long any task will take (and not just because the speed of each CPU keeps changing due to power management - e.g. how long will the web browser you're using run for?); often nobody can know if a process will take an infinite amount of time or not (halting problem); and sometimes a process will accidentally loop forever due to a bug. In other words; "shortest job first" is almost always impossible to implement.
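For the specific workload in the question you can also check this with a small simulation. The Python sketch below assumes one CPU, that each burst length is known to the scheduler in advance (which is exactly the unrealistic part), and that every process repeats its CPU burst / I/O burst cycle a fixed number of times; `simulate_sjf` and its parameters are hypothetical names.

```python
import heapq
import random

def simulate_sjf(n_procs=10, cycles=50, io_ms=500, seed=0):
    """Non-preemptive SJF on one CPU; returns (bursts run per process, max wait)."""
    rng = random.Random(seed)
    ready = []     # min-heap of (burst_ms, ready_since_ms, pid): shortest first
    blocked = []   # min-heap of (wake_ms, pid)
    for pid in range(n_procs):
        heapq.heappush(ready, (rng.randint(10, 20), 0, pid))
    now, max_wait = 0, 0
    runs = [0] * n_procs

    while ready or blocked:
        # I/O completions make processes ready again with a fresh CPU burst.
        while blocked and blocked[0][0] <= now:
            wake, pid = heapq.heappop(blocked)
            heapq.heappush(ready, (rng.randint(10, 20), wake, pid))
        if not ready:                      # everyone is doing I/O: CPU idles
            now = blocked[0][0]
            continue
        burst, ready_since, pid = heapq.heappop(ready)
        max_wait = max(max_wait, now - ready_since)
        now += burst                       # non-preemptive: run the whole burst
        runs[pid] += 1
        if runs[pid] < cycles:
            heapq.heappush(blocked, (now + io_ms, pid))
    return runs, max_wait

runs, max_wait = simulate_sjf()
print(runs, max_wait)
```

Every process completes all of its bursts, and the longest wait never exceeds 9 × 20 ms = 180 ms: the other nine bursts add up to less than one 500 ms I/O burst, so no process can come back and jump the queue a second time before the waiting process has run.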

Can Multiprocessor CPUs avoid context-switching?

Today's computer architectures try to maximize the number of registers. It is faster to access a register (which is an integrated memory circuit near the CPU) than to access the first-level cache. The problem is that each context switch has to save all registers into cache, because the next thread needs other register values. What a modern CPU does is cycle through 100 tasks in one second, and every time it saves the registers and fetches the old ones before the task can be started.
IMHO it would be nice to use one CPU for one task, so that no context switching happens. That means we get 100 CPUs, each with 1000 registers that never have to be saved. Is that possible, or have I ignored an important detail?
The only way to completely avoid context switching is by having at least as many cores as there are tasks. Generally, there is no guarantee regarding the maximum number of tasks that may run. Current GPUs and manycore processors and co-processors contain hundreds of small cores. If you put multiple of these things in the same system or in a cluster of systems, you can have thousands or more cores. Still, even if you could avoid context switching with such design, these cores are much slower than the traditional high-end CPU cores, so the net effect might be negative.
But let's take a step back here. The number of context switches is not primarily determined by the number of tasks and cores. Tasks don't just perform computations, they also need to interact with I/O devices and wait for things to happen such as results from other tasks or user input. So some tasks would be in a wait state. The overhead of context switching depends on not only the number of tasks but also the behavior of these tasks.
Both processor architects and OS developers are aware of context switching overhead and employ a variety of techniques to alleviate it. For example, x86 provides a number of instructions that are tuned to saving the context (partially) of the current task. The OS thread scheduler uses techniques such as priorities, preemption (with possibly large time slices on servers), and priority boosting. All of these help reduce the number of context switches and therefore their overall overhead. In addition, reducing the overhead of context switching is not the only thing that matters. In particular, the responsiveness of the system is very important as well, which is at odds with that overhead.

What are some of the advantages and disadvantages of user mode and kernel mode

In an Operating System, threads are typically handled in user mode or kernel mode. What are some of the advantages and disadvantages of each?
User-mode threads are scheduled in user mode by something in the process, and the process itself is the only thing handled by the kernel scheduler.
That means your process gets a certain amount of grunt from the CPU and you have to share it amongst all your user mode threads.
Simple case, you have two processes, one with a single thread and one with a hundred threads.
With a simplistic kernel scheduling policy, the thread in the single-thread process gets 50% of the CPU and each thread in the hundred-thread process gets 0.5%.
With kernel mode threads, the kernel itself manages your threads and schedules them independently. Using the same simplistic scheduler, each thread would get just a touch under 1% of the CPU grunt (101 threads to share the 100% of CPU).
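The arithmetic for both cases, as a small Python sketch. It assumes the same simplistic "equal share" scheduler as above, and the function names are made up for illustration.

```python
def shares_user_threads(threads_per_process):
    """Kernel splits the CPU equally between processes; each process then
    splits its own share equally between its user-mode threads."""
    n_procs = len(threads_per_process)
    return [100 / n_procs / t for t in threads_per_process]

def shares_kernel_threads(threads_per_process):
    """Kernel splits the CPU equally between all threads, whatever their process."""
    total_threads = sum(threads_per_process)
    return [100 / total_threads for _ in threads_per_process]

# Per-thread CPU percentage for a 1-thread process and a 100-thread process.
print(shares_user_threads([1, 100]))     # [50.0, 0.5]
print(shares_kernel_threads([1, 100]))   # both 100/101, about 0.99
```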
In an Operating System, threads are typically handled in user mode or kernel mode.
Typically threads are handled in kernel mode.
What are some of the advantages and disadvantages of each?
In theory, the advantage of handling threads in user mode is that it avoids the cost of switching to/from kernel when a thread needs to wait for something (which can be relatively expensive as it involves privilege level switches). In practice this "advantage" often doesn't happen because the thread has to switch to kernel anyway, to ask kernel to do whatever the thread would wait for (e.g. switching to kernel to ask it to read data from a file and then returning to user-space to block/wait instead of blocking/waiting in the kernel while you're already in the kernel). Mostly; it only helps if the kernel isn't involved at all, which only really happens when user-space threads communicate with or share locks with other threads in the same process.
The advantage of handling threads in kernel is that the kernel can support thread priorities properly. For example, if you have two processes that both have a very high priority thread and a very low priority thread; then kernel can make sure CPU time is given to the high priority thread/s when possible (including pre-empting low priority threads when a high priority thread unblocks) because it knows about all threads; but user-space can't do this - one process doesn't know about threads belonging to a different process, so user threading will get it wrong and ruin performance (one process giving CPU time to its own very low priority thread while a very high priority thread belonging to a different process needs the CPU and doesn't get it).
The other advantage of handling threads in the kernel is that (especially for systems with multiple CPUs) the kernel has access to better information and can make smarter scheduling decisions. This includes balancing the load (from any number of processes) across all CPUs while taking into account "CPU topology" (NUMA, SMT, etc; possibly including heterogeneous CPUs - e.g. "big.LITTLE" arrangements); and making trade-offs between thread priorities, CPU temperatures and power consumption (e.g. if one of the CPUs is getting too hot, reduce that CPU's clock speed to let it cool down and use it for low priority threads so that the performance of high priority threads isn't affected).

Why schedule processes?

I am trying to understand the following reasons why we must schedule processes:
❖ Bursts of CPU usage alternate with periods of I/O wait
❖ Some processes are CPU-bound: they don’t make many I/O requests
❖ Other processes are I/O-bound and make many kernel requests
I am a little confused about the terminology being used in this case. They refer to bursts of CPU usage in periods of I/O wait. What does that mean? Does it mean that when some set of instructions is being executed and is blocked by an unexpected I/O resource, these instructions somehow remain idle in the CPU?