Who actually carries out the scheduling in a system?

I came across the fact that processes waiting in the ready queue are given control of the CPU by the scheduler. The scheduler selects a process according to its scheduling algorithm, hands that process control of the CPU, and later preempts it if the algorithm is preemptive. What I would like to know is: if the CPU's processing unit is being used by a process, who exactly preempts and schedules the processes, given that the processing unit is not available?

Now I want to share my thoughts about the OS (sorry, my English is not very fluent).
What do you think the OS is? Do you think it is 'active'?
No. In my opinion, the OS is just a pile of inert code sitting in memory, and much of that code consists of interrupt handler functions (we simply call this code the 'kernel source code').
OK, now the CPU is executing process A, and suddenly an interrupt occurs. The interrupt may be caused by the timer or by a read system call; either way, an interrupt occurs. The CPU then jumps to the corresponding interrupt handler function (it jumps because the CPU's hardware is designed to do exactly that). As said previously, this interrupt handler function is part of the OS kernel source code.
The CPU executes that code. And what does that code do? It schedules, and the CPU then executes whichever process it selects.
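To make that concrete, here is a rough, hypothetical sketch in C of what such a path might look like. None of these names (timer_interrupt_handler, pick_next, context_switch) come from a real kernel; they are purely illustrative, and the real entry/exit and register save/restore would be hand-written assembler.

/* Hypothetical sketch: the "dead code" the CPU jumps into on a timer interrupt.
 * Nothing here runs on its own; the hardware forces the CPU into the handler,
 * and the handler may decide to switch processes. */
struct process;                          /* opaque process descriptor               */
extern struct process *current;          /* the process that was just interrupted   */
extern struct process *pick_next(void);  /* the scheduling algorithm (illustrative) */
extern void context_switch(struct process *from, struct process *to);

void timer_interrupt_handler(void)       /* entered only because of the interrupt   */
{
    struct process *next = pick_next();  /* run the scheduling policy               */
    if (next != current)                 /* switch only if a different process      */
        context_switch(current, next);   /* should run next                         */
    /* returning from the handler resumes whichever process is now current */
}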

Everything happens in the context of a process (Linux calls these lightweight processes, but it's the same thing).
Process scheduling generally occurs either as part of a system service call or as part of an interrupt.
In the case of a system service call, the process may determine that it cannot continue executing, so it invokes the scheduler to switch the context to another process.
The OS also programs timer interrupts in which it can do scheduling. Scheduling can occur in other types of interrupts as well. Interrupts are handled in the context of the current process.
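For example (a hypothetical sketch with invented names such as data_available() and schedule(), not any particular kernel's code), the system-service-call case often looks like this: the process discovers it cannot continue and invokes the scheduler itself.

/* Sketch of the system-call case: the caller blocks and calls the scheduler. */
struct process { int state; };
enum { RUNNING, WAITING };

extern struct process *current;          /* process executing this system call      */
extern int  data_available(void);        /* illustrative condition                  */
extern void schedule(void);              /* picks and switches to another process   */

long sys_read_blocking(void)
{
    while (!data_available()) {
        current->state = WAITING;        /* we can make no progress right now       */
        schedule();                      /* give the CPU to someone else;           */
                                         /* execution resumes here once woken       */
    }
    /* ... copy the data to the caller ... */
    return 0;
}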

Related

Where does the scheduler run?

Having just finished a book on computer architecture, I find myself not completely clear on where the scheduler runs.
What I'm looking to have clarified is where the scheduler runs: does it have its own core assigned to run that and nothing else, or is the "scheduler" in fact just a more ambiguous algorithm that is implemented in every thread being executed, e.g. upon preemption of a thread, a swithToFrom() routine is run?
I don't need specifics for Windows x/Linux x/Mac OS X, just in general.
No, the scheduler does not run on its own core. In fact, multi-threading was common long before multi-core CPUs were.
The best way to see how scheduler code interacts with thread code is to start with a simple, cooperative, single-core example.
Suppose thread A is running and thread B is waiting on an event. thread A posts that event, which causes thread B to become runnable. The event logic has to call the scheduler, and, for the purposes of this example, we assume that it decides to switch to thread B. At this point in time the call stack will look something like this:
thread_A_main()
post_event(...)
scheduler(...)
switch_threads(threadA, threadB)
switch_threads will save the CPU state on the stack, save thread A's stack pointer, and load the CPU stack pointer with the value of thread B's stack pointer. It will then load the rest of the CPU state from the stack, where the stack is now stack B. At this point, the call stack has become
thread_B_main()
wait_on_event(...)
scheduler(...)
switch_threads(threadB, threadC)
In other words, thread B has now woken up in the state it was in when it previously yielded control to thread C. When switch_threads() returns, it returns control to thread B.
These kinds of manipulations of the stack pointer usually require some hand-coded assembler.
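The same save/switch/restore dance can be imitated in user space. Below is a small, self-contained sketch using the POSIX ucontext API (getcontext/makecontext/swapcontext), which plays the role of switch_threads() without hand-written assembler; it is only an analogy for the mechanism described above, not how a kernel actually implements it.

/* Cooperative switching sketch using POSIX ucontext instead of assembler. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_b;
static char stack_b[64 * 1024];            /* "thread B"'s private stack          */

static void thread_B_main(void) {
    puts("thread B: running, yielding back");
    swapcontext(&ctx_b, &ctx_main);        /* save B's state, resume main         */
    puts("thread B: woken up again");
}

int main(void) {
    getcontext(&ctx_b);                    /* initialise a context for B          */
    ctx_b.uc_stack.ss_sp   = stack_b;
    ctx_b.uc_stack.ss_size = sizeof stack_b;
    ctx_b.uc_link          = &ctx_main;    /* where to go when B's function ends  */
    makecontext(&ctx_b, thread_B_main, 0);

    puts("main: switching to thread B");
    swapcontext(&ctx_main, &ctx_b);        /* like switch_threads(main, B)        */
    puts("main: back, switching to B once more");
    swapcontext(&ctx_main, &ctx_b);        /* B resumes right after its own swap  */
    puts("main: done");
    return 0;
}

Each swapcontext() saves the caller's CPU state and loads the other context's state, which is exactly the kind of stack-pointer juggling switch_threads() does with hand-coded assembler.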
Add Interrupts
Thread B is running and a timer interrupt occurs. The call stack is now
thread_B_main()
foo() //something thread B was up to
interrupt_shell
timer_isr()
interrupt_shell is a special function. It is not called. It is preemptively invoked by the hardware. foo() did not call interrupt_shell, so when interrupt_shell returns control to foo(), it must restore the CPU state exactly. This is different from a normal function, which returns leaving the CPU state according to calling conventions. Since interrupt_shell follows different rules to those stated by the calling conventions, it too must be written in assembler.
The main job of interrupt_shell is to identify the source of the interrupt and call the appropriate interrupt service routine (ISR) which in this case is timer_isr(), then control is returned to the running thread.
Add preemptive thread switches
Suppose the timer_isr() decides that it's time for a time-slice, and thread D is to be given some CPU time:
thread_B_main()
foo() //something thread B was up to
interrupt_shell
timer_isr()
scheduler()
Now, scheduler() can't call switch_threads() at this point because we are in interrupt context. However, it can be called soon after, usually as the last thing interrupt_shell does. This leaves the thread B stack saved in this state
thread_B_main()
foo() //something thread B was up to
interrupt_shell
switch_threads(threadB, threadD)
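Putting those pieces together, a hypothetical interrupt_shell might look roughly like the sketch below. The names (next_thread, dispatch_isr, save_cpu_state, ...) are mine and purely illustrative, and in reality the state save/restore is hand-coded assembler; the point is only that switch_threads() runs at the very end, after the ISR has finished.

/* Sketch of an interrupt entry "shell"; order of operations only. */
struct thread;
extern struct thread *current_thread;
extern struct thread *next_thread;        /* set by scheduler() if a switch is wanted      */

extern void save_cpu_state(void);         /* really assembler                              */
extern void restore_cpu_state(void);      /* really assembler                              */
extern void dispatch_isr(void);           /* identifies the source, calls e.g. timer_isr() */
extern void switch_threads(struct thread *from, struct thread *to);

void interrupt_shell(void)
{
    save_cpu_state();                     /* foo() must see its registers unchanged        */
    dispatch_isr();                       /* timer_isr() may call scheduler(), which       */
                                          /* records its decision in next_thread           */
    if (next_thread != current_thread)
        switch_threads(current_thread, next_thread);   /* last thing before returning      */
    restore_cpu_state();                  /* resume the (possibly different) thread        */
}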
Add Deferred Service Routines
Some OSes do not allow you to do complex logic like scheduling from within ISRs. One solution is to use a deferred service routine (DSR) which runs at a higher priority than threads but lower than interrupts. These are used so that while scheduler() still needs to be protected from being preempted by DSRs, ISRs can be executed without a problem. This reduces the number of places a kernel has to mask (switch off) interrupts to keep its logic consistent.
I once ported some software from an OS that had DSRs to one that didn't. The simple solution to this was to create a "DSR thread" that ran higher priority than all other threads. The "DSR thread" simply replaces the DSR dispatcher that the other OS used.
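As a user-space illustration of that trick (the names and the pthread framing are mine, not from either OS, and in a real kernel an ISR could not take a mutex; this is only the shape of the idea), a "DSR thread" is essentially a highest-priority thread that sleeps on a queue of deferred routines and runs them outside interrupt context.

/* Sketch of a "DSR thread": ISRs post work, the thread runs it later. */
#include <pthread.h>

typedef void (*dsr_fn)(void);
#define QMAX 16
static dsr_fn queue[QMAX];                 /* tiny ring buffer, no overflow handling  */
static unsigned qhead, qtail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

void dsr_post(dsr_fn fn)                   /* what an ISR would call                  */
{
    pthread_mutex_lock(&lock);
    queue[qtail++ % QMAX] = fn;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

void *dsr_thread(void *arg)                /* created at a priority above all others  */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (qhead == qtail)
            pthread_cond_wait(&nonempty, &lock);
        dsr_fn fn = queue[qhead++ % QMAX];
        pthread_mutex_unlock(&lock);
        fn();                              /* deferred work runs in thread context    */
    }
    return NULL;
}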
Add traps
You may have observed that in the examples I've given so far, we call the scheduler from both thread and interrupt contexts. There are two ways in and two ways out. It looks a bit weird, but it does work. However, moving forward, we may want to isolate our thread code from our kernel code, and we do this with traps. Here is the event posting redone with traps:
thread_A_main()
post_event(...)
user_space_scheduler(...)
trap()
interrupt_shell
kernel_space_scheduler(...)
switch_threads(threadA, threadB)
A trap causes an interrupt or an interrupt-like event. On the ARM CPU they are known as "software interrupts" and this is a good description.
Now all calls to switch_threads() begin and end in interrupt context, which, incidentally, usually happens in a special CPU mode. This is a step towards privilege separation.
As you can see, scheduling wasn't built in a day. You could go on:
Add a memory mapper
Add processes
Add multiple Cores
Add hyperthreading
Add virtualization
Happy reading!
Each core is separately running the kernel, and cooperates with other cores by reading / writing shared memory. One of the shared data structures maintained by the kernel is the list of tasks that are ready to run, and are just waiting for a timeslice to run in.
The kernel's process / thread scheduler runs on the core that needs to figure out what to do next. It's a distributed algorithm with no single decision-making thread.
Scheduling doesn't work by figuring out what task should run on which other CPU. It works by figuring out what this CPU should do now, based on which tasks are ready to run. This happens whenever a thread uses up its timeslice, or makes a system call that blocks. In Linux, even the kernel itself is pre-emptible, so a high-priority task can be run even in the middle of a system call that takes a lot of CPU time to handle. (e.g. checking the permissions on all the parent directories in an open("/a/b/c/d/e/f/g/h/file", ...), if they're hot in VFS cache so it doesn't block, just uses a lot of CPU time).
I'm not sure if this is done by having the directory-walking loop in (a function called by) open() "manually" call schedule() to see if the current thread should be pre-empted or not. Or maybe tasks waking up will have set some kind of hardware timer to fire an interrupt, and the kernel in general is pre-emptible if compiled with CONFIG_PREEMPT.
There's an inter-processor interrupt mechanism to ask another core to schedule something on itself, so the above description is an over-simplification. (e.g. for Linux run_on to support RCU sync points, and TLB shootdowns when a thread on another core uses munmap). But it's true that there isn't one "master control program"; generally the kernel on each core decides what that core should be running. (By running the same schedule() function on a shared data-structure of tasks that are ready to run.)
The scheduler's decision-making is not always as simple as taking the task at the front of the queue: a good scheduler will try to avoid bouncing a thread from one core to another (because its data will be hot in the caches of the core it was last running on, if that was recent). So to avoid cache thrashing, a scheduler algorithm might choose not to run a ready task on the current core if it was just running on a different core, instead leaving it for that other core to get to later. That way a brief interrupt-handler or blocking system call wouldn't result in a CPU migration.
This is especially important in a NUMA system, where running on the "wrong" core will be slower long-term, even once the caches populate.
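As a toy sketch of that kind of decision (illustrative only, nothing like Linux's actual scheduler code), a per-core pick function might prefer tasks whose working set is still warm in this core's caches:

/* Toy per-core pick: prefer tasks that last ran on this core. */
struct task {
    int last_cpu;                          /* core this task last ran on            */
    int ready;                             /* 1 if runnable                         */
};

#define NTASKS 64
extern struct task ready_tasks[NTASKS];    /* shared ready list (illustrative)      */

struct task *pick_next_for(int this_cpu)
{
    struct task *fallback = 0;
    for (int i = 0; i < NTASKS; i++) {
        struct task *t = &ready_tasks[i];
        if (!t->ready)
            continue;
        if (t->last_cpu == this_cpu)       /* cache-hot here: take it               */
            return t;
        if (!fallback)
            fallback = t;                  /* remember a migratable candidate       */
    }
    return fallback;                       /* may be NULL: nothing ready, go idle   */
}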
There are three types of general schedulers:
Job scheduler also known as the Long term scheduler.
Short term scheduler also known as the CPU scheduler.
Medium term scheduler, mostly used to swap jobs in and out so there can be non-blocking calls. This is usually to avoid having too many I/O-bound jobs or too few.
An operating systems book shows a nice automaton of the states these schedulers move jobs between. The job scheduler puts things from the job queue into the ready queue, and the CPU scheduler takes things from the ready queue to the running state. The scheduling algorithm is just like any other software: it must run on a CPU/core, and it is most likely part of the kernel somewhere.
It doesn't make sense for the scheduler itself to be preempted. The jobs inside the queue can be preempted when running, for I/O, etc. No, the kernel does not have to schedule itself to allocate the task; it just gets CPU time without scheduling itself. And yes, the data is most likely in RAM; it is not clear whether it would be worth keeping in the CPU cache.

What is the exact definition of 'process preemption'?

Wikipedia says:
In computing, preemption is the act of temporarily interrupting a task being carried out by a computer system, without requiring its cooperation, and with the intention of resuming the task at a later time.
Other sources say:
[...] preemption means forcefully taking away of the processor from one process and allocating it to another process. [Operating Systems (Self Edition 1.1), Sibsankar Haldar]
Preemption of a program occurs when an interrupt arises during its execution and the scheduler selects some other programs for execution. [Operating Systems: a Concept-based Approach, 2E, D. M. Dhamdhere]
So, what I understood is that we have process preemption if the process is interrupted (by a hardware interrupt, i.e. I/O interrupt or timer interrupt) and the scheduler, invoked after handling the interrupt, selects another process to run (according to the CPU scheduling algorithm). If the scheduler selects the interrupted process we have no process preemption (interrupts do not necessarily cause preemption).
But I found many other sources that define preemption in the following way:
Preemption is the forced deallocation of the CPU from a program. [Operating Systems: a Concept-based Approach, 2E, D. M. Dhamdhere]
You can see that the same book reports two different definitions of preemption. In the latter it is not mentioned that the CPU must be allocated to another process. According to this definition, preemption is just another name for 'interruption'. When a hardware interrupt arises, the process is interrupted (it switches from "Running" to "Ready" state) or preempted.
So my question is: which of the two definitions is correct? I'm quite confused.
The Wikipedia definition is pretty bad. The others are not so good. However, they are all saying essentially the same thing.
Preemption is simply one of the means by which the operating system changes the process executing on a CPU.
Such a change can occur either through the executing process voluntarily yielding the CPU or through the operating system preempting the executing process.
The mechanism for switching processes (context switch) is identical in both methods. The only difference is how the context switch is triggered.
A process can voluntarily yield the CPU when it can no longer execute, e.g. after starting I/O to disk (which will take a long time to complete). Some systems only support voluntary yielding (cooperative multitasking).
If a process is compute-bound, it would hog the CPU, not allowing other processes to execute. Most operating systems therefore use a timer interrupt. If the interrupt handler finds that the current process has executed for at least a specified period of time, and there are other processes that can execute, the OS will switch processes.
Preemption is then a process (or thread) [context] switch on a CPU that is triggered by the operating system rather than by the process (or thread) itself.
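A rough sketch of the timer-driven case (hypothetical names such as timer_tick and switch_to_next, not any specific OS): the tick handler charges the running process for the tick and forces a context switch only once its time slice is used up and another process is ready.

/* Hypothetical tick handler: preemption is the OS-triggered switch. */
#define TIME_SLICE 10                      /* ticks, illustrative value              */

struct process { int ticks_left; };

extern struct process *current;
extern int  runnable_count(void);          /* how many other processes could run     */
extern void switch_to_next(void);          /* context switch chosen by the scheduler */

void timer_tick(void)
{
    if (--current->ticks_left > 0)
        return;                            /* quantum not used up: no preemption     */
    current->ticks_left = TIME_SLICE;      /* recharge the quantum either way        */
    if (runnable_count() > 0)
        switch_to_next();                  /* forced deallocation of the CPU         */
}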

What does the kernel do while another process is running

Consider this: When one task/process is running on a single processor system, another task has to wait for its turn till the first task is either suspended or terminates (depending on the scheduling algorithm).
The kernel also consists of various tasks that use the same CPU to do OS-related work, like scheduling, memory management, responding to system calls, etc.
So when the kernel schedules a particular task/process to give it CPU time, does it relinquish its control over the CPU? I.e., does it momentarily stop? If not, how does it keep on running to do all the OS-related tasks while the other process is running on the CPU? Does the scheduler move aside to give the next task in line the CPU, and if so, what brings the scheduler back to go on with further scheduling activities? This question is similar but it does not contain enough details:
How can kernel run all the time?
I am confused about this part and I can't understand how this would work. Can somebody please explain this in detail? It would be helpful if you could explain it with an example.
Yeah... you should stop thinking of the OS kernel as a process and think of it instead as just code and data: a state machine that processes/threads call into at one end in order to obtain specific services (e.g. I/O requests), and that drivers call into at the other end to provide the results (e.g. I/O completion).
The kernel does not need any threads of execution in itself. It only runs when entered from syscalls (interrupt-like calls from running user threads/processes) or from drivers (hardware interrupts from disk/NIC/keyboard/mouse etc.). Sometimes, such calls will change the set of threads running on the available cores (e.g. if a thread waiting for a network buffer becomes ready because the NIC driver has completed the action, the OS will probably try to assign it to a core 'immediately', preempting some other thread if required).
If there are no syscalls, and no hardware interrupts, the kernel does nothing because it is not entered - there is nothing for it to do.
What you are missing is that few operating systems these days have a monitor process as you are describing.
At the risk of gross oversimplification, operating systems run through exceptions and interrupts.
Assume you have two processes, P and Q. P is the running process and Q is the next to run. One way to switch processes is the system timer goes off triggering an interrupt. P switches to kernel mode and handles that interrupt. P runs the interrupt code handling the timer and determines that Q should run. P then saves its context and loads Q. At that moment, Q is the running process. The interrupt handler exits and picks up where Q was before.
In other words, process P becomes the kernel scheduler while the interrupt is being processed. Each process becomes the scheduler that loads the next process.
Another example, let us say that Q has queued a read operation to a disk. That operation completes and triggers an interrupt. P, the running process, enters kernel mode to handle the interrupt. P then processes Q's disk read operation.

How does a scheduler regain control when wanted?

I'm reading about scheduling, but I can't figure out how a scheduler regains control after it invokes code in the user space.
E.g. the scheduler passes control to some app in user space which runs an infinite loop, and no other hardware interrupt occurs on a single-core chip. All documents talk about the scheduler regaining control and preemptively interrupting the user process, but how does that work if control is never passed back to the OS?
Question: Does the scheduler register with some clock in the CPU to be given control again after X msecs? Or is there some other trick? If not, what C function is called to register for regular (or one-time?) regains of control?
On Windows the Sleep(0) "causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run". This forces the scheduler to gain control.
On Linux the sched_yield "causes the calling thread to relinquish the CPU". This also forces the scheduler to gain control.
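For example, a minimal Linux program that voluntarily hands the CPU back looks like this; whether anything else actually runs afterwards is the scheduler's decision:

#include <sched.h>
#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 3; i++) {
        printf("doing a bit of work (%d)\n", i);
        sched_yield();    /* relinquish the CPU; the scheduler picks what runs next */
    }
    return 0;
}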
The scheduler also gains control through interrupts. When a thread has consumed its quantum of CPU usage, the scheduler reschedules.
Windows CE, for example, allows you to customize the thread quantum.
You may also read Thread Scheduling: quanta, switching and scheduling algorithms.
There is no single scheduler in Windows. Event-based scheduling code is spread across the kernel. The kernel's dispatcher routines are triggered by these events:
Thread ready for execution
Thread quantum expired
Thread priority change
Thread processor affinity change
Wait functions and Sleep functions
This Microsoft presentation summarizes some of the scheduling principles.
If no other interrupt occurred, a preemptive O/S wouldn't despatch, and the user application would loop forever.
This won't happen, though. Typically, a preemptive scheduler will despatch on every system call, every interrupt, and every tick of the system clock. The system clock will always interrupt, so your infinite loop simply won't occur.
The Pick operating system (after its developer Dick Pick) used a non-preemptive scheduler. Software developed for this system was required to make a system call periodically to allow the kernel to despatch other processes. In this environment the kernel would otherwise lose control completely until the process terminated.
The argument used in its justification was that considerable time was spent saving and restoring the processor state during a despatch. Forcing the application to take responsibility for this would allow a faster despatch process.

What happens in the CPU when there is no user code to run?

It sounds reasonable that the OS/RTOS would schedule an "idle task". In that case, wouldn't it be power consuming? (It sounds reasonable that the idle task would execute: while (true) {}.)
This depends on the OS and the CPU architecture. On x86 (Intel compatible) the operating system might execute HLT instructions, making the CPU wait until something interesting happens, such as a hardware interrupt. This supposedly consumes very little power. Operating systems report the time spent doing this as "idle" and may even assign it to a fictional "idle" process.
So, when in the Windows Task Manager you see that the System Idle Process is consuming 90% CPU, what it really means is that the CPU does not have an actual program to run 90% of the time.
Here's a good article on the subject: What does an idle CPU do?
Historically it's been a lot of different schemes, especially before reducing power consumption in idle was an issue.
Generally there is an "idle" process/task that runs at the lowest priority and hence always gets control when there's nothing else to do. Many older systems would simply have this process run a "do forever" loop with nothing of consequence in the loop body. One OS I heard of would run machine diagnostics in the idle process. A number of early PCs would run a memory refresh routine (since memory needed to be cycled regularly or it would "evaporate").
(A benefit of this scheme is that 100% minus the % CPU used by the idle process gives you the % CPU utilization -- a feature that was appreciated by OS designers.)
But the norm on most modern systems is to either run a "halt" or "wait" instruction or have a special flag in the process control block that even more directly tells the processor to simply stop running and go into power-saving mode.
There's always code to run, the idle task is the code if there's nothing else. It may execute a special CPU instruction to power down the CPU until a hardware interrupt arrives. On x86 CPUs it's hlt (halt).
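A bare-metal idle loop on x86 is often little more than the following sketch (freestanding kernel-mode code; hlt needs ring 0, so this is not runnable as an ordinary user program):

/* Sketch of an x86 idle loop. "sti" makes sure interrupts can wake the core,
 * "hlt" stops it until the next interrupt arrives. */
static inline void idle_loop(void)
{
    for (;;)
        __asm__ volatile ("sti; hlt");     /* sleep until an interrupt arrives */
}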
This answer is specific to Windows NT-based OS.
Idle thread functionality
Tasks may vary between architectures, but generally these are the tasks performed by idle threads:
Enable interrupts to allow pending interrupts to be delivered
Disable interrupts again (using the STI and CLI instructions; more on wiki)
On DEBUG (checked) builds, check whether a kernel debugger is attached and allow it to break in if requested
Handle deferred procedure calls
Check if there are any runnable threads ready for execution. If there is one, update the idle processor control block with a pointer to the thread
Check the queues of other processors and, if possible, schedule a thread awaiting execution there on the idle processor
Call a power management routine, which may halt the processor or lower the CPU tick rate and do other similar power-saving activities
Additional info
When there are no runnable threads for a logical processor, Windows executes a kernel-mode idle thread. There is only 1 Idle process that has as many idle threads as there are logical processors. So on a Quad core machine with 4 logical/physical processors, there will be 1 Idle process and 4 idle threads.
In Windows, the Idle process has ID 0, as do all the idle threads. These objects are represented by the standard EPROCESS/KPROCESS and ETHREAD/KTHREAD data structures, but they are not executive-manager process and thread objects. There is no user-land address space, and no user-land code is executed.
The Idle process is statically allocated at system boot time, before the process manager and the object manager are set up. Idle thread structures are allocated dynamically as logical processors are brought online.
Idle thread priority is set to 0. However, this value doesn't actually matter, as this thread only gets executed when there are no other threads available to run. Idle thread priority is never compared with the priority of any other thread.
Idle threads are also special cases for preemption. The idle thread main routine KiIdleLoop (implementation from reactos) performs several tasks that are not interrupted by other threads. When there are no runnable threads available to run on a processor, that processor is marked as idle in its processor control block. Then, if a runnable thread arrives in the queue and is scheduled for execution, a pointer to that thread is stored in the NextThread field of the idle processor control block. While the idle thread runs, this pointer is checked on every iteration of its while loop.
Source: Windows Internals, M. Russinovich, 6th edition, Part 1, pp. 453-456.
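A very rough sketch of the shape of such an idle loop, modeled loosely on the description above (the names this_prcb, process_deferred_calls, enter_low_power_state, etc. are invented; this is not actual Windows or reactos code):

/* Loose sketch of a per-processor idle loop. */
struct thread;
struct prcb {                                   /* per-processor control block (illustrative) */
    struct thread *NextThread;                  /* set when a runnable thread is queued here   */
};

extern struct prcb *this_prcb(void);
extern void deliver_pending_interrupts(void);   /* briefly enable, then disable interrupts     */
extern void process_deferred_calls(void);       /* handle DPCs queued by ISRs                  */
extern void switch_to(struct thread *t);
extern void enter_low_power_state(void);        /* halt or lower the tick rate                 */

void idle_loop(void)
{
    struct prcb *prcb = this_prcb();
    for (;;) {
        deliver_pending_interrupts();
        process_deferred_calls();
        if (prcb->NextThread) {                 /* a thread became runnable on this processor  */
            struct thread *t = prcb->NextThread;
            prcb->NextThread = 0;
            switch_to(t);                       /* stop being idle                             */
        }
        enter_low_power_state();
    }
}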