How does the scheduler know a task is in the blocked state? - rtos

I am reading "An Embedded Software Primer" by David E. Simon.
It discusses an RTOS and its building blocks, the scheduler and the task. It says each task is either in the Ready, Running, or Blocked state. My question is: how does the scheduler determine that a task is in the Blocked state? Assume it's waiting for a semaphore; then presumably the semaphore-get function is in a state where it can't return. Does the scheduler see that a function does not return, and then mark its state as Blocked?

The implementation details will vary by RTOS. Generally, each task has a state variable that identifies whether the task is ready, running, or blocked. The scheduler simply reads the task's state variable to determine whether the task is blocked.
Each task has a set of parameters that determine the state and context of the task. These parameters are often stored in a struct and called the "task control block" (although the implementation varies by RTOS). The ready/run/block state variable may be a part of the task control block.
When the task attempts to get the semaphore and the semaphore is not available, the task will be set to the blocked state. More specifically, the semaphore-get function will change the task from running to blocked, and then the scheduler will be called to determine which task should run next. The scheduler reads through the task state variables and will not run those tasks that are blocked.
When another task eventually sets the semaphore, the task that is blocked on the semaphore is changed from the blocked to the ready state, and the scheduler may be called to determine whether a context switch should occur.
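As a rough illustration of the above, here is a minimal sketch in C, not taken from any particular RTOS: a task control block with a state field, and a scheduler that simply skips blocked tasks. All names (tcb, scheduler_pick_next, and so on) are invented for the example.

    enum task_state { TASK_READY, TASK_RUNNING, TASK_BLOCKED };

    struct tcb {                      /* "task control block" */
        enum task_state state;        /* the ready/run/block state variable */
        int priority;
        void *stack_ptr;              /* saved context would live here too */
    };

    /* The scheduler just reads each task's state variable and never
       picks a blocked task. */
    struct tcb *scheduler_pick_next(struct tcb tasks[], int n)
    {
        struct tcb *best = 0;
        for (int i = 0; i < n; i++) {
            if (tasks[i].state == TASK_BLOCKED)
                continue;                      /* never run blocked tasks */
            if (best == 0 || tasks[i].priority > best->priority)
                best = &tasks[i];
        }
        return best;    /* NULL only if every task is blocked (see idle task) */
    }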

As I'm writing an RTOS ( http://distortos.org/ ), I thought I might chime in.
A variable holding the state of each thread is indeed usually present in RTOSes, and that includes mine:
https://github.com/DISTORTEC/distortos/blob/master/include/distortos/ThreadState.hpp#L26
https://github.com/DISTORTEC/distortos/blob/master/include/distortos/internal/scheduler/ThreadControlBlock.hpp#L329
However, this variable is usually used only as a debugging aid or for additional checks (like preventing you from starting a thread that is already started).
In RTOSes targeted at deeply embedded systems, the distinction between ready and blocked is usually made using the containers that hold the threads. The threads are usually "chained" into linked lists, typically sorted by priority and insertion time. The scheduler has its own list of threads that are "ready" ( https://github.com/DISTORTEC/distortos/blob/master/include/distortos/internal/scheduler/Scheduler.hpp#L340 ). Each synchronization object (like a semaphore) also has its own list of threads that are "blocked" waiting for this object ( https://github.com/DISTORTEC/distortos/blob/master/include/distortos/Semaphore.hpp#L244 ). When a thread attempts to use a semaphore that is currently not available, it is simply moved from the scheduler's "ready" list to the semaphore's "blocked" list ( https://github.com/DISTORTEC/distortos/blob/master/source/synchronization/Semaphore.cpp#L82 ). The scheduler doesn't need to decide anything, as now - from the scheduler's perspective - this thread is just gone. When the semaphore is later released by another thread, the first thread waiting on the semaphore's "blocked" list is moved back to the scheduler's "ready" list ( https://github.com/DISTORTEC/distortos/blob/master/source/synchronization/Semaphore.cpp#L39 ).
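A condensed sketch of that list-based approach (in C rather than distortos's C++, with all helper names invented for the example): "blocked" is not a flag the scheduler tests, it is membership in some semaphore's wait list.

    struct thread {
        struct thread *next;          /* intrusive list link */
    };

    struct list { struct thread *head; };

    static void list_push(struct list *l, struct thread *t)  /* append at tail */
    {
        struct thread **p = &l->head;
        while (*p)
            p = &(*p)->next;
        t->next = 0;
        *p = t;
    }

    static struct thread *list_pop(struct list *l)            /* take from head */
    {
        struct thread *t = l->head;
        if (t)
            l->head = t->next;
        return t;
    }

    static void list_remove(struct list *l, struct thread *t)
    {
        struct thread **p = &l->head;
        while (*p && *p != t)
            p = &(*p)->next;
        if (*p)
            *p = t->next;
    }

    struct semaphore {
        int value;
        struct list blocked;      /* threads "blocked" on this semaphore */
    };

    struct list ready;            /* the scheduler's list of "ready" threads */
    struct thread *current;       /* by convention, the head of `ready` runs */

    void sem_wait(struct semaphore *s)
    {
        if (s->value > 0) {
            s->value--;           /* semaphore available: nothing to do */
            return;
        }
        /* Not available: move `current` from the scheduler's "ready" list
           to this semaphore's "blocked" list. From the scheduler's
           perspective, this thread is now simply gone. */
        list_remove(&ready, current);
        list_push(&s->blocked, current);
        /* a real kernel would now switch to the new head of `ready` */
    }

    void sem_post(struct semaphore *s)
    {
        struct thread *t = list_pop(&s->blocked);
        if (t)
            list_push(&ready, t); /* blocked -> ready again */
        else
            s->value++;
    }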
Usually there's no need to make a special distinction between the threads that are ready and the thread that is actually running. As the number of threads that can actually run is fixed and equal to the number of available CPU cores, all you need is a pointer per CPU core which points to the thread from the "ready" list that is running on that core at that moment. In my system I do the same - the thread at the head of the "ready" list is the one that is running, but I also maintain an iterator which points to that thread ( https://github.com/DISTORTEC/distortos/blob/master/include/distortos/internal/scheduler/Scheduler.hpp#L337 ). You could have a separate list for running threads, but in most cases it would be a waste of space (there's usually just one) and it makes other things slightly more complicated.
I've actually written an article about thread states and their transitions, if you're interested - http://distortos.org/documentation/task-states/ . The article makes no special distinction between the thread that is "ready" and the one that is actually running. I don't consider this distinction to be useful for anything, as long as you have some other means of telling which of the "ready" threads is running.

Related

What happens to a process and/or thread while it waits on a mutex?

When a process or thread waits on a mutex, which state is the process or thread in? Is it in the WAIT or READY state, or some other state? I tried to search for the answer on the web but could not find a clear, definitive answer; maybe there isn't one, or maybe there is. To find out, I am posting this question here.
tl;dr: Nothing happens while it is waiting; it is simply a kernel data structure.
Without loss of generality, all operating systems have some sort of model in which a unit of execution (a task) moves between the states Ready, Running, and Waiting. The task has a data structure associated with it, in which its state and registers (among other things) are recorded.
When a task moves from Ready to Running, its saved registers are loaded on a cpu, and it continues execution from its last saved state. Initially, its saved registers are set to reasonable values for the program to start.
From Running to Waiting or Ready, its registers are stored in its task data structure, and this structure is placed on either a list of Ready or Waiting tasks.
From Waiting to Ready, the task data structure is removed from the Waiting list and appended to the Ready list.
When a task tries to acquire a mutex that is unavailable, it moves from Running (how else could it try to get the mutex) to Waiting. If the mutex was available, it remains Running, and the mutex becomes unavailable.
When a task releases a mutex, and another task is Waiting for that mutex, the Waiting task becomes Ready, and acquires the mutex. If many tasks are Waiting, one is chosen to acquire the mutex and become Ready, the rest remain Waiting.
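Here is a hedged sketch of those transitions in C; names like `current`, `ready_list`, `enqueue`, and `schedule` are placeholders, not any real kernel's API:

    typedef enum { READY, RUNNING, WAITING } state_t;

    typedef struct task {
        state_t state;
        struct task *next;             /* link for the Ready/Waiting lists */
    } task_t;

    typedef struct {
        int locked;
        task_t *waiters;               /* tasks Waiting on this mutex */
    } mutex_t;

    extern task_t *current;            /* the Running task on this CPU */
    extern task_t *ready_list;
    extern void enqueue(task_t **list, task_t *t);
    extern task_t *dequeue(task_t **list);
    extern void schedule(void);        /* pick the next Ready task to Run */

    void mutex_lock(mutex_t *m)
    {
        if (!m->locked) {
            m->locked = 1;             /* available: caller stays Running */
            return;
        }
        current->state = WAITING;      /* Running -> Waiting */
        enqueue(&m->waiters, current);
        schedule();                    /* nothing "happens" to us until wakeup */
    }

    void mutex_unlock(mutex_t *m)
    {
        task_t *t = dequeue(&m->waiters);
        if (t) {
            t->state = READY;          /* Waiting -> Ready; t now owns the mutex */
            enqueue(&ready_list, t);
        } else {
            m->locked = 0;
        }
    }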
This is a very abstract description; real systems are complicated by a plurality of synchronization mechanisms (mailbox, queue, semaphore, pipe, ...), a desire to optimise the various paths, and the utilization of multiple CPUs.

Where does the scheduler run?

Having just finished a book on computer architecture, I find myself still not completely clear on where the scheduler runs.
What I'm looking to have clarified is where the scheduler runs - does it have its own core assigned to run that and nothing else, or is the "scheduler" in fact just a more ambiguous algorithm that is implemented in every thread being executed - e.g. upon preemption of a thread, a switchToFrom() routine is run?
I don't need specifics for Windows / Linux / macOS, just the general idea.
No, the scheduler does not run on its own core. In fact, multi-threading was common long before multi-core CPUs were.
The best way to see how scheduler code interacts with thread code is to start with a simple, cooperative, single-core example.
Suppose thread A is running and thread B is waiting on an event. Thread A posts that event, which causes thread B to become runnable. The event logic has to call the scheduler and, for the purposes of this example, we assume it decides to switch to thread B. At this point in time the call stack will look something like this:
thread_A_main()
  post_event(...)
    scheduler(...)
      switch_threads(threadA, threadB)
switch_threads will save the CPU state on the stack, save thread A's stack pointer, and load the CPU stack pointer with the value of thread B's stack pointer. It will then load the rest of the CPU state from the stack, where the stack is now stack B. At this point, the call stack has become:
thread_B_main()
  wait_on_event(...)
    scheduler(...)
      switch_threads(threadB, threadC)
In other words, thread B has now woken up in the state it was in when it previously yielded control to thread C. When switch_threads() returns, it returns control to thread B.
These kinds of manipulations of the stack pointer usually require some hand-coded assembler.
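To make the idea concrete, here is a toy cooperative switch in C using setjmp/longjmp, which save and restore the callee-saved registers and the stack pointer. This is only a sketch: a real kernel uses a few lines of assembly instead, and even this portable trick relies on each thread's stack having been set up beforehand by non-portable code.

    #include <setjmp.h>

    typedef struct {
        jmp_buf context;    /* saved registers, including the stack pointer */
    } thread_t;

    void switch_threads(thread_t *from, thread_t *to)
    {
        /* setjmp returns 0 when saving; non-zero when resumed later */
        if (setjmp(from->context) == 0)
            longjmp(to->context, 1);   /* does not return: we are now `to` */
        /* When some other thread eventually switches back to `from`,
           execution resumes here and we return to the caller, exactly
           as described above. */
    }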
Add Interrupts
Thread B is running and a timer interrupt occurs. The call stack is now:
thread_B_main()
  foo()              // something thread B was up to
    interrupt_shell
      timer_isr()
interrupt_shell is a special function. It is not called; it is preemptively invoked by the hardware. foo() did not call interrupt_shell, so when interrupt_shell returns control to foo(), it must restore the CPU state exactly. This is different from a normal function, which returns leaving the CPU state according to the calling conventions. Since interrupt_shell follows different rules from those stated by the calling conventions, it too must be written in assembler.
The main job of interrupt_shell is to identify the source of the interrupt and call the appropriate interrupt service routine (ISR), in this case timer_isr(); control is then returned to the running thread.
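Roughly, the C-level part of interrupt_shell might look like the sketch below; `read_irq_source` and `isr_table` are hypothetical names, and the register save/restore around it would be assembly:

    typedef void (*isr_t)(void);

    extern isr_t isr_table[];               /* one handler per interrupt source */
    extern unsigned read_irq_source(void);  /* query the interrupt controller */

    /* Called from the assembly prologue after the full CPU state is saved. */
    void interrupt_shell_body(void)
    {
        unsigned src = read_irq_source();   /* identify the source */
        isr_table[src]();                   /* e.g. timer_isr() */
        /* The assembly epilogue restores the saved CPU state exactly and
           returns to the interrupted code - or switches threads first,
           as shown in the next section. */
    }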
Add preemptive thread switches
Suppose the timer_isr() decides that it's time for a time-slice, and thread D is to be given some CPU time:
thread_B_main()
  foo()              // something thread B was up to
    interrupt_shell
      timer_isr()
        scheduler()
Now, scheduler() can't call switch_threads() at this point because we are in interrupt context. However, it can be called soon after, usually as the last thing interrupt_shell does. This leaves thread B's stack saved in this state:
thread_B_main()
  foo()              // something thread B was up to
    interrupt_shell
      switch_threads(threadB, threadD)
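A sketch of that deferral, with all names invented for the example: in interrupt context the scheduler only records its decision, and interrupt_shell performs the switch as its last act:

    typedef struct thread thread_t;

    extern thread_t *current_thread;
    extern thread_t *pick_next_thread(void);
    extern void switch_threads(thread_t *from, thread_t *to);

    static thread_t *next_thread;
    static int reschedule_pending;

    /* Called by timer_isr(): decide, but don't switch yet. */
    void scheduler_from_isr(void)
    {
        next_thread = pick_next_thread();   /* e.g. thread D */
        reschedule_pending = (next_thread != current_thread);
    }

    /* The last thing interrupt_shell does before restoring CPU state. */
    void interrupt_exit(void)
    {
        if (reschedule_pending) {
            reschedule_pending = 0;
            switch_threads(current_thread, next_thread);
        }
    }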
Add Deferred Service Routines
Some OSes do not allow you to do complex logic like scheduling from within ISRs. One solution is to use a deferred service routine (DSR), which runs at a higher priority than threads but lower than interrupts. These are used so that, while scheduler() still needs to be protected from being preempted by DSRs, ISRs can execute without a problem. This reduces the number of places where the kernel has to mask (switch off) interrupts to keep its logic consistent.
I once ported some software from an OS that had DSRs to one that didn't. The simple solution was to create a "DSR thread" that ran at a higher priority than all other threads. The "DSR thread" simply replaces the DSR dispatcher that the other OS used.
Add traps
You may have observed that in the examples given so far we call the scheduler from both thread and interrupt contexts. There are two ways in and two ways out. It looks a bit weird, but it does work. However, moving forward, we may want to isolate our thread code from our kernel code, and we do this with traps. Here is the event posting redone with traps:
thread_A_main()
  post_event(...)
    user_space_scheduler(...)
      trap()
        interrupt_shell
          kernel_space_scheduler(...)
            switch_threads(threadA, threadB)
A trap causes an interrupt or an interrupt-like event. On ARM CPUs they are known as "software interrupts", and this is a good description.
Now all calls to switch_threads() begin and end in interrupt context, which, incidentally, usually happens in a special CPU mode. This is a step towards privilege separation.
As you can see, scheduling wasn't built in a day. You could go on:
Add a memory mapper
Add processes
Add multiple Cores
Add hyperthreading
Add virtualization
Happy reading!
Each core is separately running the kernel, and cooperates with other cores by reading / writing shared memory. One of the shared data structures maintained by the kernel is the list of tasks that are ready to run, and are just waiting for a timeslice to run in.
The kernel's process / thread scheduler runs on the core that needs to figure out what to do next. It's a distributed algorithm with no single decision-making thread.
Scheduling doesn't work by figuring out what task should run on which other CPU. It works by figuring out what this CPU should do now, based on which tasks are ready to run. This happens whenever a thread uses up its timeslice or makes a system call that blocks. In Linux, even the kernel itself is pre-emptible, so a high-priority task can run even in the middle of a system call that takes a lot of CPU time to handle (e.g. checking the permissions on all the parent directories in an open("/a/b/c/d/e/f/g/h/file", ...), if they're hot in the VFS cache, so it doesn't block but uses a lot of CPU time).
I'm not sure whether this is done by having the directory-walking loop in (a function called by) open() "manually" call schedule() to see if the current thread should be pre-empted. Or maybe it's just that tasks waking up will have set some kind of hardware timer to fire an interrupt, and the kernel in general is pre-emptible if compiled with CONFIG_PREEMPT.
There's an inter-processor interrupt mechanism to ask another core to schedule something on itself, so the above description is an over-simplification. (e.g. for Linux run_on to support RCU sync points, and TLB shootdowns when a thread on another core uses munmap). But it's true that there isn't one "master control program"; generally the kernel on each core decides what that core should be running. (By running the same schedule() function on a shared data-structure of tasks that are ready to run.)
The scheduler's decision-making is not always as simple as taking the task at the front of the queue: a good scheduler will try to avoid bouncing a thread from one core to another (because its data will be hot in the caches of the core it was last running on, if that was recent). So to avoid cache thrashing, a scheduler algorithm might choose not to run a ready task on the current core if it was just running on a different core, instead leaving it for that other core to get to later. That way a brief interrupt-handler or blocking system call wouldn't result in a CPU migration.
This is especially important in a NUMA system, where running on the "wrong" core will be slower long-term, even once the caches populate.
There are three types of general schedulers:
The job scheduler, also known as the long-term scheduler.
The short-term scheduler, also known as the CPU scheduler.
The medium-term scheduler, mostly used to swap jobs in and out so there can be non-blocking calls; this is usually to avoid having too many or too few I/O jobs.
An operating systems book typically shows a nice automaton of the states these schedulers move jobs between. The job scheduler puts things from the job queue into the ready queue, and the CPU scheduler takes things from the ready queue to the running state. The algorithm, like any other software, must run on a CPU/core; it is most likely part of the kernel.
It doesn't make sense for the scheduler itself to be preempted. The jobs in the queues can be preempted while running, for I/O, etc. No, the kernel does not have to schedule itself to allocate the task; it simply gets CPU time without scheduling itself. And yes, the data is most likely in RAM; it is not clear whether it is worth keeping in the CPU cache.

Idle time in RTOS

Since idle tasks are generally used to safely consume CPU time that is not required by other software, what would happen if there were no idle task? Would the RTOS just automatically create one? Also, what other purposes do idle tasks serve other than consuming time?
what would happen if there was no idle task? Would the RTOS just automatically create one?
I doubt there is any RTOS that would do that. If there were no idle task, the list of runnable tasks would be empty and the scheduler would probably crash. Generally, the single most important reason for the idle thread's existence is to make the list of runnable tasks "never empty". This simplifies the code of the scheduler.
Also, what other purpose do idle tasks serve other than consuming time?
In some systems the idle task can perform some low-priority activities (for example, some garbage collection). It can also switch the core to a low-power mode, which is especially useful on embedded devices. In that case, when the idle task runs it means there is nothing more to do, so the core can be stopped to wait for the next event (a hardware interrupt or timeout) without using too much power. When the next event arrives the core is woken by hardware and the event is processed; either some "normal" thread starts running, or - if there is still nothing to do - the idle thread resumes and again switches to low-power mode.
If the CPU clock is running, instructions must be executed; if there were no idle task, then your OS is broken. The idle loop is an intrinsic part of the RTOS, not a user task, so the RTOS does not need to "create one automatically".
A low-priority user task that never yields will prevent the idle loop from running, which is not necessarily a good thing. Such a task is not the same thing as the idle loop. For one thing, any CPU-usage tools the RTOS supports would report 100% usage all the time if such a task existed. Execution of the idle loop is not counted as CPU usage, because when idle the CPU is always ready to respond to any interrupt event - the idle loop never causes any ready task to be delayed.
The idle task, or "idle loop", is typically just that: an empty loop that the program counter is set to when there is nothing else to do. On some architectures the loop may include a "wait-for-interrupt" (WFI) instruction that stops core execution (stops clocking the core) to reduce power consumption. Since any context switch necessarily requires an interrupt to occur, if WFI is supported the processor can simply stop in this loop.
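For example, on an ARM Cortex-M target with CMSIS, the idle task might look like the sketch below; __WFI() is the CMSIS intrinsic for the wait-for-interrupt instruction, and the task signature is made up (it varies by RTOS):

    /* Assumes a CMSIS environment providing __WFI(); include your
       vendor's device header to get it. */
    void idle_task(void *arg)
    {
        (void)arg;
        for (;;) {
            /* optional: low-priority background work / idle hooks here */
            __WFI();    /* stop the core clock until the next interrupt */
        }
    }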
Some RTOSes support user hooks for the idle loop: low-priority, run-to-completion functions that can operate in the background in the idle loop's context.
what other purpose do idle tasks serve other than consuming time?
Most commonly, it does two things:
1. Garbage (resource) collection or cleanup
2. Initiating steps to reduce power consumption

Round-robin scheduling algorithm

I'm studying operating systems from this book and my prof's slides, and I've arrived at the "Process scheduling algorithms" chapter. Talking about the Round Robin (RR) algorithm, I found some inconsistencies. I understand that it is a preemptive version of the FCFS algorithm with a time slice (quantum).
From now on, I will use the following notation:
#1 = prof's version
#2 = book's version
#3 = other version
Here's the inconsistency (suppose a quantum of 100ms):
#1 The RR uses two queues (Q1, Q2):
Q1: queue for processes that have not yet used up their quantum;
Q2: queue for processes that have used up their quantum;
The scheduler takes the process from the head of Q1;
If the process ends before the quantum expires, it releases the CPU voluntarily and the scheduler takes the next process from Q1;
If the process doesn't end before the quantum expires, it is preempted and placed at the end of Q2;
When a process becomes ready it's placed at the end of Q1;
When Q1 is empty, Q1 and Q2 are swapped;
So when a process is blocked by an I/O request (e.g. after 30 ms) and its quantum has not expired yet, it is placed at the end of Q1 (I guess), and when it is scheduled again it will use the CPU for its remaining time (70 ms in this case).
#2 (The book did not talk about multiple queues, so I assume it will use just one queue)
The scheduler takes the process from the head of the ready queue;
If the process ends before the quantum expires, it releases the CPU voluntarily and the scheduler takes the next process from the ready queue;
If the process doesn't end before the quantum expires, it is preempted and placed at the end of the ready queue;
#3 Source
The scheduler takes the first process in the ready queue;
If the process ends before the quantum expires, it releases the CPU voluntarily and the scheduler takes the next process from the ready queue;
If the process doesn't end before the quantum expires, it is preempted and placed at the end of the ready queue;
If the process is blocked by an I/O request, it is placed in a waiting queue, and when it becomes ready it is placed in the ready queue again;
To me, these are 3 different implementations of the RR scheduling algorithm. I think the most sensible is #3, because #1 can cause starvation (if the process is placed in Q2 and new processes keep arriving in Q1, the process will never be scheduled again) and #2 will waste CPU time when a process is blocked by an I/O request. So, my question is: which one is the right one?
Round robin in theory
Round Robin scheduling can be visualized quite well by thinking of an analog clock: the hand turns around at constant speed, so it is within the slice of a single digit for 1/12 of the time it takes for one complete revolution.
A single digit thus gets some slice of the total available amount of the resource. And, most importantly, there's a fixed order in which the digits get served: after the hand has just passed some digit, that digit will only be visited again after the hand has passed all the other digits.
Looking at the variants you presented, #2, the version from the book, matches this exactly: after a task has been served it is put at the end of the (often so-called) ready queue, and thus only gets served again after all of the other tasks have been served once.
Round robin, as a theoretical scheduling algorithm, only considers the scheduling of multiple consumers (tasks) onto a single resource (the CPU).
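Since the book's variant is the purest form, a tiny runnable C simulation of it may help (the task lengths and quantum are made-up numbers): take from the head of the ready queue, run one quantum, and put unfinished tasks back at the tail.

    #include <stdio.h>

    #define NTASKS 3

    int main(void)
    {
        int remaining[NTASKS] = { 250, 100, 70 };  /* ms of work per task */
        int queue[NTASKS] = { 0, 1, 2 };           /* task ids, head first */
        int qlen = NTASKS;
        const int quantum = 100;                   /* ms */

        while (qlen > 0) {
            int t = queue[0];                      /* head of ready queue */
            for (int i = 1; i < qlen; i++)         /* dequeue the head */
                queue[i - 1] = queue[i];
            qlen--;

            int ran = remaining[t] < quantum ? remaining[t] : quantum;
            remaining[t] -= ran;
            printf("task %d ran for %d ms\n", t, ran);

            if (remaining[t] > 0)
                queue[qlen++] = t;                 /* preempted: back to tail */
        }
        return 0;
    }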
Variants
Some common variations of the basic round robin scheduling are to either use different slice sizes for different tasks, or to dynamically adjust the slice of a task based on some metric, or even to provide more than a single slice to some tasks.
In practice
When you are scheduling tasks, you have to manage more than just the CPU as a single resource; there will be other resources that need to be managed, like IO devices.
Very simple schedulers just ignore that fact, and leave tasks that are currently waiting for some other resource in the task queue for the CPU.
So when such a task gets its time slice, all it'll do is find that it still needs to wait for that other resource and hand back the CPU, just to be put back into the task queue by the scheduler. Starting the task, checking that it still needs to wait for the other resource, and stopping it again takes time that could better be spent on a task that can actually use the CPU.
To solve this, one usually has a task queue for each resource that is managed, i.e. one for the CPU, one for each IO device, etc. When a task is doing a blocking IO call, it is then removed from the queue for the CPU and put into the queue of the device it is accessing. This way, tasks that are waiting for a resource other than the CPU don't sit in the CPU task queue (and thus waste no time getting started and immediately stopped again).
This is what #3 is talking about (when you look again, you'll see that it talks about "waiting queues").
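A sketch of those per-resource queues, with every name (cpu_ready, disk_waiting, the queue helpers) invented for the example:

    struct task;

    struct queue { struct task *head; };      /* details elided */

    extern struct queue cpu_ready;            /* tasks that can use the CPU */
    extern struct queue disk_waiting;         /* tasks blocked on the disk */
    extern struct task *current;

    extern void queue_remove(struct queue *q, struct task *t);
    extern void queue_push(struct queue *q, struct task *t);
    extern struct task *queue_pop(struct queue *q);
    extern void start_disk_io(void);
    extern void schedule(void);

    void blocking_read(void)
    {
        start_disk_io();
        queue_remove(&cpu_ready, current);    /* no longer schedulable */
        queue_push(&disk_waiting, current);
        schedule();                           /* run someone who can use the CPU */
    }

    /* Runs when the disk signals completion. */
    void disk_isr(void)
    {
        struct task *t = queue_pop(&disk_waiting);
        queue_push(&cpu_ready, t);            /* ready again, at the tail */
    }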
Another kind of "waiting queue"
Often there's also a "waiting queue" when you're talking about multi-level schedulers: in that case, the ready queue is the queue of tasks that are ready and get scheduled by the primary scheduler (using round robin, for example). If a task blocks because of an IO operation, it is - as described above - put into the queue of the corresponding resource. When that resource becomes available again, the task is first put into the waiting queue, from which a secondary scheduler (which is just a task of the primary scheduler) eventually takes it and puts it into the ready queue of the primary scheduler.
What your prof's version is about
Version #1 is probably easier to understand if you rename the queues to something like "queue of tasks that have not yet run in this turn" and "queue of tasks that have already run in this turn". This is basically just a workaround if you don't want to have circular lists. So it's round robin too, but a bit obscured.
[..] can cause starvation (if the process is placed in Q2 and new processes keep arriving in Q1, the process will never be scheduled again) [..]
This is a very good observation. It could be solved by inserting new, ready tasks at the end of Q2 instead of Q1. If suitable, this is a very good starting point for a discussion when your prof asks for questions.
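Here is a sketch of the two-queue variant with that fix applied; the queue type and helpers are placeholders:

    struct task;
    struct queue { struct task *head; };

    extern int queue_empty(const struct queue *q);
    extern void queue_push(struct queue *q, struct task *t);
    extern struct task *queue_pop(struct queue *q);

    static struct queue q1;    /* tasks that have not yet run this turn */
    static struct queue q2;    /* tasks that have already run this turn */

    struct task *pick_next(void)
    {
        if (queue_empty(&q1)) {     /* turn is over: swap the queues */
            struct queue tmp = q1;
            q1 = q2;
            q2 = tmp;
        }
        return queue_pop(&q1);
    }

    void on_quantum_expired(struct task *t)
    {
        queue_push(&q2, t);         /* already ran this turn */
    }

    void on_task_ready(struct task *t)
    {
        queue_push(&q2, t);         /* the fix: newcomers wait for the next
                                       turn, so nobody in Q2 starves */
    }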
The first implementation can not only cause starvation, it can also cause deadlocks. If we simply modify step 5 of the first implementation, the method can be made to work just fine.
The second approach is round-robin in its purest form.
The third approach is not round-robin but is in fact a smarter version which understands that an I/O-bound process should not be given another chance too soon, as it will probably not be ready yet.
If you continue reading that book you will come to the next scheduling algorithm, the multi-level queue. That is actually the best of the different variations of round robin. In a multi-level queue we put all incoming processes in one queue and then, depending on whether they finished within their first time slice or not, put them in another queue. This new queue has a larger time slice and ends up holding the CPU-bound processes.
Using multi-level queues (say 4-5 of them), the CPU combs all the incoming processes into various classes and then picks an optimal number of processes from each queue so that it is neither over-subscribed nor under-subscribed.
Any process which arrives in the system is queued at the end of the ready queue. The process at the head of the ready queue is selected and allowed to execute on the CPU for a time quantum q. At the expiry of q, the process is queued at the tail of the ready queue, and the next process scheduled is the one at the head of the ready queue. This is known as Round Robin Scheduling.
OR
In round robin scheduling, processes are dispatched FIFO but are given a limited amount of CPU time called a time slice or quantum. If a process does not complete before its CPU time expires, the CPU is preempted and given to the next waiting process; the preempted process is then placed at the back of the ready queue.

Who schedules the scheduler in an OS - isn't it a chicken-and-egg scenario?

Who schedules the scheduler?
Which is the first task created, and how is this first task created? Isn't any resource or memory required for it? Isn't it like a chicken-and-egg scenario?
Isn't the scheduler a task? Does it get the CPU at the end of each time slice to check which task needs to be given the CPU?
Are there any good links that make a person think and understand these concepts deeply, rather than spelling out theory that needs to be learned by heart?
The scheduler is scheduled by:
an (external) event, such as an interrupt (disk done, mouse click, timer tick),
or an internal event (such as the completion of a thread, a thread signalling that it needs to wait for something, a thread signalling that it has released a resource, or a trap caused by a thread doing something illegal, like division by zero).
In short, it is triggered by any event that might require the set of tasks to be run, or their priorities, to be reevaluated. The scheduler decides which task(s) run next and passes control to the next task.
Typically, this "scheduling" of the scheduler is caused by the code associated with a hardware interrupt or the code associated with a system call.
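As a hedged sketch (no particular kernel's API), here is the common pattern: whatever code wakes a task sets a flag, and every exit path from interrupt or system-call code checks it - this is the "scheduling of the scheduler":

    struct task { int priority; };

    extern struct task *current;
    extern void make_ready(struct task *t);
    extern void schedule(void);             /* pick the next task, switch to it */

    static int need_resched;

    /* Called from ISRs or system calls that make a task runnable. */
    void wake_task(struct task *t)
    {
        make_ready(t);
        if (t->priority > current->priority)
            need_resched = 1;               /* ask for a reschedule */
    }

    /* Called on every return from an interrupt or system call. */
    void kernel_exit(void)
    {
        if (need_resched) {
            need_resched = 0;
            schedule();                     /* the scheduler "gets scheduled" */
        }
    }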
While you can think of the scheduler as a real thread, in practice it doesn't need to be implemented that way, because it executes at a higher priority than any other task. Sophisticated OSes may in fact set aside a special thread that is the scheduler, and mark it busy when the scheduler gets control. That makes the accounting pretty, but the bogus thread isn't scheduled by the scheduler.
One can have multiple schedulers: the highest-priority one (e.g., the one we just described), and other schedulers which really are threads and are run like other user tasks. Such lower-priority schedulers tend to be used to manage actions which occur at much longer intervals, such as background jobs.
It is usually invoked periodically by a timed CPU interrupt.