What is the difference between OS privilege levels and privilege levels of the underlying hardware? Do all system calls cause a trap to the kernel? Why do system calls cause a trap? Is it because of privileged instructions such as IN in their assembly code?
To answer your questions directly:
What is the difference between OS privilege levels and privilege levels of the underlying hardware?
The privileges that must be enforced at the code level (i.e., privileged instructions) must be supported in hardware. Many OS security levels (e.g., permission to access a specific hardware device) do not require hardware support for authenticating that very specific function, but they do at least require hardware support to block code from accessing the device. So, in short, the ability of the OS to implement privilege levels depends on the underlying hardware, but the two need not be the same.
Do all system calls cause a trap to the kernel? Why do system calls cause a trap? Is it because of privileged instructions such as IN in their assembly code?
That's essentially correct. Trapping is the natural mechanism by which unprivileged code transitions to a privileged level. If, for example, a user program needs to access some piece of hardware, it cannot do so itself, because IN and OUT are privileged; it must trap to the kernel, which performs the required operations and then returns.
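The transition described above can be sketched with a toy model. This is only an illustration, not how any real CPU or kernel is implemented: ToyCPU, GeneralProtectionFault and sys_write_port are hypothetical names, and the "trap" is an ordinary method call that raises and then restores a privilege field.

```python
class GeneralProtectionFault(Exception):
    """Raised when user-mode code tries a privileged instruction."""


class ToyCPU:
    USER, KERNEL = 3, 0  # ring numbers, x86-style

    def __init__(self):
        self.cpl = self.USER   # current privilege level
        self.ports = {}        # pretend I/O ports

    def out(self, port, value):
        # Privileged in this model: only ring 0 may touch ports.
        if self.cpl != self.KERNEL:
            raise GeneralProtectionFault("OUT executed in user mode")
        self.ports[port] = value

    def syscall(self, handler, *args):
        # The trap: raise privilege, run the kernel's handler, drop back.
        saved = self.cpl
        self.cpl = self.KERNEL
        try:
            return handler(self, *args)
        finally:
            self.cpl = saved


def sys_write_port(cpu, port, value):
    # Kernel-side handler: it runs at ring 0, so OUT succeeds.
    cpu.out(port, value)
    return 0
```

Calling cpu.out(...) directly on a fresh ToyCPU raises the fault, while cpu.syscall(sys_write_port, ...) succeeds and leaves the CPU back in user mode afterwards.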
Related
I read that there are some privileged instructions in our system that can only be executed in kernel mode. But I am unable to understand who makes these instructions privileged. Is it the hardware manufacturer that hardwires some harmful instructions as privileged with the help of the mode bit, or is it the OS designers who make instructions privileged so that they work only in privileged mode?
Kernel vs. user mode, and which instructions aren't allowed in user mode, is part of the ISA. That's baked into the hardware.
CPU architects usually have a pretty good idea of what OSes need to do and want to prevent user-space from doing, so these choices at least make privilege levels possible, i.e. make it impossible for user-space to simply take over the machine.
But that's not the full picture: on some ISAs, such as x86, later ISA extensions have added control-register flag bits that let the OS choose whether some other instructions are privileged or not. On x86 that's done for instructions that could leak information about kernel ASLR, or make timing side-channels easier.
For example, rdpmc (read performance monitor counter) can only be used from user-space if specially enabled by the kernel. rdtsc (Read TimeStamp Counter) can be executed from user-space by default, but the TSD (TimeStamp Disable) flag in CR4 can restrict its use to privilege level 0 (kernel mode). Stopping user-space from using high-resolution timing is a brute-force way of defending against timing side-channel attacks.
Another x86 extension defends against leaking kernel addresses, to keep kernel ASLR secret; CR4.UMIP (User Mode Instruction Prevention) disables instructions like sgdt, which reads the virtual address of the GDT. Those instructions were basically useless for user-space in the first place and, unlike rdtsc, could easily have been privileged all along.
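The rules in the last two paragraphs can be written down as a small decision function. This is a simplified sketch of the architectural behaviour, not something that touches real hardware: faults_in_ring is a hypothetical helper, and it only models the three CR4 flags mentioned here (TSD is bit 2, PCE is bit 8, UMIP is bit 11).

```python
CR4_TSD = 1 << 2    # Time Stamp Disable
CR4_PCE = 1 << 8    # Performance-monitoring Counter Enable
CR4_UMIP = 1 << 11  # User-Mode Instruction Prevention

UMIP_INSTRUCTIONS = {"sgdt", "sldt", "sidt", "smsw", "str"}


def faults_in_ring(instr: str, cr4: int, cpl: int) -> bool:
    """Return True if `instr` would raise #GP at privilege level `cpl`."""
    if cpl == 0:
        return False  # ring 0 may execute all of these
    if instr == "rdtsc":
        return bool(cr4 & CR4_TSD)      # allowed unless the OS sets TSD
    if instr == "rdpmc":
        return not (cr4 & CR4_PCE)      # blocked unless the OS sets PCE
    if instr in UMIP_INSTRUCTIONS:
        return bool(cr4 & CR4_UMIP)     # allowed unless the OS sets UMIP
    raise ValueError(f"instruction not modelled: {instr}")
```

Note the asymmetry: rdpmc is opt-in for user-space, while rdtsc and the UMIP group are opt-out. In every case it is the kernel, through CR4, that makes the choice.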
The Linux Kernel option to enable use of this extension describes it:
The User Mode Instruction Prevention (UMIP) is a security feature in newer Intel processors. If enabled, a general protection fault is issued if the SGDT, SLDT, SIDT, SMSW or STR instructions are executed in user mode. These instructions unnecessarily expose information about the hardware state.
The vast majority of applications do not use these instructions. For the very few that do, software emulation is provided in specific cases in protected and virtual-8086 modes. Emulated results are dummy.
Setting a new address for the IDT/GDT/LDT (e.g. lgdt/lidt) is of course a privileged instruction; those let you take over the machine. But until kernel ASLR was a thing, there wasn't any reason to stop user-space from reading the address. It could be in a page that had its page-table entry set to kernel only, preventing user-space from doing anything with that address. (... until Meltdown made it possible for user-space to use a speculative side-channel to read data from kernel-only pages that were hot in cache.)
In hardware-assisted virtualization, the guest operating system runs in Ring 0, therefore it can run privileged instructions directly, am I right?
So why, in full virtualization, won't the VMM just run guest privileged instructions in Ring 0? Why do we need to translate them?
One reason that comes to mind is different architectures (different guest and host). Is there anything more?
therefore it can run privileged instruction directly, am I right?
No, that is not completely true. Privileged instructions would still attempt to access privileged resources, and thus cannot be allowed to see or change them behind the VMM's back. Therefore they trap. That is why a classic VMM executes guests with a "trap-and-emulate" approach. The majority of guest instructions are non-privileged and are executed directly; privileged ones trap and are emulated one by one. No translation, that is, transformation of large (>1 guest instruction) blocks of code, is required in any case.
Alternatively, a system resource can be made non-privileged, so that instructions accessing it become innocuous inside the virtualized environment.
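A toy version of the trap-and-emulate loop might look like the following. It is a sketch under heavy assumptions: the instruction names (add, load_idt, halt) and the guest-state dictionary are made up for illustration, and execute_directly merely stands in for the hardware running guest code natively.

```python
PRIVILEGED = {"load_idt", "halt"}   # hypothetical instruction names


class Trap(Exception):
    def __init__(self, op, arg):
        super().__init__(op)
        self.op, self.arg = op, arg


def execute_directly(instr, guest):
    """Stand-in for the hardware running one guest instruction."""
    op, arg = instr
    if op in PRIVILEGED:
        raise Trap(op, arg)          # hardware refuses; the VMM gets control
    if op == "add":
        guest["acc"] += arg          # innocuous: runs without VMM involvement
    else:
        raise ValueError(f"unknown instruction: {op}")


def emulate(trap, guest):
    """The VMM updates the guest's *virtual* state, never the real hardware."""
    if trap.op == "load_idt":
        guest["virtual_idt"] = trap.arg
    elif trap.op == "halt":
        guest["halted"] = True


def run_guest(program):
    guest = {"acc": 0, "virtual_idt": None, "halted": False, "traps": 0}
    for instr in program:
        if guest["halted"]:
            break
        try:
            execute_directly(instr, guest)
        except Trap as trap:
            guest["traps"] += 1
            emulate(trap, guest)
    return guest
```

Each privileged instruction is handled individually when it traps; nothing in the loop rewrites or translates blocks of guest code.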
So why, in full virtualization, won't the VMM just run guest privileged instructions in Ring 0?
"Ring 0" is just a number, it does not mean much except that certain instructions receive new semantics: instead of faulting as they would do on the higher rings they are allowed to access system resources. But inside a VMM, they are not allowed to do that.
why we need to translate them?
We don't. Individual privileged instructions may be trapped and then emulated, or interpreted. "Translation" as a term has meaning only for blocks of instructions.
One reason that comes to mind is different architectures
That is a sort of degenerate case in which 100% of guest instructions are "privileged", i.e. they will not behave as expected on the chosen host. It does not make sense to attempt to execute them directly, and interpreting each and every one of them is too slow for many applications. This is where translation, i.e. compilation of bigger blocks, starts making sense.
is there anything more?
For Intel architecture, there are certain architectural idiosyncrasies that sometimes make the idea of (temporarily) disabling hardware-assisted virtualization and falling back to binary translation beneficial in terms of speed and correctness. However, I assume this topic to be part of another, more specific question, as the answer is quite involved and requires deep understanding of Intel VT-x.
I read this paragraph from "Modern Operating Systems" by Tanenbaum:
Most computers have two modes of operation: kernel mode and user mode. The operating system is the most fundamental piece of software and runs in kernel mode (also called supervisor mode). In this mode it has complete access to all the hardware and can execute any instruction the machine is capable of executing. The rest of the software runs in user mode, in which only a subset of the machine instructions is available.
I am unable to understand how they describe the difference between these two modes on the basis of the machine instructions available. At the user end, any software has the capability to make changes at the hardware level; for example, we have software that can affect the functioning of the CPU or play with registry details. So how can we say that in user mode only a subset of the machine instructions is available?
The instructions that are available only in kernel mode tend to be very few. They are the instructions that are only needed to manage the system.
For example, most processors have a HALT instruction that stops the CPU and is used for system shutdowns. Obviously you would not want any user to be able to execute HALT and stop the computer for everyone. Such instructions are therefore made accessible only in kernel mode.
Processors use a table of handlers for interrupts and exceptions. The operating system creates such a table listing the handlers for these events. Then it loads one or more registers giving the location (and size) of the table. The instructions for loading these registers are kernel mode only. Otherwise, any application could create total havoc on the system.
Instructions of this nature trigger an exception if executed in user mode.
Such instructions tend to be few in number.
Well, in user-mode, there is definitely a subset of instructions available. This is the reason we have System Calls.
Example:
A user wants to create a new process in C. They cannot do that without entering kernel mode, because a certain set of instructions is only available in kernel mode. So they use the system call fork, which executes the instructions for creating a new process (not available in user mode). A system call is thus a mechanism for requesting that the kernel of the OS do something on the user's behalf that the user cannot write code for directly.
The following excerpt from the link above sums it up best:
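For a concrete feel, here is the same idea in Python on a Unix-like system, where os.fork is a thin wrapper around the fork system call. This is a minimal sketch; run_child_and_wait is a hypothetical helper name, and the child does nothing except exit with a chosen status.

```python
import os


def run_child_and_wait(exit_code: int = 7) -> int:
    """Fork a child that exits immediately; return the child's exit status."""
    pid = os.fork()              # the kernel creates the new process
    if pid == 0:
        # Child: leave right away without running cleanup handlers.
        os._exit(exit_code)
    # Parent: ask the kernel (another system call) for the child's status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

The user program never manipulates process tables or page tables itself; it only asks, and the kernel does the privileged work.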
A program is usually limited to its own address space so that it
cannot access or modify other running programs or the operating system
itself, and is usually prevented from directly manipulating hardware
devices (e.g. the frame buffer or network devices).
However, many normal applications obviously need access to these
components, so system calls are made available by the operating system
to provide well defined, safe implementations for such operations. The
operating system executes at the highest level of privilege, and
allows applications to request services via system calls, which are
often initiated via interrupts. An interrupt automatically puts the
CPU into some elevated privilege level, and then passes control to the
kernel, which determines whether the calling program should be granted
the requested service. If the service is granted, the kernel executes
a specific set of instructions over which the calling program has no
direct control, returns the privilege level to that of the calling
program, and then returns control to the calling program.
I heard of privilege levels, rings, privileged instructions, non privileged instructions, user mode, kernel mode, user space, kernel space.
A user process runs with low privilege, whereas an OS process runs with higher privilege. I also heard about the CPL register, which is responsible for general protection. Also, the CPU only knows the CPL, and it is decided on the basis of which page the instruction belongs to.
I want to know who/what decides initially the privilege level of process?
When it is decided that process will run with low or high privilege level? At compile time? At loading?
What tells that current program will run with specific privilege level? Segment registers? Descriptors? Loader ?
Firstly I see 3 questions.
Who/What decides initially the privilege level of process ?
When it is decided that process will run with low or high privilege level?
What tells that current program will run with specific privilege level?
Secondly to confirm the definition of some terms
When you say privilege level, I believe you are referring to the level of privilege associated with the CPU's processor mode, as opposed to the generic level of any other privilege mechanism available.
When you say process, I believe you are referring to the concept of the currently running program as opposed to some alternate definition.
User processes run in user mode with the user privilege for a given CPU architecture.
Kernel processes run in kernel mode with the supervisor privilege for a given CPU architecture.
Whether the process is user or kernel depends on which flags are set either in segment descriptors when paging isn't used or in page table or page directory entries where paging is used.
This means that the privilege level of a process is determined by where that process's code is located in memory. If it is in kernel space and marked as such using the relevant flags, then it is a kernel process. If it is in user space and marked as such using the relevant flags, then it is a user process.
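As an illustration of the "relevant flags", here is a simplified model of the check made against an x86 page table entry, where bit 0 is Present, bit 1 is Read/Write and bit 2 is User/Supervisor. It is a sketch only: access_allowed is a hypothetical helper, and it deliberately ignores CR0.WP, SMEP, SMAP and the no-execute bit.

```python
PTE_PRESENT = 1 << 0   # page is mapped
PTE_WRITABLE = 1 << 1  # writes allowed
PTE_USER = 1 << 2      # 1 = user-accessible, 0 = supervisor only


def access_allowed(pte: int, cpl: int, write: bool = False) -> bool:
    """Would an access at privilege level `cpl` to this page succeed?"""
    if not pte & PTE_PRESENT:
        return False                     # page fault: not mapped
    if cpl == 3 and not pte & PTE_USER:
        return False                     # user code touching a kernel page
    if write and cpl == 3 and not pte & PTE_WRITABLE:
        return False                     # user write to a read-only page
    return True
```

A page with PTE_USER clear is reachable only at supervisor privilege, which is what marks the memory behind it as kernel space in this model.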
If the process or program you are running isn't the kernel, it is a user process on most modern operating systems. So the decision is effectively made at execution time, and for the kernel itself at operating system initialization, when the kernel is first loaded.
Either the process is the kernel and runs at the supervisor privilege level, or it isn't and runs at the user privilege level.
The CPU checks every access to code or data in memory against the relevant status register (the code segment register, CS, whose low two bits hold the current privilege level on Intel x86, and the Current Program Status Register on ARM).
When a user process needs to access kernel resources, it generally asks the kernel to act on its behalf by making a system call, which switches the privilege level while the kernel services the request for the user process.
As a side note, Kernel Mode Linux allows you to run user processes in kernel / supervisor mode.
Most processors have a trap or software fault instruction that switches the processor into privileged mode. The kernel checks if the user mode process has permission for the particular operation. Since kernel data is protected, the kernel can enforce security policy - the user process can't directly give itself permissions.
Permissions are sometimes called privileges, so that's why I wanted to explain how processor modes work in enforcing security permissions.
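The policy check can be pictured like this. It is a toy sketch: ToyKernel, its permission table and the operation names are all hypothetical, and Python cannot really protect the table the way processor modes protect kernel memory, so the leading underscore is only a stand-in for that protection.

```python
class ToyKernel:
    def __init__(self):
        # Kernel-private table: pid -> set of permitted operations.
        # In a real system user code could not read or write this memory.
        self._permissions = {1: {"read_disk"}, 2: {"read_disk", "write_disk"}}

    def trap(self, pid, operation):
        """Entry point reached via the trap instruction."""
        if operation not in self._permissions.get(pid, set()):
            return "EPERM"               # request refused by policy
        return f"{operation} done for pid {pid}"
```

The caller can only ask. Whether the request is granted depends on data the caller has no way to modify.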
I'm studying for my operating systems final and was wondering if someone could tell me why the OS needs to switch into kernel mode for syscalls?
A syscall is used specifically to run an operation in kernel mode, since ordinary user code is not allowed to do this for security reasons.
For example, if you wanted to allocate memory, the operating system is privileged to do it (since it knows the page tables and is allowed to access memory of other processes), but you as a user program should not be allowed to peek or ruin the memory of other processes.
It's a way of sandboxing you. So you send a syscall requesting the operating system to allocate memory, and that happens at the kernel level.
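As a small illustration, Python's mmap module can request a fresh anonymous region of memory from the operating system (on Unix-like systems this goes through the mmap system call). The helper name allocate_page_and_use is an assumption for this sketch, and one page is an arbitrary size.

```python
import mmap


def allocate_page_and_use():
    """Ask the OS for one page of anonymous memory, write to it, read it back."""
    size = mmap.PAGESIZE
    region = mmap.mmap(-1, size)   # -1: anonymous mapping, not backed by a file
    try:
        region[:5] = b"hello"
        return bytes(region[:5]), len(region)
    finally:
        region.close()             # hand the memory back to the OS
```

The program only states how much memory it wants; choosing physical pages and updating page tables happens inside the kernel.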
Edit: I see now that the Wikipedia article is surprisingly useful on this
Since this is tagged "homework", I won't just give the answer away but will provide a hint:
The kernel is responsible for accessing the hardware of the computer and ensuring that applications don't step on one another. What would happen if any application could access a hardware device (say, the hard drive) without the cooperation of the kernel?