How does the x86_64 `syscall` instruction find the desired system call address? [duplicate]

How does Linux determine the address of another process to execute with a syscall? Like in this example?
mov rax, 59
mov rdi, progName
syscall
It seems there is a bit of confusion with my question. To clarify, what I was asking is how syscall works, independently of the registers or arguments passed: how it knows where to jump, where to return, etc. when another process is called.

syscall
The syscall instruction is really just an Intel/AMD CPU instruction. Here is the synopsis:
IF (CS.L ≠ 1 ) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
(* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *)
THEN #UD;
FI;
RCX ← RIP; (* Will contain address of next instruction *)
RIP ← IA32_LSTAR;
R11 ← RFLAGS;
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
CS.Selector ← IA32_STAR[47:32] AND FFFCH (* Operating system provides CS; RPL forced to 0 *)
(* Set rest of CS to a fixed value *)
CS.Base ← 0;
(* Flat segment *)
CS.Limit ← FFFFFH;
(* With 4-KByte granularity, implies a 4-GByte limit *)
CS.Type ← 11;
(* Execute/read code, accessed *)
CS.S ← 1;
CS.DPL ← 0;
CS.P ← 1;
CS.L ← 1;
(* Entry is to 64-bit mode *)
CS.D ← 0;
(* Required if CS.L = 1 *)
CS.G ← 1;
(* 4-KByte granularity *)
CPL ← 0;
SS.Selector ← IA32_STAR[47:32] + 8;
(* SS just above CS *)
(* Set rest of SS to a fixed value *)
SS.Base ← 0;
(* Flat segment *)
SS.Limit ← FFFFFH;
(* With 4-KByte granularity, implies a 4-GByte limit *)
SS.Type ← 3;
(* Read/write data, accessed *)
SS.S ← 1;
SS.DPL ← 0;
SS.P ← 1;
SS.B ← 1;
(* 32-bit stack segment *)
SS.G ← 1;
(* 4-KByte granularity *)
The most important part is the two operations that save and redirect the RIP register:
RCX ← RIP
RIP ← IA32_LSTAR
So in other words, there must be kernel code at the address held in IA32_LSTAR (a model-specific register that the kernel writes at boot), and RCX holds the return address back into user space.
The CS and SS segments are also tweaked so your kernel code will be able to further run at CPU Level 0 (a privileged level.)
#UD (invalid opcode) is raised if syscall is executed outside 64-bit mode or if SYSCALL/SYSRET has not been enabled in IA32_EFER, as the check at the top of the synopsis shows.
How is RAX interpreted?
This is just an index into a table of kernel function pointers. First the kernel does a bounds-check (and returns -ENOSYS if RAX > __NR_syscall_max), then dispatches to (C syntax) sys_call_table[rax](rdi, rsi, rdx, r10, r8, r9);
; Intel-syntax translation of Linux 4.12 syscall entry point
... ; save user-space registers etc.
call [sys_call_table + rax * 8] ; dispatch to sys_execve() or whatever kernel C function
;;; execve probably won't return via this path, but most other calls will
... ; restore registers except RAX return value, and return to user-space
Modern Linux is more complicated in practice because of workarounds for x86 vulnerabilities like Meltdown and L1TF by changing the page tables so most of kernel memory isn't mapped while user-space is running. The above code is a literal translation (from AT&T syntax) of call *sys_call_table(, %rax, 8) from ENTRY(entry_SYSCALL_64) in Linux 4.12 arch/x86/entry/entry_64.S (before Spectre/Meltdown mitigations were added). Also related: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? has some more details about the kernel side of system-call dispatching.
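If you want to poke at this dispatch from the user-space side, a minimal C sketch using glibc's syscall(2) wrapper shows the same register mapping: the wrapper puts the call number in RAX and the arguments in RDI, RSI, RDX before executing the syscall instruction.
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "hello via raw syscall\n";
    /* number -> RAX (SYS_write is 1 on x86-64), args -> RDI, RSI, RDX */
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    return 0;
}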
Fast?
The instruction is said to be fast. This is because in the old days one had to use a software interrupt such as int 0x80. An interrupt makes use of the kernel stack; it pushes many registers on the stack and uses the rather slow IRET to exit the exception state and return to the address just after the interrupt. This is generally much slower.
With the syscall you may be able to avoid most of that overhead. However, in what you're asking, this is not really going to help.
Another instruction used along with syscall is swapgs. It gives the kernel a way to access its own per-CPU data and stack. You should look at the Intel/AMD documentation about those instructions for more details.
New Process?
The Linux system has what it calls a task table. Each process and each thread within a process is actually called a task.
When you create a new process, Linux creates a task. For that to work, it runs code that does things such as:
Make sure the executable exists
Set up a new task (including parsing the ELF program headers from that executable to create memory mappings in the newly created virtual address space)
Allocate a stack buffer
Load the first few blocks of the executable (as an optimization for demand paging), allocating some physical pages for the virtual pages to map to
Set the start address in the task (the ELF entry point from the executable)
Mark the task as ready (a.k.a. running)
This is, of course, super simplified.
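As a hedged user-space illustration of what triggers that path, the classic fork() + execve() pair asks the kernel to create the new task and load the ELF (the program path and arguments below are just examples):
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();                       /* kernel duplicates the current task */
    if (pid == 0) {
        char *argv[] = { "/bin/echo", "started by execve", NULL };
        char *envp[] = { NULL };
        execve("/bin/echo", argv, envp);      /* kernel parses the ELF, maps it, sets the saved RIP */
        perror("execve");                     /* only reached if execve failed */
        return 1;
    }
    waitpid(pid, NULL, 0);                    /* parent waits for the child task */
    return 0;
}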
The start address is defined in your ELF binary. It really only needs to determine that one address, save it as the task's saved RIP, and "return" to user-space. The normal demand-paging mechanism will take care of the rest: if the code is not yet loaded, it will generate a #PF page-fault exception and the kernel will load the necessary code at that point. Although in most cases the loader will already have some part of the software loaded as an optimization to avoid that initial page fault.
(A #PF on a page that isn't mapped would result in the kernel delivering a SIGSEGV segfault signal to your process, but a "valid" page fault is handled silently by the kernel.)
All new processes usually get loaded at the same virtual address (ignoring PIE + ASLR). This is possible because we use the MMU (Memory Management Unit). That coprocessor translates memory addresses between virtual address spaces and physical address space.
(Editor's note: the MMU isn't really a coprocessor; in modern CPUs the virtual-memory logic is tightly integrated into each core, alongside the L1 instruction/data caches. Some ancient CPUs did use an external MMU chip, though.)
Determine the Address?
So, now we understand that all processes start at the same virtual address (0x400000 under Linux is the default chosen by ld for non-PIE executables). To determine the real physical address we use the MMU. How does the kernel decide on that physical address? Well, it has a memory allocation function. It's that simple.
It calls a "malloc()" type of function which searches for a memory block which is not currently used and creates (a.k.a. loads) the process at that location. If no memory block is currently available, the kernel checks for swapping something out of memory. If that fails, the creation of the process fails.
In case of a process creation, it will allocate pretty large blocks of memory to start with. It is not unusual to allocate 1Mb or 2Mb buffers to start a new process. This makes things go a lot faster.
Also, if the process is already running and you start it again, a lot of the memory used by the already running instance can be reused. In that case the kernel does not allocate/load those parts. It will use the MMU to share those pages that can be made common to both instances of the process (i.e. in most cases the code part of the process can be shared since it is read-only, some part of the data can be shared when it is also marked as read-only; if not marked read-only, the data can still be shared if it wasn't modified yet--in this case it's marked as copy on write.)

Related

Compiling an OS and defining the system calls

I'm trying to better understand operating systems, not the theory behind them but how real people write real OS code.
I know most OS's are written in C. I know the source code for these OS's includes calls to functions like malloc, calloc, etc., to allocate memory for a process, and so on.
Under normal conditions, i.e., when compiling code destined to run on an OS, I know that the C compiler will use the underlying OS's system calls to implement these functions. But when compiling the source code for these OS's, how does the compiler know what to do? The system calls don't exist yet, because they're defined by the OS. Does the compiler just call some assembly routine, which will eventually become a system call?
It is complex because you need to understand several things about OS development to get the big picture. Overall, the OS isn't like a process that executes in order the way average user-mode code does. When the computer boots, the OS executes in order to set up its environment. After that, the OS is basically system calls and interrupts.
Each CPU works differently but most CPUs will have an interrupt table mechanism and a syscall mechanism. The interrupt table specifies where to jump for a certain interrupt number and the syscall mechanism is probably one register containing the address of the entry point for a syscall. It works like this on x86-64 (most desktop/laptop computers). x86-64 has the IDT (Interrupt Descriptor Table) and the syscall register is IA32_LSTAR.
The OS isn't written in C so that you can call malloc() or the like. The OS is written in C because C can be made static and freestanding (all the code needed is in the executable, and it doesn't rely on any software external to the executable). Actually, when writing an OS, you cannot call malloc(). You need to avoid standard library implementations and use static, freestanding code (the base of C: structs, pointers, arithmetic, variables, etc.). C is also used because you can modify arbitrary memory locations with pointers. For example,
unsigned int* ptr = (unsigned int*)0x1234;
*ptr = 0x87654321;
makes sure that address 0x1234 contains 0x87654321. You can also use binary operators (and, or, xor, shift, etc) to modify memory at the bit level.
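For instance, here is a small sketch of those bit-level operators applied to a memory-mapped device register; the address 0x1234 and the bit positions are purely hypothetical, as in the snippet above.
#include <stdint.h>

#define STATUS_REG ((volatile uint32_t *)0x1234)  /* hypothetical device register */

void enable_device(void) {
    *STATUS_REG |= (1u << 3);    /* set bit 3 */
    *STATUS_REG &= ~(1u << 7);   /* clear bit 7 */
}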
To answer your question, if you want to define the system calls in an OS that you write yourself, you simply decide that your syscalls work that way. When you write your syscall handler, you rely on the fact that someone using your OS knows that a certain syscall number requests a certain operation, and you oblige by doing that. For example, Linux follows the System V AMD64 ABI convention for which registers carry the syscall arguments (and RAX carries the syscall number). On x86-64 Linux, from user mode, you put your syscall number in RAX and use the instruction syscall. The processor then looks in IA32_LSTAR for the address of the syscall handler and jumps to it. The processor (the core) is now in kernel mode (in the kernel). The kernel then looks at RAX for the syscall number and answers the request by doing the associated operation (after several checks and a bunch of other things).
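To make that concrete, here is a hedged sketch of what a hobby kernel's dispatcher might look like after its assembly entry stub (the one pointed to by IA32_LSTAR) has saved the user registers. All the names (syscall_table, sys_write, sys_exit, syscall_dispatch) are invented for illustration; they are not Linux's actual symbols.
#include <stdint.h>
#include <stddef.h>

typedef int64_t (*syscall_fn)(uint64_t, uint64_t, uint64_t);

/* Stub handlers, just for illustration */
static int64_t sys_write(uint64_t fd, uint64_t buf, uint64_t len) { (void)fd; (void)buf; return (int64_t)len; }
static int64_t sys_exit(uint64_t code, uint64_t u1, uint64_t u2)  { (void)code; (void)u1; (void)u2; return 0; }

static const syscall_fn syscall_table[] = {
    [1]  = sys_write,   /* same numbers as Linux, purely by convention */
    [60] = sys_exit,
};
#define MAX_SYSCALL (sizeof syscall_table / sizeof syscall_table[0])

/* Called with the number the user left in RAX and the arguments from RDI, RSI, RDX, ... */
int64_t syscall_dispatch(uint64_t nr, uint64_t a1, uint64_t a2, uint64_t a3) {
    if (nr >= MAX_SYSCALL || syscall_table[nr] == NULL)
        return -38;                     /* -ENOSYS: unknown syscall number */
    return syscall_table[nr](a1, a2, a3);
}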

Privileged instructions in Intel x86-64 [duplicate]

According to this source (Level 3 - 5), specific CPU rings cannot do certain things; for example, ring 1, 2, or 3 code cannot set up the GDT, as the OS kernel would crash otherwise.
While it is obvious that Ring 0 can execute all instructions, I am wondering which instructions cannot be issued in rings 1, 2 and 3.
I could not find anything on Wikipedia, osdev, or similar sources stating which instructions cannot be issued in a specific ring.
The following instructions cannot be executed in Ring 3:
LGDT
LLDT
LTR
LIDT
MOV (to and from control registers only)
MOV (to and from debug registers only)
LMSW
CLTS
INVD
WBINVD
INVLPG
HLT
RDMSR
WRMSR
RDPMC (unless enabled for ring 3 via CR4.PCE)
RDTSC (allowed in ring 3 by default; privileged only if CR4.TSD is set)
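You can see the restriction in action from user space: executing one of these instructions in ring 3 raises a general-protection fault, which Linux then delivers to the process as a signal (SIGSEGV in the case below). A minimal sketch, assuming GCC-style inline assembly:
#include <stdio.h>

int main(void) {
    printf("about to execute hlt in ring 3...\n");
    __asm__ volatile ("hlt");          /* privileged: the CPU faults and the process is killed here */
    printf("never reached\n");
    return 0;
}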

What are Ring 0 and Ring 3 in the context of operating systems?

I've been learning the basics of driver development in Windows, and I keep finding the terms Ring 0 and Ring 3. What do these refer to? Are they the same thing as kernel mode and user mode?
Linux x86 ring usage overview
Understanding how rings are used in Linux will give you a good idea of what they are designed for.
In x86 protected mode, the CPU is always in one of 4 rings. The Linux kernel only uses 0 and 3:
0 for kernel
3 for users
This is the most hard and fast definition of kernel vs userland.
Why Linux does not use rings 1 and 2: CPU Privilege Rings: Why rings 1 and 2 aren't used?
How is the current ring determined?
The current ring is selected by a combination of:
global descriptor table: an in-memory table of GDT entries, and each entry has a privilege field (Privl/DPL) which encodes the ring.
The LGDT instruction sets the address to the current descriptor table.
See also: http://wiki.osdev.org/Global_Descriptor_Table
the segment registers CS, DS, etc., which point to the index of an entry in the GDT.
For example, the index bits of CS select which GDT entry is currently active for the executing code, while the low two bits of CS hold the current privilege level (CPL).
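As a quick sanity check, a normal program can read CS with inline assembly and mask off the low two bits, which hold the current privilege level; on Linux this prints ring 3 (minimal sketch, GCC-style inline assembly assumed):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t cs;
    __asm__ ("mov %%cs, %0" : "=r"(cs));   /* read the code segment selector */
    printf("CS = 0x%x, current ring = %d\n", cs, cs & 3);
    return 0;
}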
What can each ring do?
The CPU chip is physically built so that:
ring 0 can do anything
ring 3 cannot run several instructions and write to several registers, most notably:
cannot change its own ring! Otherwise, it could set itself to ring 0 and rings would be useless.
In other words, cannot modify the current segment descriptor, which determines the current ring.
cannot modify the page tables: How does x86 paging work?
In other words, cannot modify the CR3 register, and paging itself prevents modification of the page tables.
This prevents one process from seeing the memory of other processes for security / ease of programming reasons.
cannot register interrupt handlers. Those are configured by writing to memory locations, which is also prevented by paging.
Handlers run in ring 0, and would break the security model.
In other words, cannot use the LGDT and LIDT instructions.
cannot execute IO instructions like in and out, and thus cannot make arbitrary hardware accesses.
Otherwise, for example, file permissions would be useless if any program could directly read from disk.
More precisely (thanks to Michael Petch): it is actually possible for the OS to allow IO instructions in ring 3; this is controlled by the Task State Segment.
What is not possible is for ring 3 to give itself permission to do so if it didn't have it in the first place.
Linux disallows it by default. See also: Why doesn't Linux use the hardware context switch via the TSS?
How do programs and operating systems transition between rings?
when the CPU is turned on, it starts running the initial program in ring 0 (well, kind of, but it is a good approximation). You can think of this initial program as being the kernel (but it is normally a bootloader that then calls the kernel, still in ring 0).
when a userland process wants the kernel to do something for it like write to a file, it uses an instruction that generates an interrupt such as int 0x80 or syscall to signal the kernel. x86-64 Linux syscall hello world example:
.data
hello_world:
.ascii "hello world\n"
hello_world_len = . - hello_world
.text
.global _start
_start:
/* write */
mov $1, %rax
mov $1, %rdi
mov $hello_world, %rsi
mov $hello_world_len, %rdx
syscall
/* exit */
mov $60, %rax
mov $0, %rdi
syscall
compile and run:
as -o hello_world.o hello_world.S
ld -o hello_world.out hello_world.o
./hello_world.out
GitHub upstream.
When this happens, the CPU calls an interrupt callback handler which the kernel registered at boot time. Here is a concrete baremetal example that registers a handler and uses it.
This handler runs in ring 0, decides whether the kernel will allow the action, performs the action, and restarts the userland program in ring 3.
when the exec system call is used (or when the kernel starts /init), the kernel prepares the registers and memory of the new userland process, then it jumps to the entry point and switches the CPU to ring 3
If the program tries to do something naughty like write to a forbidden register or memory address (because of paging), the CPU also calls some kernel callback handler in ring 0.
But since the userland was naughty, the kernel might kill the process this time, or give it a warning with a signal.
When the kernel boots, it sets up a hardware clock with some fixed frequency, which generates interrupts periodically.
These interrupts run in ring 0 and allow the kernel to schedule which userland processes to wake up.
This way, scheduling can happen even if the processes are not making any system calls.
What is the point of having multiple rings?
There are two major advantages of separating kernel and userland:
it is easier to write programs, as you can be more certain that one won't interfere with another. E.g., one userland process does not have to worry about overwriting the memory of another program because of paging, nor about putting hardware in an invalid state for another process.
it is more secure. E.g. file permissions and memory separation could prevent a hacking app from reading your bank data. This supposes, of course, that you trust the kernel.
How to play around with it?
I've created a bare metal setup that should be a good way to manipulate rings directly: https://github.com/cirosantilli/x86-bare-metal-examples
I didn't have the patience to make a userland example unfortunately, but I did go as far as paging setup, so userland should be feasible. I'd love to see a pull request.
Alternatively, Linux kernel modules run in ring 0, so you can use them to try out privileged operations, e.g. read the control registers: How to access the control registers cr0,cr2,cr3 from a program? Getting segmentation fault
Here is a convenient QEMU + Buildroot setup to try it out without killing your host.
The downside of kernel modules is that other kthreads are running and could interfere with your experiments. But in theory you can take over all interrupt handlers with your kernel module and own the system, that would be an interesting project actually.
Negative rings
While negative rings are not actually referenced in the Intel manual, there are CPU modes that have further capabilities than ring 0 itself, and so are a good fit for the "negative ring" name.
One example is the hypervisor mode used in virtualization.
For further details see:
https://security.stackexchange.com/questions/129098/what-is-protection-ring-1
https://security.stackexchange.com/questions/216527/ring-3-exploits-and-existence-of-other-rings
ARM
In ARM, the rings are called Exception Levels instead, but the main ideas remain the same.
There exist 4 exception levels in ARMv8, commonly used as:
EL0: userland
EL1: kernel ("supervisor" in ARM terminology).
Entered with the svc instruction (SuperVisor Call), previously known as swi before unified assembly, which is the instruction used to make Linux system calls. Hello world ARMv8 example:
hello.S
.text
.global _start
_start:
/* write */
mov x0, 1
ldr x1, =msg
ldr x2, =len
mov x8, 64
svc 0
/* exit */
mov x0, 0
mov x8, 93
svc 0
msg:
.ascii "hello syscall v8\n"
len = . - msg
GitHub upstream.
Test it out with QEMU on Ubuntu 16.04 (note the code above is AArch64, so it needs the aarch64 toolchain and qemu-aarch64):
sudo apt-get install qemu-user gcc-aarch64-linux-gnu
aarch64-linux-gnu-as -o hello.o hello.S
aarch64-linux-gnu-ld -o hello hello.o
qemu-aarch64 hello
Here is a concrete baremetal example that registers an SVC handler and does an SVC call.
EL2: hypervisors, for example Xen.
Entered with the hvc instruction (HyperVisor Call).
A hypervisor is to an OS, what an OS is to userland.
For example, Xen allows you to run multiple OSes such as Linux or Windows on the same system at the same time, and it isolates the OSes from one another for security and ease of debug, just like Linux does for userland programs.
Hypervisors are a key part of today's cloud infrastructure: they allow multiple servers to run on a single hardware, keeping hardware usage always close to 100% and saving a lot of money.
AWS for example used Xen until 2017 when its move to KVM made the news.
EL3: yet another level. TODO example.
Entered with the smc instruction (Secure Monitor Call)
The ARMv8 Architecture Reference Manual DDI 0487C.a - Chapter D1 - The AArch64 System Level Programmer's Model - Figure D1-1 illustrates this beautifully.
The ARM situation changed a bit with the advent of ARMv8.1 Virtualization Host Extensions (VHE). This extension allows the kernel to run in EL2 efficiently:
VHE was created because in-Linux-kernel virtualization solutions such as KVM have gained ground over Xen (see e.g. AWS' move to KVM mentioned above), because most clients only need Linux VMs, and as you can imagine, being all in a single project, KVM is simpler and potentially more efficient than Xen. So now the host Linux kernel acts as the hypervisor in those cases.
From the image we can see that when the bit E2H of register HCR_EL2 equals 1, then VHE is enabled, and:
the Linux kernel runs in EL2 instead of EL1
when HCR_EL2.TGE == 1, we are a regular host userland program. Using sudo can destroy the host as usual.
when HCR_EL2.TGE == 0, we are a guest OS (e.g. when you run an Ubuntu OS inside QEMU KVM inside the host Ubuntu). Doing sudo cannot destroy the host unless there's a QEMU/host kernel bug.
Note how ARM, maybe with the benefit of hindsight, has a better naming convention for the privilege levels than x86, without the need for negative levels: 0 being the lowest and 3 the highest. Higher levels tend to be created more often than lower ones.
The current EL can be queried with the MRS instruction: what is the current execution mode/exception level, etc?
ARM does not require all exception levels to be present, which allows implementations that don't need a given feature to save chip area. ARMv8 "Exception levels" says:
An implementation might not include all of the Exception levels. All implementations must include EL0 and EL1.
EL2 and EL3 are optional.
QEMU for example defaults to EL1, but EL2 and EL3 can be enabled with command line options: qemu-system-aarch64 entering el1 when emulating a53 power up
Code snippets tested on Ubuntu 18.10.
Intel processors (x86 and others) give applications only limited powers. To restrict (protect) critical resources like IO, memory, and ports, the CPU, in liaison with the OS (Windows in this case), provides privilege levels (0 being the most privileged and 3 the least) that map to kernel mode and user mode respectively.
So, the OS runs kernel code in ring 0 - the highest privilege level provided by the CPU - and user code in ring 3.
For more details, see http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection/

entering ring 0 from user mode

Most modern operating systems run in protected mode. Is it possible for user programs to enter "ring 0" by directly setting the corresponding bits in some control registers, or do they have to go through some syscall?
I believe that to access the hardware we need to go through the operating system. But if we know the address of a hardware device, can we just write some assembly code that references the location of the device and access it? What happens when we give the address of some hardware device in assembly code?
Thanks.
To enter Ring 0, you must perform a system call, and by its nature, the system controls where you go, because for the call you simply give an index to the CPU, and the CPU looks inside a table to know what to call. You can't really get around the security aspect (obviously) to do something else, but maybe this link will help.
You can ask the operating system to map the memory of the hardware device into the memory space of your program. Once that's done, you can just read and write that memory from ring 3. Whether that's possible to do, or how to do that, depends on the operating system or the device.
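On Linux, one common (root-only) way to do this is to mmap a physical range through /dev/mem, sketched below; the physical address is just a placeholder, and whether this works at all depends on the kernel's configuration (many kernels restrict /dev/mem).
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    /* 0xfed00000 is only an example physical address */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0xfed00000);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first register: 0x%x\n", regs[0]);   /* plain ring-3 load from device memory */
    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}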
; set PE bit
mov eax, cr0
or eax, 1
mov cr0, eax
; far jump (cs = selector of code segment)
jmp cs:#pm
#pm:
; Now we are in PM
Taken from Wikipedia.
The basic idea is to set bit 0 (the PE bit) of the cr0 control register to 1.
But if you are already in protected mode (i.e. you are in Windows/Linux), the privilege checks prevent you from doing it (you are in ring 3 - the lowest trust level).
So be the first one to get into protected mode.

What is the difference between user and kernel modes in operating systems?

What are the differences between User Mode and Kernel Mode, why and how do you activate either of them, and what are their use cases?
Kernel Mode
In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC.
User Mode
In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode.
Read more
Understanding User and Kernel Mode
These are two different modes in which your computer can operate. Back when computers were the size of a room, if something crashed, it halted the whole computer, so computer architects decided to change that. Modern microprocessors implement at least two different states in hardware.
User mode:
the mode where all user programs execute. It does not have direct access to hardware or to physical RAM. The reason is that if all programs ran in kernel mode, they would be able to overwrite each other's memory. If a program needs to access any of these features, it makes a call to the underlying API. Every process started by Windows, except the System process, runs in user mode.
Kernel mode:
the mode where all kernel code (including the various drivers) executes. It has access to every resource and to the underlying hardware. Any CPU instruction can be executed and every memory address can be accessed. This mode is reserved for drivers which operate at the lowest level.
How the switch occurs.
The switch from user mode to kernel mode is not done automatically by the CPU. The CPU is interrupted by interrupts (timers, keyboard, I/O). When an interrupt occurs, the CPU stops executing the currently running program, switches to kernel mode, and executes an interrupt handler. This handler saves the state of the CPU, performs its operations, restores the state, and returns to user mode.
http://en.wikibooks.org/wiki/Windows_Programming/User_Mode_vs_Kernel_Mode
http://tldp.org/HOWTO/KernelAnalysis-HOWTO-3.html
http://en.wikipedia.org/wiki/Direct_memory_access
http://en.wikipedia.org/wiki/Interrupt_request
CPU rings are the clearest distinction
In x86 protected mode, the CPU is always in one of 4 rings, and the Linux kernel only uses ring 0 (for the kernel) and ring 3 (for user programs). The full walkthrough of how the current ring is determined, what each ring can and cannot do, how programs and operating systems transition between rings, and how ARM exception levels compare is the same as in the "Linux x86 ring usage overview" answer above.
A processor in a computer running Windows has two different modes: user mode and kernel mode. The processor switches between the two modes depending on what type of code is running on the processor. Applications run in user mode, and core operating system components run in kernel mode. While many drivers run in kernel mode, some drivers may run in user mode.
When you start a user-mode application, Windows creates a process for the application. The process provides the application with a private virtual address space and a private handle table. Because an application's virtual address space is private, one application cannot alter data that belongs to another application. Each application runs in isolation, and if an application crashes, the crash is limited to that one application. Other applications and the operating system are not affected by the crash.
In addition to being private, the virtual address space of a user-mode application is limited. A processor running in user mode cannot access virtual addresses that are reserved for the operating system. Limiting the virtual address space of a user-mode application prevents the application from altering, and possibly damaging, critical operating system data.
All code that runs in kernel mode shares a single virtual address space. This means that a kernel-mode driver is not isolated from other drivers and the operating system itself. If a kernel-mode driver accidentally writes to the wrong virtual address, data that belongs to the operating system or another driver could be compromised. If a kernel-mode driver crashes, the entire operating system crashes.
If you are a Windows user, go through this link and you will learn more:
Communication between user mode and kernel mode
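A small demonstration of the point that user-mode code cannot touch addresses reserved for the operating system: dereferencing an upper-half (kernel-space) address from an ordinary 64-bit program simply faults, and the process receives an access violation / SIGSEGV instead of the data. The particular address below is arbitrary.
#include <stdio.h>

int main(void) {
    volatile unsigned long *kernel_addr = (unsigned long *)0xffffffff81000000UL;
    printf("trying to read a kernel address from user mode...\n");
    unsigned long value = *kernel_addr;   /* faults here: the OS kills the process */
    printf("value = %lu\n", value);       /* never reached */
    return 0;
}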
I'm going to take a stab in the dark and guess you're talking about Windows. In a nutshell, kernel mode has full access to hardware, but user mode doesn't. For instance, many if not most device drivers are written in kernel mode because they need to control finer details of their hardware.
See also this wikibook.
Other answers have already explained the difference between user and kernel mode. If you really want to get into detail you should get a copy of Windows Internals, an excellent book written by Mark Russinovich and David Solomon describing the architecture and internal details of the various Windows operating systems.
What
Basically, the difference between kernel and user mode is not OS dependent; it is achieved purely by hardware design, which restricts certain instructions so that they can only run in kernel mode. Everything else, such as memory protection, is built on top of that restriction.
How
It means that at any time the processor is running in either kernel mode or user mode. Through certain mechanisms, the architecture guarantees that whenever it switches to kernel mode, it is OS code that gets fetched and run.
Why
With this hardware infrastructure, common OSes can achieve the following:
protecting user programs from accessing all of memory, so that a program cannot overwrite the OS, for example;
preventing user programs from executing sensitive instructions, such as those that change the CPU's memory-pointer bounds, so that a program cannot break out of its own memory bounds, for example.