Privileged instructions in Intel x86-64 [duplicate] - x86-64

According to this source (Level 3 - 5) specific CPU rings can not do certain things, such as ring 1, 2, 3 code can not set up GDT, as os kernel would crash.
While it is obvious that Ring 0 can execute all instructions, I am wondering which instructions can not be issued in rings 1, 2 and 3?
I could not find anything on either wikipedia or osdev and similar sources which would state what instructions can not be issued in specific ring.

The following instructions cannot be executed in Ring 3:
LGDT
LLDT
LTR
LIDT
MOV (to and from control registers only)
MOV (to and from debug registers only)
LMSW
CLTS
INVD
WBINVD
INVLPG
HLT
RDMSR
WRMSR
RDPMC
RDTSC

Related

What are Ring 0 and Ring 3 in the context of operating systems?

I've been learning basics about driver development in Windows I keep finding the terms Ring 0 and Ring 3. What do these refer to? Are they the same thing as kernel mode and user mode?
Linux x86 ring usage overview
Understanding how rings are used in Linux will give you a good idea of what they are designed for.
In x86 protected mode, the CPU is always in one of 4 rings. The Linux kernel only uses 0 and 3:
0 for kernel
3 for users
This is the most hard and fast definition of kernel vs userland.
Why Linux does not use rings 1 and 2: CPU Privilege Rings: Why rings 1 and 2 aren't used?
How is the current ring determined?
The current ring is selected by a combination of:
global descriptor table: a in-memory table of GDT entries, and each entry has a field Privl which encodes the ring.
The LGDT instruction sets the address to the current descriptor table.
See also: http://wiki.osdev.org/Global_Descriptor_Table
the segment registers CS, DS, etc., which point to the index of an entry in the GDT.
For example, CS = 0 means the first entry of the GDT is currently active for the executing code.
What can each ring do?
The CPU chip is physically built so that:
ring 0 can do anything
ring 3 cannot run several instructions and write to several registers, most notably:
cannot change its own ring! Otherwise, it could set itself to ring 0 and rings would be useless.
In other words, cannot modify the current segment descriptor, which determines the current ring.
cannot modify the page tables: How does x86 paging work?
In other words, cannot modify the CR3 register, and paging itself prevents modification of the page tables.
This prevents one process from seeing the memory of other processes for security / ease of programming reasons.
cannot register interrupt handlers. Those are configured by writing to memory locations, which is also prevented by paging.
Handlers run in ring 0, and would break the security model.
In other words, cannot use the LGDT and LIDT instructions.
cannot do IO instructions like in and out, and thus have arbitrary hardware accesses.
Otherwise, for example, file permissions would be useless if any program could directly read from disk.
More precisely thanks to Michael Petch: it is actually possible for the OS to allow IO instructions on ring 3, this is actually controlled by the Task state segment.
What is not possible is for ring 3 to give itself permission to do so if it didn't have it in the first place.
Linux always disallows it. See also: Why doesn't Linux use the hardware context switch via the TSS?
How do programs and operating systems transition between rings?
when the CPU is turned on, it starts running the initial program in ring 0 (well kind of, but it is a good approximation). You can think this initial program as being the kernel (but it is normally a bootloader that then calls the kernel still in ring 0).
when a userland process wants the kernel to do something for it like write to a file, it uses an instruction that generates an interrupt such as int 0x80 or syscall to signal the kernel. x86-64 Linux syscall hello world example:
.data
hello_world:
.ascii "hello world\n"
hello_world_len = . - hello_world
.text
.global _start
_start:
/* write */
mov $1, %rax
mov $1, %rdi
mov $hello_world, %rsi
mov $hello_world_len, %rdx
syscall
/* exit */
mov $60, %rax
mov $0, %rdi
syscall
compile and run:
as -o hello_world.o hello_world.S
ld -o hello_world.out hello_world.o
./hello_world.out
GitHub upstream.
When this happens, the CPU calls an interrupt callback handler which the kernel registered at boot time. Here is a concrete baremetal example that registers a handler and uses it.
This handler runs in ring 0, which decides if the kernel will allow this action, do the action, and restart the userland program in ring 3. x86_64
when the exec system call is used (or when the kernel will start /init), the kernel prepares the registers and memory of the new userland process, then it jumps to the entry point and switches the CPU to ring 3
If the program tries to do something naughty like write to a forbidden register or memory address (because of paging), the CPU also calls some kernel callback handler in ring 0.
But since the userland was naughty, the kernel might kill the process this time, or give it a warning with a signal.
When the kernel boots, it setups a hardware clock with some fixed frequency, which generates interrupts periodically.
This hardware clock generates interrupts that run ring 0, and allow it to schedule which userland processes to wake up.
This way, scheduling can happen even if the processes are not making any system calls.
What is the point of having multiple rings?
There are two major advantages of separating kernel and userland:
it is easier to make programs as you are more certain one won't interfere with the other. E.g., one userland process does not have to worry about overwriting the memory of another program because of paging, nor about putting hardware in an invalid state for another process.
it is more secure. E.g. file permissions and memory separation could prevent a hacking app from reading your bank data. This supposes, of course, that you trust the kernel.
How to play around with it?
I've created a bare metal setup that should be a good way to manipulate rings directly: https://github.com/cirosantilli/x86-bare-metal-examples
I didn't have the patience to make a userland example unfortunately, but I did go as far as paging setup, so userland should be feasible. I'd love to see a pull request.
Alternatively, Linux kernel modules run in ring 0, so you can use them to try out privileged operations, e.g. read the control registers: How to access the control registers cr0,cr2,cr3 from a program? Getting segmentation fault
Here is a convenient QEMU + Buildroot setup to try it out without killing your host.
The downside of kernel modules is that other kthreads are running and could interfere with your experiments. But in theory you can take over all interrupt handlers with your kernel module and own the system, that would be an interesting project actually.
Negative rings
While negative rings are not actually referenced in the Intel manual, there are actually CPU modes which have further capabilities than ring 0 itself, and so are a good fit for the "negative ring" name.
One example is the hypervisor mode used in virtualization.
For further details see:
https://security.stackexchange.com/questions/129098/what-is-protection-ring-1
https://security.stackexchange.com/questions/216527/ring-3-exploits-and-existence-of-other-rings
ARM
In ARM, the rings are called Exception Levels instead, but the main ideas remain the same.
There exist 4 exception levels in ARMv8, commonly used as:
EL0: userland
EL1: kernel ("supervisor" in ARM terminology).
Entered with the svc instruction (SuperVisor Call), previously known as swi before unified assembly, which is the instruction used to make Linux system calls. Hello world ARMv8 example:
hello.S
.text
.global _start
_start:
/* write */
mov x0, 1
ldr x1, =msg
ldr x2, =len
mov x8, 64
svc 0
/* exit */
mov x0, 0
mov x8, 93
svc 0
msg:
.ascii "hello syscall v8\n"
len = . - msg
GitHub upstream.
Test it out with QEMU on Ubuntu 16.04:
sudo apt-get install qemu-user gcc-arm-linux-gnueabihf
arm-linux-gnueabihf-as -o hello.o hello.S
arm-linux-gnueabihf-ld -o hello hello.o
qemu-arm hello
Here is a concrete baremetal example that registers an SVC handler and does an SVC call.
EL2: hypervisors, for example Xen.
Entered with the hvc instruction (HyperVisor Call).
A hypervisor is to an OS, what an OS is to userland.
For example, Xen allows you to run multiple OSes such as Linux or Windows on the same system at the same time, and it isolates the OSes from one another for security and ease of debug, just like Linux does for userland programs.
Hypervisors are a key part of today's cloud infrastructure: they allow multiple servers to run on a single hardware, keeping hardware usage always close to 100% and saving a lot of money.
AWS for example used Xen until 2017 when its move to KVM made the news.
EL3: yet another level. TODO example.
Entered with the smc instruction (Secure Mode Call)
The ARMv8 Architecture Reference Model DDI 0487C.a - Chapter D1 - The AArch64 System Level Programmer's Model - Figure D1-1 illustrates this beautifully:
The ARM situation changed a bit with the advent of ARMv8.1 Virtualization Host Extensions (VHE). This extension allows the kernel to run in EL2 efficiently:
VHE was created because in-Linux-kernel virtualization solutions such as KVM have gained ground over Xen (see e.g. AWS' move to KVM mentioned above), because most clients only need Linux VMs, and as you can imagine, being all in a single project, KVM is simpler and potentially more efficient than Xen. So now the host Linux kernel acts as the hypervisor in those cases.
From the image we can see that when the bit E2H of register HCR_EL2 equals 1, then VHE is enabled, and:
the Linux kernel runs in EL2 instead of EL1
when HCR_EL2.TGE == 1, we are a regular host userland program. Using sudo can destroy the host as usual.
when HCR_EL2.TGE == 0 we are a guest OS (e.g. when you run an Ubuntu OS inside QEMU KVM inside the host Ubuntu. Doing sudo cannot destroy the host unless there's a QEMU/host kernel bug.
Note how ARM, maybe due to the benefit of hindsight, has a better naming convention for the privilege levels than x86, without the need for negative levels: 0 being the lower and 3 highest. Higher levels tend to be created more often than lower ones.
The current EL can be queried with the MRS instruction: what is the current execution mode/exception level, etc?
ARM does not require all exception levels to be present to allow for implementations that don't need the feature to save chip area. ARMv8 "Exception levels" says:
An implementation might not include all of the Exception levels. All implementations must include EL0 and EL1.
EL2 and EL3 are optional.
QEMU for example defaults to EL1, but EL2 and EL3 can be enabled with command line options: qemu-system-aarch64 entering el1 when emulating a53 power up
Code snippets tested on Ubuntu 18.10.
Intel processors (x86 and others) allow applications limited powers. To restrict (protect) critical resources like IO, memory, ports etc, CPU in liaison with the OS (Windows in this case) provides privilege levels (0 being most privilege to 3 being least) that map to kernel mode and user mode respectively.
So, the OS runs kernel code in ring 0 - highest privilege level (of 0) provided by the CPU - and user code in ring 3.
For more details, see http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection/

Why Does the BIOS INT 0x19 Load Bootloader at "0x7C00"?

As we know the BIOS Interrupt (INT) 0x19 which searches for a boot signature (0xAA55). Loads and executes our bootloader at 0x7C00 if it found.
My Question : Why 0x7C00? What is the reason ? How to evaluate it through some methods?
Maybe because the MBR is loaded into the memory (by the BIOS) into the 0x7c00 address then int 0x19 searches for the MBR sector signature 0xAA55 on sector 0x7c00
about 0xAA55:
Not a checksum, but more of a signature. It does provide some simple
evidence that some MBR is present.
0xAA55 is also an alternating bit pattern: 1010101001010101
It's often used to help determine if you are on a little-endian or
big-endian system, because it will read as either AA55 or 55AA. I
suspect that is part of why it is put on the end of the MBR.
about 0x7c00:
Check this website out (this might help u in finding the answer): https://www.glamenv-septzen.net/en/view/6
This probably dead but I'm going to answer.
At the start of any bootloader when you set the origin of the segment to 0x7c00 then the registers jump address to that as well. So ideally if you check out some online resources that tell you how to use the int 0x19 command they will guide you on how to jump to another address.
To fix this you would ideally, reset the stack to 0 at the start of every jump to an new address.
( It seems like this duplicates following questions:
What is significance of memory at 0000:7c00 to booting sequence?
Does the BIOS copy the 512-byte bootloader to 0x7c00 )
Inspired by an answer to the former, I would quote two sources:
[Notes] Why is the MBR loaded to 0x7C00 on x86? (Full Edition)
So, first answer the question about the 16KB model from David Bradley's reply:
It had to boot on a 32KB machine. DOS 1.0 required a minimum of 32KB,
so we weren't concerned about attempting a boot in 16KB.
To execute DOS 1.0, at least 32KB is required, so the 16KB model has
not been considered.
Followed by the answer to "Why 32KB-1024B?":
We wanted to leave as much room as possible for the OS to load itself
within the 32KB. The 808x Intel architecture used up the first portion
of the memory range for software interrupts, and the BIOS data area
was after it. So we put the bootstrap load at 0x7C00 (32KB-1KB) to
leave all the room in between for the OS to load. The boot sector was
512 bytes, and when it executes it'll need some room for data and a
stack, so that's the other 512 bytes . So the memory map looks like
this after INT 19H executes:
Why BIOS loads MBR into 0x7C00 in x86 ?
No, that case was out of consideration. One of IBM PC 5150 ROM BIOS
Developer Team Members, Dr. David Bradley says:
"DOS 1.0 required a minimum of 32KB, so we weren't concerned about
attempting a boot in 16KB."
(Note: DOS 1.0 required 16KiB minimum ? or 32KiB ? I couldn't find out
which correct. But, at least, in 1981's early BIOS development, they
supposed that 32KiB is DOS minimum requirements.)
BIOS developer team decided 0x7C00 because:
They wanted to leave as much room as possible for the OS to load
itself within the 32KiB.
8086/8088 used 0x0 - 0x3FF for interrupts vector, and BIOS data area
was after it.
The boot sector was 512 bytes, and stack/data area for boot program
needed more 512 bytes.
So, 0x7C00, the last 1024B of 32KiB was chosen.
Once OS loaded and started, boot sector is never used until power reset.
So, OS and application can use the last 1024B of 32KiB freely.
I hope this answer is based enough to be sure why / how it happened so.

entering ring 0 from user mode

Most modern operating systems run in the protected mode. Now is it possible for the user programs to enter the "ring 0" by directly setting the corresponding bits in some control registers. Or does it have to go through some syscall.
I believe to access the hardware we need to go through the operating system. But if we know the address of the hardware device can we just write some assembly language code with reference to the location of the device and access it. What happens when we give the address of some hardware device in the assembly language code.
Thanks.
To enter Ring 0, you must perform a system call, and by its nature, the system controls where you go, because for the call you simply give an index to the CPU, and the CPU looks inside a table to know what to call. You can't really get around the security aspect (obviously) to do something else, but maybe this link will help.
You can ask the operating system to map the memory of the hardware device into the memory space of your program. Once that's done, you can just read and write that memory from ring 3. Whether that's possible to do, or how to do that, depends on the operating system or the device.
; set PE bit
mov cr0, eax
or eax, 1
mov eax, cr0
; far jump (cs = selector of code segment)
jmp cs:#pm
#pm:
; Now we are in PM
Taken from Wikipedia.
Basic idea is to set (to 1) 0th bit in cr0 control register.
But if you are already in protected mode (i.e. you are in windows/linux), security restricts you to do it (you are in ring 3 - lowest trust).
So be the first one to get into protected mode.

What is the difference between user and kernel modes in operating systems?

What are the differences between User Mode and Kernel Mode, why and how do you activate either of them, and what are their use cases?
Kernel Mode
In Kernel mode, the executing code has complete and unrestricted
access to the underlying hardware. It
can execute any CPU instruction and
reference any memory address. Kernel
mode is generally reserved for the
lowest-level, most trusted functions
of the operating system. Crashes in
kernel mode are catastrophic; they
will halt the entire PC.
User Mode
In User mode, the executing code has no ability to directly access
hardware or reference memory. Code
running in user mode must delegate to
system APIs to access hardware or
memory. Due to the protection afforded
by this sort of isolation, crashes in
user mode are always recoverable. Most
of the code running on your computer
will execute in user mode.
Read more
Understanding User and Kernel Mode
These are two different modes in which your computer can operate. Prior to this, when computers were like a big room, if something crashes – it halts the whole computer. So computer architects decide to change it. Modern microprocessors implement in hardware at least 2 different states.
User mode:
mode where all user programs execute. It does not have access to RAM
and hardware. The reason for this is because if all programs ran in
kernel mode, they would be able to overwrite each other’s memory. If
it needs to access any of these features – it makes a call to the
underlying API. Each process started by windows except of system
process runs in user mode.
Kernel mode:
mode where all kernel programs execute (different drivers). It has
access to every resource and underlying hardware. Any CPU instruction
can be executed and every memory address can be accessed. This mode
is reserved for drivers which operate on the lowest level
How the switch occurs.
The switch from user mode to kernel mode is not done automatically by CPU. CPU is interrupted by interrupts (timers, keyboard, I/O). When interrupt occurs, CPU stops executing the current running program, switch to kernel mode, executes interrupt handler. This handler saves the state of CPU, performs its operations, restore the state and returns to user mode.
http://en.wikibooks.org/wiki/Windows_Programming/User_Mode_vs_Kernel_Mode
http://tldp.org/HOWTO/KernelAnalysis-HOWTO-3.html
http://en.wikipedia.org/wiki/Direct_memory_access
http://en.wikipedia.org/wiki/Interrupt_request
CPU rings are the most clear distinction
In x86 protected mode, the CPU is always in one of 4 rings. The Linux kernel only uses 0 and 3:
0 for kernel
3 for users
This is the most hard and fast definition of kernel vs userland.
Why Linux does not use rings 1 and 2: CPU Privilege Rings: Why rings 1 and 2 aren't used?
How is the current ring determined?
The current ring is selected by a combination of:
global descriptor table: a in-memory table of GDT entries, and each entry has a field Privl which encodes the ring.
The LGDT instruction sets the address to the current descriptor table.
See also: http://wiki.osdev.org/Global_Descriptor_Table
the segment registers CS, DS, etc., which point to the index of an entry in the GDT.
For example, CS = 0 means the first entry of the GDT is currently active for the executing code.
What can each ring do?
The CPU chip is physically built so that:
ring 0 can do anything
ring 3 cannot run several instructions and write to several registers, most notably:
cannot change its own ring! Otherwise, it could set itself to ring 0 and rings would be useless.
In other words, cannot modify the current segment descriptor, which determines the current ring.
cannot modify the page tables: How does x86 paging work?
In other words, cannot modify the CR3 register, and paging itself prevents modification of the page tables.
This prevents one process from seeing the memory of other processes for security / ease of programming reasons.
cannot register interrupt handlers. Those are configured by writing to memory locations, which is also prevented by paging.
Handlers run in ring 0, and would break the security model.
In other words, cannot use the LGDT and LIDT instructions.
cannot do IO instructions like in and out, and thus have arbitrary hardware accesses.
Otherwise, for example, file permissions would be useless if any program could directly read from disk.
More precisely thanks to Michael Petch: it is actually possible for the OS to allow IO instructions on ring 3, this is actually controlled by the Task state segment.
What is not possible is for ring 3 to give itself permission to do so if it didn't have it in the first place.
Linux always disallows it. See also: Why doesn't Linux use the hardware context switch via the TSS?
How do programs and operating systems transition between rings?
when the CPU is turned on, it starts running the initial program in ring 0 (well kind of, but it is a good approximation). You can think this initial program as being the kernel (but it is normally a bootloader that then calls the kernel still in ring 0).
when a userland process wants the kernel to do something for it like write to a file, it uses an instruction that generates an interrupt such as int 0x80 or syscall to signal the kernel. x86-64 Linux syscall hello world example:
.data
hello_world:
.ascii "hello world\n"
hello_world_len = . - hello_world
.text
.global _start
_start:
/* write */
mov $1, %rax
mov $1, %rdi
mov $hello_world, %rsi
mov $hello_world_len, %rdx
syscall
/* exit */
mov $60, %rax
mov $0, %rdi
syscall
compile and run:
as -o hello_world.o hello_world.S
ld -o hello_world.out hello_world.o
./hello_world.out
GitHub upstream.
When this happens, the CPU calls an interrupt callback handler which the kernel registered at boot time. Here is a concrete baremetal example that registers a handler and uses it.
This handler runs in ring 0, which decides if the kernel will allow this action, do the action, and restart the userland program in ring 3. x86_64
when the exec system call is used (or when the kernel will start /init), the kernel prepares the registers and memory of the new userland process, then it jumps to the entry point and switches the CPU to ring 3
If the program tries to do something naughty like write to a forbidden register or memory address (because of paging), the CPU also calls some kernel callback handler in ring 0.
But since the userland was naughty, the kernel might kill the process this time, or give it a warning with a signal.
When the kernel boots, it setups a hardware clock with some fixed frequency, which generates interrupts periodically.
This hardware clock generates interrupts that run ring 0, and allow it to schedule which userland processes to wake up.
This way, scheduling can happen even if the processes are not making any system calls.
What is the point of having multiple rings?
There are two major advantages of separating kernel and userland:
it is easier to make programs as you are more certain one won't interfere with the other. E.g., one userland process does not have to worry about overwriting the memory of another program because of paging, nor about putting hardware in an invalid state for another process.
it is more secure. E.g. file permissions and memory separation could prevent a hacking app from reading your bank data. This supposes, of course, that you trust the kernel.
How to play around with it?
I've created a bare metal setup that should be a good way to manipulate rings directly: https://github.com/cirosantilli/x86-bare-metal-examples
I didn't have the patience to make a userland example unfortunately, but I did go as far as paging setup, so userland should be feasible. I'd love to see a pull request.
Alternatively, Linux kernel modules run in ring 0, so you can use them to try out privileged operations, e.g. read the control registers: How to access the control registers cr0,cr2,cr3 from a program? Getting segmentation fault
Here is a convenient QEMU + Buildroot setup to try it out without killing your host.
The downside of kernel modules is that other kthreads are running and could interfere with your experiments. But in theory you can take over all interrupt handlers with your kernel module and own the system, that would be an interesting project actually.
Negative rings
While negative rings are not actually referenced in the Intel manual, there are actually CPU modes which have further capabilities than ring 0 itself, and so are a good fit for the "negative ring" name.
One example is the hypervisor mode used in virtualization.
For further details see:
https://security.stackexchange.com/questions/129098/what-is-protection-ring-1
https://security.stackexchange.com/questions/216527/ring-3-exploits-and-existence-of-other-rings
ARM
In ARM, the rings are called Exception Levels instead, but the main ideas remain the same.
There exist 4 exception levels in ARMv8, commonly used as:
EL0: userland
EL1: kernel ("supervisor" in ARM terminology).
Entered with the svc instruction (SuperVisor Call), previously known as swi before unified assembly, which is the instruction used to make Linux system calls. Hello world ARMv8 example:
hello.S
.text
.global _start
_start:
/* write */
mov x0, 1
ldr x1, =msg
ldr x2, =len
mov x8, 64
svc 0
/* exit */
mov x0, 0
mov x8, 93
svc 0
msg:
.ascii "hello syscall v8\n"
len = . - msg
GitHub upstream.
Test it out with QEMU on Ubuntu 16.04:
sudo apt-get install qemu-user gcc-arm-linux-gnueabihf
arm-linux-gnueabihf-as -o hello.o hello.S
arm-linux-gnueabihf-ld -o hello hello.o
qemu-arm hello
Here is a concrete baremetal example that registers an SVC handler and does an SVC call.
EL2: hypervisors, for example Xen.
Entered with the hvc instruction (HyperVisor Call).
A hypervisor is to an OS, what an OS is to userland.
For example, Xen allows you to run multiple OSes such as Linux or Windows on the same system at the same time, and it isolates the OSes from one another for security and ease of debug, just like Linux does for userland programs.
Hypervisors are a key part of today's cloud infrastructure: they allow multiple servers to run on a single hardware, keeping hardware usage always close to 100% and saving a lot of money.
AWS for example used Xen until 2017 when its move to KVM made the news.
EL3: yet another level. TODO example.
Entered with the smc instruction (Secure Mode Call)
The ARMv8 Architecture Reference Model DDI 0487C.a - Chapter D1 - The AArch64 System Level Programmer's Model - Figure D1-1 illustrates this beautifully:
The ARM situation changed a bit with the advent of ARMv8.1 Virtualization Host Extensions (VHE). This extension allows the kernel to run in EL2 efficiently:
VHE was created because in-Linux-kernel virtualization solutions such as KVM have gained ground over Xen (see e.g. AWS' move to KVM mentioned above), because most clients only need Linux VMs, and as you can imagine, being all in a single project, KVM is simpler and potentially more efficient than Xen. So now the host Linux kernel acts as the hypervisor in those cases.
Note how ARM, maybe due to the benefit of hindsight, has a better naming convention for the privilege levels than x86, without the need for negative levels: 0 being the lower and 3 highest. Higher levels tend to be created more often than lower ones.
The current EL can be queried with the MRS instruction: what is the current execution mode/exception level, etc?
ARM does not require all exception levels to be present to allow for implementations that don't need the feature to save chip area. ARMv8 "Exception levels" says:
An implementation might not include all of the Exception levels. All implementations must include EL0 and EL1.
EL2 and EL3 are optional.
QEMU for example defaults to EL1, but EL2 and EL3 can be enabled with command line options: qemu-system-aarch64 entering el1 when emulating a53 power up
Code snippets tested on Ubuntu 18.10.
A processor in a computer running Windows has two different modes: user mode and kernel mode. The processor switches between the two modes depending on what type of code is running on the processor. Applications run in user mode, and core operating system components run in kernel mode. While many drivers run in kernel mode, some drivers may run in user mode.
When you start a user-mode application, Windows creates a process for the application. The process provides the application with a private virtual address space and a private handle table. Because an application's virtual address space is private, one application cannot alter data that belongs to another application. Each application runs in isolation, and if an application crashes, the crash is limited to that one application. Other applications and the operating system are not affected by the crash.
In addition to being private, the virtual address space of a user-mode application is limited. A processor running in user mode cannot access virtual addresses that are reserved for the operating system. Limiting the virtual address space of a user-mode application prevents the application from altering, and possibly damaging, critical operating system data.
All code that runs in kernel mode shares a single virtual address space. This means that a kernel-mode driver is not isolated from other drivers and the operating system itself. If a kernel-mode driver accidentally writes to the wrong virtual address, data that belongs to the operating system or another driver could be compromised. If a kernel-mode driver crashes, the entire operating system crashes.
If you are a Windows user once go through this link you will get more.
Communication between user mode and kernel mode
I'm going to take a stab in the dark and guess you're talking about Windows. In a nutshell, kernel mode has full access to hardware, but user mode doesn't. For instance, many if not most device drivers are written in kernel mode because they need to control finer details of their hardware.
See also this wikibook.
Other answers already explained the difference between user and kernel mode. If you really want to get into detail you should get a copy of
Windows Internals, an excellent book written by Mark Russinovich and David Solomon describing the architecture and inside details of the various Windows operating systems.
What
Basically the difference between kernel and user modes is not OS dependent and is achieved only by restricting some instructions to be run only in kernel mode by means of hardware design. All other purposes like memory protection can be done only by that restriction.
How
It means that the processor lives in either the kernel mode or in the user mode. Using some mechanisms the architecture can guarantee that whenever it is switched to the kernel mode the OS code is fetched to be run.
Why
Having this hardware infrastructure these could be achieved in common OSes:
Protecting user programs from accessing whole the memory, to not let programs overwrite the OS for example,
preventing user programs from performing sensitive instructions such as those that change CPU memory pointer bounds, to not let programs break their memory bounds for example.

What is INT 21h?

Inspired by this question
How can I force GDB to disassemble?
I wondered about the INT 21h as a concept. Now, I have some very rusty knowledge of the internals, but not so many details. I remember that in C64 you had regular Interrupts and Non Maskable Interrupts, but my knowledge stops here. Could you please give me some clue ? Is it a DOS related strategy ?
From here:
A multipurpose DOS interrupt used for various functions including reading the keyboard and writing to the console and printer. It was also used to read and write disks using the earlier File Control Block (FCB) method.
DOS can be thought of as a library used to provide a files/directories abstraction for the PC (-and a bit more). int 21h is a simple hardware "trick" that makes it easy to call code from this library without knowing in advance where it will be located in memory. Alternatively, you can think of this as the way to utilise the DOS API.
Now, the topic of software interrupts is a complex one, partly because the concepts evolved over time as Intel added features to the x86 family, while trying to remain compatible with old software. A proper explanation would take a few pages, but I'll try to be brief.
The main question is whether you are in real mode or protected mode.
Real mode is the simple, "original" mode of operation for the x86 processor. This is the mode that DOS runs in (when you run DOS programs under Windows, a real mode processor is virtualised, so within it the same rules apply). The currently running program has full control over the processor.
In real mode, there is a vector table that tells the processor which address to jump to for every interrupt from 0 to 255. This table is populated by the BIOS and DOS, as well as device drivers, and sometimes programs with special needs. Some of these interrupts can be generated by hardware (e.g. by a keypress). Others are generated by certain software conditions (e.g. divide by 0). Any of them can be generated by executing the int n instruction.
Programs can set/clear the "enable interrupts" flag; this flag affects hardware interrupts only and does not affect int instructions.
The DOS designers chose to use interrupt number 21h to handle DOS requests - the number is of no real significance: it was just an unused entry at the time. There are many others (number 10h is a BIOS-installed interrupt routine that deals with graphics, for instance). Also note that all this is for IBM PC compatibles only. x86 processors in say embedded systems may have their software and interrupt tables arranged quite differently!
Protected mode is the complex, "security-aware" mode that was introduced in the 286 processor and much extended on the 386. It provides multiple privilege levels. The OS must configure all of this (and if the OS gets it wrong, you have a potential security exploit). User programs are generally confined to a "minimal privilege" mode of operation, where trying to access hardware ports, or changing the interrupt flag, or accessing certain memory regions, halts the program and allows the OS to decide what to do (be it terminate the program or give the program what it seems to want).
Interrupt handling is made more complex. Suffice to say that generally, if a user program does a software interrupt, the interrupt number is not used as a vector into the interrupt table. Rather a general protection exception is generated and the OS handler for said exception may (if the OS is design this way) work out what the process wants and service the request. I'm pretty sure Linux and Windows have in the past (if not currently) used this sort of mechanism for their system calls. But there are other ways to achieve this, such as the SYSENTER instruction.
Ralph Brown's interrupt list contains a lot of information on which interrupt does what. int 21, like all others, supports a wide range of functionality depending on register values.
A non-HTML version of Ralph Brown's list is also available.
The INT instruction is a software interrupt. It causes a jump to a routine pointed to by an interrupt vector, which is a fixed location in memory. The advantage of the INT instruction is that is only 2 bytes long, as oposed to maybe 6 for a JMP, and that it can easily be re-directed by modifying the contents of the interrupt vector.
Int 0x21 is an x86 software interrupt - basically that means there is an interrupt table at a fixed point in memory listing the addresses of software interrupt functions. When an x86 CPU receives the interrupt opcode (or otherwise decides that a particular software interrupt should be executed), it references that table to execute a call to that point (the function at that point must use iret instead of ret to return).
It is possible to remap Int 0x21 and other software interrupts (even inside DOS though this can have negative side effects). One interesting software interrupt to map or chain is Int 0x1C (or 0x08 if you are careful), which is the system tick interrupt, called 18.2 times every second. This can be used to create "background" processes, even in single threaded real mode (the real mode process will be interrupted 18.2 times a second to call your interrupt function).
On the DOS operating system (or a system that is providing some DOS emulation, such as Windows console) Int 0x21 is mapped to what is effectively the DOS operating systems main "API". By providing different values to the AH register, different DOS functions can be executed such as opening a file (AH=0x3D) or printing to the screen (AH=0x09).
This is from the great The Art of Assembly Language Programming about interrupts:
On the 80x86, there are three types of events commonly known as
interrupts: traps, exceptions, and interrupts (hardware interrupts).
This chapter will describe each of these forms and discuss their
support on the 80x86 CPUs and PC compatible machines.
Although the terms trap and exception are often used synonymously, we
will use the term trap to denote a programmer initiated and expected
transfer of control to a special handler routine. In many respects, a
trap is nothing more than a specialized subroutine call. Many texts
refer to traps as software interrupts. The 80x86 int instruction is
the main vehicle for executing a trap. Note that traps are usually
unconditional; that is, when you execute an int instruction, control
always transfers to the procedure associated with the trap. Since
traps execute via an explicit instruction, it is easy to determine
exactly which instructions in a program will invoke a trap handling
routine.
Chapter 17 - Interrupt Structure and Interrupt Service Routines
(Almost) the whole DOS interface was made available as INT21h commands, with parameters in the various registers. It's a little trick, using a built-in-hardware table to jump to the right code. Also INT 33h was for the mouse.
It's a "software interrupt"; so not a hardware interrupt at all.
When an application invokes a software interrupt, that's essentially the same as its making a subroutine call, except that (unlike a subroutine call) the doesn't need to know the exact memory address of the code it's invoking.
System software (e.g. DOS and the BIOS) expose their APIs to the application as software interrupts.
The software interrupt is therefore a kind of dynamic-linking.
Actually, there are a lot of concepts here. Let's start with the basics.
An interrupt is a mean to request attention from the CPU, to interrupt the current program flow, jump to an interrupt handler (ISR - Interrupt Service Routine), do some work (usually by the OS kernel or a device driver) and then return.
What are some typical uses for interrupts?
Hardware interrupts: A device requests attention from the CPU by issuing an interrupt request.
CPU Exceptions: If some abnormal CPU condition happens, such as a division by zero, a page fault, ... the CPU jumps to the corresponding interrupt handler so the OS can do whatever it has to do (send a signal to a process, load a page from swap and update the TLB/page table, ...).
Software interrupts: Since an interrupt ends up calling the OS kernel, a simple way to implement system calls is to use interrupts. But you don't need to, in x86 you could use a call instruction to some structure (some kind of TSS IIRC), and on newer x86 there are SYSCALL / SYSENTER intructions.
CPUs decide where to jump to looking at a table (exception vectors, interrupt vectors, IVT in x86 real mode, IDT in x86 protected mode, ...). Some CPUs have a single vector for hardware interrupts, another one for exceptions and so on, and the ISR has to do some work to identify the originator of the interrupt. Others have lots of vectors, and jump directly to very specific ISRs.
x86 has 256 interrupt vectors. On original PCs, these were divided into several groups:
00-04 CPU exceptions, including NMI. With later CPUs (80186, 286, ...), this range expanded, overlapping with the following ranges.
08-0F These are hardware interrupts, usually referred as IRQ0-7. The PC-AT added IRQ8-15
10-1F BIOS calls. Conceptually, these can be considered system calls, since the BIOS is the part of DOS that depends on the concrete machine (that's how it was defined in CP/M).
20-2F DOS calls. Some of these are multiplexed, and offer multitude of functions. The main one is INT 21h, which offers most of DOS services.
30-FF The rest, for use by external drivers and user programs.