Can SIPI be sent from a BSP running in long mode? - operating-system

Currently I have a multi-processing operating system running in x86 protected mode, and I want to make it run in x86-64 long mode. Its current logic to wake up APs is to send INIT-SIPI-SIPI:
// BSP has already entered protected mode and set up page tables
volatile uint32_t *icr = (volatile uint32_t *)0xfee00300; // local APIC ICR (low dword)
*icr = 0x000c4500ul;           // send INIT (all-excluding-self shorthand)
delay_us(10000);               // delay for 10 ms
while (*icr & 0x1000);         // wait until Send Pending bit is clear
for (int i = 0; i < 2; i++) {
    *icr = 0x000c4610ul;       // send SIPI with vector 0x10
    delay_us(200);             // delay for 200 us
    while (*icr & 0x1000);     // wait until Send Pending bit is clear
}
This program works well in 32-bit protected mode.
However, after I modified the operating system to run in 64-bit long mode, the logic breaks when sending SIPI. In QEMU, immediately after executing the send SIPI line, the BSP is reset (program counter goes to 0xfff0).
In Intel's software developer's manual volume 3, section 8.4.4.1 (Typical BSP Initialization Sequence), it says that the BSP "Switches to protected mode". Does this process apply to long mode? How should I debug this problem?
Here is some debug information, in case it's helpful:
CPU registers just before the instruction that sends the SIPI (movl $0xc4610,(%rax)) in 64-bit long mode:
rax 0xfee00300 4276093696
rbx 0x40 64
rcx 0x0 0
rdx 0x61 97
rsi 0x61 97
rdi 0x0 0
rbp 0x1996ff78 0x1996ff78
rsp 0x1996ff38 0x1996ff38
r8 0x1996ff28 429326120
r9 0x2 2
r10 0x0 0
r11 0x0 0
r12 0x0 0
r13 0x0 0
r14 0x0 0
r15 0x0 0
rip 0x1020d615 0x1020d615
eflags 0x97 [ IOPL=0 SF AF PF CF ]
cs 0x10 16
ss 0x18 24
ds 0x18 24
es 0x18 24
fs 0x18 24
gs 0x18 24
fs_base 0x0 0
gs_base 0x0 0
k_gs_base 0x0 0
cr0 0x80000011 [ PG ET PE ]
cr2 0x0 0
cr3 0x19948000 [ PDBR=12 PCID=0 ]
cr4 0x20 [ PAE ]
cr8 0x0 0
efer 0x500 [ LMA LME ]
mxcsr 0x1f80 [ IM DM ZM OM UM PM ]
CPU registers just before the instruction that sends the SIPI (movl $0xc4610,(%eax)) in 32-bit protected mode:
rax 0xfee00300 4276093696
rbx 0x40000 262144
rcx 0x0 0
rdx 0x61 97
rsi 0x2 2
rdi 0x102110eb 270602475
rbp 0x19968f4c 0x19968f4c
rsp 0x19968f04 0x19968f04
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
r13 0x0 0
r14 0x0 0
r15 0x0 0
rip 0x1020d075 0x1020d075
eflags 0x97 [ IOPL=0 SF AF PF CF ]
cs 0x8 8
ss 0x10 16
ds 0x10 16
es 0x10 16
fs 0x10 16
gs 0x10 16
fs_base 0x0 0
gs_base 0x0 0
k_gs_base 0x0 0
cr0 0x80000015 [ PG ET EM PE ]
cr2 0x0 0
cr3 0x19942000 [ PDBR=12 PCID=0 ]
cr4 0x30 [ PAE PSE ]
cr8 0x0 0
efer 0x0 [ ]
mxcsr 0x1f80 [ IM DM ZM OM UM PM ]

Can SIPI be sent from a BSP running in long mode?
Yes. The only thing that matters is that you write the right values to the right local APIC registers (with the right delays, sort of - see my method at the end).
However, after I modified the operating system to run in 64-bit long mode, the logic breaks when sending SIPI. In QEMU, immediately after executing the send SIPI line, the BSP is reset (program counter goes to 0xfff0).
I'd assume that either:
a) there's a bug and the address of the local APIC's registers isn't right, causing a triple fault when you attempt to write to the local APIC's register. Don't forget that long mode must use paging, and even though 0xFEE00300 is likely to be the correct physical address it can be the wrong virtual address (unless you took care of that by identity mapping that specific page when porting the OS to long mode - see the sketch below).
b) the data isn't right for some hard-to-imagine reason, causing the SIPI to restart the BSP.
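For point a), here is a minimal sketch of making sure the local APIC page is mapped before touching it in long mode. The map_page() helper and the flag macros are assumptions - placeholders for whatever your paging code already provides - not part of the question's code:
#include <stdint.h>

#define LAPIC_PHYS   0xFEE00000ull              /* default local APIC base (physical) */
#define PTE_PRESENT  (1ull << 0)
#define PTE_WRITE    (1ull << 1)
#define PTE_PCD      (1ull << 4)                /* page-level cache disable */

/* map_page() stands in for whatever your paging code provides; assumed to
 * map one 4 KiB page virt -> phys with the given flags. */
extern void map_page(uint64_t virt, uint64_t phys, uint64_t flags);

static volatile uint32_t *lapic_reg(uint32_t offset)
{
    /* Identity mapping assumed: virtual address == physical address. */
    return (volatile uint32_t *)(uintptr_t)(LAPIC_PHYS + offset);
}

void lapic_map(void)
{
    /* Make sure the long-mode page tables actually cover 0xFEE00xxx,
     * ideally uncacheable, before any ICR write is attempted. */
    map_page(LAPIC_PHYS, LAPIC_PHYS, PTE_PRESENT | PTE_WRITE | PTE_PCD);
}
With that in place, *lapic_reg(0x300) = 0x000c4500; is the same INIT write as before, just through a mapping that is known to exist.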
In Intel's software developer's manual volume 3, section 8.4.4.1 (Typical BSP Initialization Sequence), it says that BSP should "Switches to protected mode". Does this process apply to long mode?
Intel's "Typical BSP Initialization Sequence" is just one possible example that's only intended for firmware developers. Note that "intended for firmware developers" means that it should not be used by any OS.
The main problem with Intel's example is that it broadcasts the INIT-SIPI-SIPI sequence to all other CPUs (possibly including CPUs that the firmware disabled because they're faulty, and possibly including CPUs that the firmware disabled for other reasons - e.g. because the user disabled hyper-threading); and fails to detect "CPU exists but failed to start for some reason" (which an OS should report to user).
The other problem is that typically an OS will want to pre-allocate a stack for each AP before starting it (and store "address you should use for stack" somewhere before starting an AP), and you can't give each AP its own stack like that if you're starting an unknown number of CPUs at the same time.
Essentially: the firmware uses (something like) the example Intel described, then builds information in the ACPI MADT table (and/or a "MultiProcessor Specification" table for very old computers - that one is obsolete now) for the OS to use. The OS uses the information from the firmware's table/s to find the physical address of the local APIC in a correct (vendor- and platform-neutral) way, to find only the CPUs that the firmware says are valid, and to determine whether those CPUs are using "local APIC" or "x2APIC" (which supports more than 256 APIC IDs and is necessary if there's a huge number of CPUs). Then it starts only the valid CPUs, one at a time, using a time-out so that "CPU #123, which I have proof exists, has failed to start" can be reported to the user and/or logged.
I should also point out that Intel's example has existed in Intel's manuals mostly unchanged for about 25 years (since before long mode was introduced).
My Method
The delays in Intel's algorithm are annoying, and often a CPU will start on the first SIPI, and sometimes the second SIPI will cause the same CPU to be started twice (causing problems if you have any kind of "started_CPUs++;" in the AP startup code).
To fix these problems (and improve performance) the AP startup code can set an "I started" flag; and instead of having a "delay_us(200);" after sending the first SIPI, the BSP can monitor the "I started" flag with a time-out, and skip the second SIPI (and the remainder of the time-out) if the AP already started. In this case the time-out between SIPIs can be longer (e.g. 500 us is fine) and, more importantly, needn't be so precise; and the same "wait for flag with time-out" code can be re-used after sending the second SIPI (if the second SIPI needed to be sent), with a much longer time-out.
This alone doesn't completely solve the "CPU started twice" problem; and it doesn't solve the "AP started after the second SIPI, but only after the time-out expired, so now there are 2 APs running and the OS only knows about one" problem. These problems are fixed with extra synchronization - specifically, the AP sets the "I started" flag and then waits for the BSP to set a "you can continue if your APIC ID is ..." value (and if the AP detects that the APIC ID value isn't its own, it can do a "CLI then HLT" loop to shut itself down).
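A rough sketch of that handshake from the BSP's side - the helper functions (send_init, send_sipi, wait_for_flag, delay_us) and flag names are made up here, not taken from the question's code:
#include <stdbool.h>
#include <stdint.h>

/* Shared with the AP trampoline (must live in memory the AP can reach). */
volatile bool     ap_started      = false;       /* set by the AP as soon as it runs   */
volatile uint32_t allowed_apic_id = 0xFFFFFFFF;  /* set by the BSP after the handshake */

/* These helpers are assumed to exist elsewhere in the OS. */
extern void delay_us(uint32_t us);
extern void send_init(uint32_t apic_id);
extern void send_sipi(uint32_t apic_id, uint8_t vector);
extern bool wait_for_flag(volatile bool *flag, uint32_t timeout_us);

/* Returns true if the AP with this APIC ID started within the time-outs. */
bool start_ap(uint32_t apic_id, uint8_t vector)
{
    ap_started = false;

    send_init(apic_id);
    delay_us(10000);                             /* 10 ms after INIT */

    send_sipi(apic_id, vector);
    if (!wait_for_flag(&ap_started, 500))        /* imprecise ~500 us is fine */
        send_sipi(apic_id, vector);              /* second SIPI only if needed */

    if (!wait_for_flag(&ap_started, 100000))     /* much longer final time-out */
        return false;                            /* "CPU exists but failed to start" */

    allowed_apic_id = apic_id;                   /* AP spins until it sees its own ID,
                                                    otherwise it does CLI; HLT */
    return true;
}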
Finally; if you do the whole "INIT-SIPI-SIPI" sequence one CPU at a time, then it can be slow if there's lots of CPUs (e.g. at least a whole second for 100 CPUs due to the 10 ms delay after sending INIT). This can be significantly reduced by using 2 different methods:
a) starting CPUs in parallel. For best case; BSP can start 1 AP, then BSP+AP can start 2 more APs, then BSP+3 APs can start 4 more APs, etc. This means 128 CPUs can be started in slightly more than 70 ms (instead of over a whole second). To make this work (to give each AP different values to use for stack, etc) it's best to use multiple AP CPU startup trampolines (e.g. so that an AP can do "mov esp,[cs:stackPointer]" where different APs are started with different values in cs because that came from the SIPI).
b) Sending multiple INITs to multiple CPUs one at a time; then having one 10 ms delay; then doing the later "SIPI-SIPI" sequence one CPU at a time. This relies on the later "SIPI-SIPI" sequence being relatively fast (compared to the huge 10 ms delay after INIT) and the CPU not being too fussy about the exact length of that 10 ms delay. For example; if you send 4 INITs to 4 CPUs and you know that (for a worst case) the SIPI-SIPI takes 1 ms for the OS to decide that the CPU failed to start; then there'd be a delay of 13 ms between sending the INIT to the fourth/last CPU and sending the first SIPI to the fourth/last CPU.
Note that if you're brave, both of these approaches can be combined (e.g. you could start 128 CPUs in a little more than 50 ms).

Related

Where does the value returned by the syscalls stored in linux? [duplicate]

The following links explain the x86-32 system call conventions for both UNIX (BSD flavor) and Linux:
http://www.int80h.org/bsdasm/#system-calls
http://www.freebsd.org/doc/en/books/developers-handbook/x86-system-calls.html
But what are the x86-64 system call conventions on both UNIX & Linux?
Further reading for any of the topics here: The Definitive Guide to Linux System Calls
I verified these using GNU Assembler (gas) on Linux.
Kernel Interface
x86-32 aka i386 Linux System Call convention:
In x86-32, parameters for Linux system calls are passed using registers: %eax holds the syscall_number, and %ebx, %ecx, %edx, %esi, %edi, %ebp are used for passing 6 parameters to system calls.
The return value is in %eax. All other registers (including EFLAGS) are preserved across the int $0x80.
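For example, here is a minimal write() via int $0x80 from C using GCC inline assembly (a sketch; 4 is __NR_write on i386):
/* Build as 32-bit code: gcc -m32 -static hello32.c -o hello32 */
int main(void)
{
    static const char msg[] = "hello from int 0x80\n";
    long ret;
    asm volatile ("int $0x80"
                  : "=a" (ret)              /* return value comes back in %eax */
                  : "0" (4L),               /* %eax = __NR_write on i386       */
                    "b" (1),                /* %ebx = fd (stdout)              */
                    "c" (msg),              /* %ecx = buffer                   */
                    "d" (sizeof msg - 1)    /* %edx = count                    */
                  : "memory");
    return ret < 0 ? 1 : 0;
}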
I took the following snippet from the Linux Assembly Tutorial but I'm doubtful about this. If anyone can show an example, it would be great.
If there are more than six arguments,
%ebx must contain the memory
location where the list of arguments
is stored - but don't worry about this
because it's unlikely that you'll use
a syscall with more than six
arguments.
For an example and a little more reading, refer to http://www.int80h.org/bsdasm/#alternate-calling-convention. Another example of a Hello World for i386 Linux using int 0x80: Hello, world in assembly language with Linux system calls?
There is a faster way to make 32-bit system calls: using sysenter. The kernel maps a page of memory into every process (the vDSO), with the user-space side of the sysenter dance, which has to cooperate with the kernel for it to be able to find the return address. Arg to register mapping is the same as for int $0x80. You should normally call into the vDSO instead of using sysenter directly. (See The Definitive Guide to Linux System Calls for info on linking and calling into the vDSO, and for more info on sysenter, and everything else to do with system calls.)
x86-32 [Free|Open|Net|DragonFly]BSD UNIX System Call convention:
Parameters are passed on the stack. Push the parameters (last parameter pushed first) onto the stack. Then push an additional 32 bits of dummy data (it's not actually dummy data; refer to the following link for more info) and then issue the system call instruction int $0x80.
http://www.int80h.org/bsdasm/#default-calling-convention
x86-64 Linux System Call convention:
(Note: x86-64 Mac OS X is similar but different from Linux. TODO: check what *BSD does)
Refer to section: "A.2 AMD64 Linux Kernel Conventions" of System V Application Binary Interface AMD64 Architecture Processor Supplement. The latest versions of the i386 and x86-64 System V psABIs can be found linked from this page in the ABI maintainer's repo. (See also the x86 tag wiki for up-to-date ABI links and lots of other good stuff about x86 asm.)
Here is the snippet from this section:
User-level applications use as integer registers for passing the
sequence %rdi, %rsi, %rdx, %rcx,
%r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
A system-call is done via the syscall instruction. This clobbers %rcx and %r11 as well as the %rax return value, but other registers are preserved.
The number of the syscall has to be passed in register %rax.
System-calls are limited to six arguments, no argument is passed
directly on the stack.
Returning from the syscall, register %rax contains the result of
the system-call. A value in the range between -4095 and -1 indicates
an error, it is -errno.
Only values of class INTEGER or class MEMORY are passed to the kernel.
Remember this is from the Linux-specific appendix to the ABI, and even for Linux it's informative not normative. (But it is in fact accurate.)
This 32-bit int $0x80 ABI is usable in 64-bit code (but highly not recommended). What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? It still truncates its inputs to 32-bit, so it's unsuitable for pointers, and it zeros r8-r11.
User Interface: function calling
x86-32 Function Calling convention:
In x86-32, parameters were passed on the stack: the last parameter was pushed first, until all parameters were pushed, and then the call instruction was executed. This is used for calling C library (libc) functions on Linux from assembly.
Modern versions of the i386 System V ABI (used on Linux) require 16-byte alignment of %esp before a call, like the x86-64 System V ABI has always required. Callees are allowed to assume that and use SSE 16-byte loads/stores that fault on unaligned. But historically, Linux only required 4-byte stack alignment, so it took extra work to reserve naturally-aligned space even for an 8-byte double or something.
Some other modern 32-bit systems still don't require more than 4 byte stack alignment.
x86-64 System V user-space Function Calling convention:
x86-64 System V passes args in registers, which is more efficient than i386 System V's stack args convention. It avoids the latency and extra instructions of storing args to memory (cache) and then loading them back again in the callee. This works well because there are more registers available, and is better for modern high-performance CPUs where latency and out-of-order execution matter. (The i386 ABI is very old).
In this new mechanism: First the parameters are divided into classes. The class of each parameter determines the manner in which it is passed to the called function.
For complete information refer to : "3.2 Function Calling Sequence" of System V Application Binary Interface AMD64 Architecture Processor Supplement which reads, in part:
Once arguments are classified, the registers get assigned (in
left-to-right order) for passing as follows:
If the class is MEMORY, pass the argument on the stack.
If the class is INTEGER, the next available register of the
sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used
So %rdi, %rsi, %rdx, %rcx, %r8 and %r9 are the registers, in order, used to pass integer/pointer (i.e. INTEGER class) parameters to any libc function from assembly. %rdi is used for the first INTEGER parameter, %rsi for the 2nd, %rdx for the 3rd, and so on. Then the call instruction should be executed. The stack (%rsp) must be 16B-aligned when call executes.
If there are more than 6 INTEGER parameters, the 7th INTEGER parameter and later are passed on the stack. (Caller pops, same as x86-32.)
The first 8 floating point args are passed in %xmm0-7, later on the stack. There are no call-preserved vector registers. (A function with a mix of FP and integer arguments can have more than 8 total register arguments.)
Variadic functions (like printf) always need %al = the number of FP register args.
There are rules for when to pack structs into registers (rdx:rax on return) vs. in memory. See the ABI for details, and check compiler output to make sure your code agrees with compilers about how something should be passed/returned.
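As a concrete illustration (a sketch; the function names are just examples, and the register annotations in the comments are what gcc/clang emit for x86-64 System V - worth checking against real compiler output, as noted above):
/* Integer/pointer args use %rdi, %rsi, %rdx, %rcx, %r8, %r9 in order;
   FP args use %xmm0..%xmm7. */
double mix(int a, double x, long b, double y, const char *p)
/*         a->%edi  x->%xmm0 b->%rsi  y->%xmm1  p->%rdx             */
{
    return x + y + a + b + (p != 0);   /* double result returned in %xmm0 */
}

/* A 16-byte struct of two longs is classified INTEGER,INTEGER and is
   returned in %rax (first member) and %rdx (second member).        */
struct pair { long lo, hi; };
struct pair make_pair(long lo, long hi)  /* lo->%rdi, hi->%rsi */
{
    struct pair r = { lo, hi };
    return r;
}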
Note that the Windows x64 function calling convention has multiple significant differences from x86-64 System V, like shadow space that must be reserved by the caller (instead of a red-zone), and call-preserved xmm6-xmm15. And very different rules for which arg goes in which register.
Perhaps you're looking for the x86_64 ABI?
www.x86-64.org/documentation/abi.pdf (404 at 2018-11-24)
www.x86-64.org/documentation/abi.pdf (via Wayback Machine at 2018-11-24)
Where is the x86-64 System V ABI documented? - https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI is kept up to date (by HJ Lu, one of the ABI maintainers) with links to PDFs of the official current version.
If that's not precisely what you're after, use 'x86_64 abi' in your preferred search engine to find alternative references.
Linux kernel 5.0 source comments
I knew that x86 specifics are under arch/x86, and that syscall stuff goes under arch/x86/entry. So a quick git grep rdi in that directory leads me to arch/x86/entry/entry_64.S:
/*
* 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
*
* This is the only entry point used for 64-bit system calls. The
* hardware interface is reasonably well designed and the register to
* argument mapping Linux uses fits well with the registers that are
* available when SYSCALL is used.
*
* SYSCALL instructions can be found inlined in libc implementations as
* well as some other programs and libraries. There are also a handful
* of SYSCALL instructions in the vDSO used, for example, as a
* clock_gettimeofday fallback.
*
* 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
* then loads new ss, cs, and rip from previously programmed MSRs.
* rflags gets masked by a value from another MSR (so CLD and CLAC
* are not needed). SYSCALL does not save anything on the stack
* and does not change rsp.
*
* Registers on entry:
* rax system call number
* rcx return address
* r11 saved rflags (note: r11 is callee-clobbered register in C ABI)
* rdi arg0
* rsi arg1
* rdx arg2
* r10 arg3 (needs to be moved to rcx to conform to C ABI)
* r8 arg4
* r9 arg5
* (note: r12-r15, rbp, rbx are callee-preserved in C ABI)
*
* Only called from user space.
*
* When user can change pt_regs->foo always force IRET. That is because
* it deals with uncanonical addresses better. SYSRET has trouble
* with them due to bugs in both AMD and Intel CPUs.
*/
and for 32-bit at arch/x86/entry/entry_32.S:
/*
* 32-bit SYSENTER entry.
*
* 32-bit system calls through the vDSO's __kernel_vsyscall enter here
* if X86_FEATURE_SEP is available. This is the preferred system call
* entry on 32-bit systems.
*
* The SYSENTER instruction, in principle, should *only* occur in the
* vDSO. In practice, a small number of Android devices were shipped
* with a copy of Bionic that inlined a SYSENTER instruction. This
* never happened in any of Google's Bionic versions -- it only happened
* in a narrow range of Intel-provided versions.
*
* SYSENTER loads SS, ESP, CS, and EIP from previously programmed MSRs.
* IF and VM in RFLAGS are cleared (IOW: interrupts are off).
* SYSENTER does not save anything on the stack,
* and does not save old EIP (!!!), ESP, or EFLAGS.
*
* To avoid losing track of EFLAGS.VM (and thus potentially corrupting
* user and/or vm86 state), we explicitly disable the SYSENTER
* instruction in vm86 mode by reprogramming the MSRs.
*
* Arguments:
* eax system call number
* ebx arg1
* ecx arg2
* edx arg3
* esi arg4
* edi arg5
* ebp user stack
* 0(%ebp) arg6
*/
glibc 2.29 Linux x86_64 system call implementation
Now let's cheat by looking at a major libc implementation and seeing what it does.
What could be better than looking into the glibc that I'm using right now as I write this answer? :-)
glibc 2.29 defines x86_64 syscalls at sysdeps/unix/sysv/linux/x86_64/sysdep.h and that contains some interesting code, e.g.:
/* The Linux/x86-64 kernel expects the system call parameters in
registers according to the following table:
syscall number rax
arg 1 rdi
arg 2 rsi
arg 3 rdx
arg 4 r10
arg 5 r8
arg 6 r9
The Linux kernel uses and destroys internally these registers:
return address from syscall    rcx
eflags from syscall            r11
Normal function call, including calls to the system call stub
functions in the libc, get the first six parameters passed in
registers and the seventh parameter and later on the stack. The
register use is as follows:
system call number in the DO_CALL macro
arg 1 rdi
arg 2 rsi
arg 3 rdx
arg 4 rcx
arg 5 r8
arg 6 r9
We have to take care that the stack is aligned to 16 bytes. When
called the stack is not aligned since the return address has just
been pushed.
Syscalls of more than 6 arguments are not supported. */
and:
/* Registers clobbered by syscall. */
# define REGISTERS_CLOBBERED_BY_SYSCALL "cc", "r11", "cx"
#undef internal_syscall6
#define internal_syscall6(number, err, arg1, arg2, arg3, arg4, arg5, arg6) \
({ \
unsigned long int resultvar; \
TYPEFY (arg6, __arg6) = ARGIFY (arg6); \
TYPEFY (arg5, __arg5) = ARGIFY (arg5); \
TYPEFY (arg4, __arg4) = ARGIFY (arg4); \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg6, _a6) asm ("r9") = __arg6; \
register TYPEFY (arg5, _a5) asm ("r8") = __arg5; \
register TYPEFY (arg4, _a4) asm ("r10") = __arg4; \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4), \
"r" (_a5), "r" (_a6) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
which I feel are pretty self explanatory. Note how this seems to have been designed to exactly match the calling convention of regular System V AMD64 ABI functions: https://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions
Quick reminder of the clobbers:
cc means the flags register (condition codes). But Peter Cordes comments that this is unnecessary here.
memory means that a pointer may be passed in assembly and used to access memory
For an explicit minimal runnable example from scratch see this answer: How to invoke a system call via syscall or sysenter in inline assembly?
Make some syscalls in assembly manually
Not very scientific, but fun:
x86_64.S
.text
.global _start
_start:
asm_main_after_prologue:
/* write */
mov $1, %rax /* syscall number */
mov $1, %rdi /* stdout */
mov $msg, %rsi /* buffer */
mov $len, %rdx /* len */
syscall
/* exit */
mov $60, %rax /* syscall number */
mov $0, %rdi /* exit status */
syscall
msg:
.ascii "hello\n"
len = . - msg
GitHub upstream.
Make system calls from C
Here's an example with register constraints: How to invoke a system call via syscall or sysenter in inline assembly?
aarch64
I've shown a minimal runnable userland example at: https://reverseengineering.stackexchange.com/questions/16917/arm64-syscalls-table/18834#18834 TODO grep kernel code here, should be easy.
A calling convention defines how parameters are passed in registers when calling, or being called by, another program. The best source for these conventions is the ABI standard defined for each piece of hardware. For ease of compilation, the same ABI is used by both userspace and kernel programs. Linux/FreeBSD follow the same ABI for x86-64 and another set for 32-bit. But the x86-64 ABI for Windows is different from Linux/FreeBSD's. And generally an ABI does not differentiate system calls vs normal "function calls".
I.e., here is a particular example of the x86_64 calling conventions, and it is the same for both Linux userspace and kernel: http://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/ (note the sequence a, b, c, d, e, f of parameters).
Performance is one of the reasons for these ABIs (e.g., passing parameters via registers instead of saving them onto memory stacks).
For ARM there are various ABIs:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.abi/index.html
https://developer.apple.com/library/ios/documentation/Xcode/Conceptual/iPhoneOSABIReference/iPhoneOSABIReference.pdf
ARM64 convention:
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf
For Linux on PowerPC:
http://refspecs.freestandards.org/elf/elfspec_ppc.pdf
http://www.0x04.net/doc/elf/psABI-ppc64.pdf
And for embedded there is the PPC EABI:
http://www.freescale.com/files/32bit/doc/app_note/PPCEABI.pdf
This document is a good overview of all the different conventions:
http://www.agner.org/optimize/calling_conventions.pdf

Trouble Figuring out loading to register with offset from different register

I am creating an 8-bit CPU. I have basic instructions like mov, ld, st, add, sub, mult, jmp. I am trying to put my instructions together. First I move the base address of a value into register 1 (R1). I then want to load register 2 (R2) with that value. So my instructions look like:
1 mov R1, 0xFFFF
2 ld R2, [R1+0]
My opcode definitions are:
ld: 0001
mov: 1111
Register codes are:
R1: 0001
R2: 0010
So my instructions in binary look like:
1 mov R1, 0xFFFF = 1111 0001 0xFFFF
2 ld R2, [R1+0] = 0001 00010
But for my second instruction, the load, how can I ensure that the value stored at the memory location I moved into R1 is going to be used? This is my first time doing anything with computer architecture, so I am a little lost.
how can I ensure the value stored at the memory location I moved to R1 is going to be used.
By building your hardware to correctly handle the read-after-write hazard (https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Data_hazards).
Either:
make it a simple non-pipelined CPU where one instruction writes back to registers before the next instruction reads any registers,
detect the dependency and stall the pipeline, or
use bypassing/forwarding (https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Eliminating_hazards) - a sketch of the hazard check follows below.
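The core of the hazard check is just a comparison of register numbers between adjacent instructions. Here is a sketch in C of the kind of condition the hardware would evaluate (the struct and field names are made up, not from your ISA):
#include <stdbool.h>
#include <stdint.h>

/* Decoded-instruction fields for a toy pipeline (names are illustrative). */
struct instr {
    bool    writes_reg;   /* does this instruction write a register?            */
    uint8_t dest_reg;     /* e.g. R1 = 1, R2 = 2                                 */
    uint8_t src_reg;      /* base register read, e.g. by ld R2, [R1+0]           */
};

/* Read-after-write hazard: the next instruction reads a register that the
   previous instruction has not written back yet. */
static bool raw_hazard(const struct instr *prev, const struct instr *next)
{
    return prev->writes_reg && prev->dest_reg == next->src_reg;
}

/* When raw_hazard() is true, either:
   - stall: hold the next instruction in decode until the previous one writes back,
   - forward: feed the previous result straight into the next execute stage, or
   - keep the CPU non-pipelined so writeback always completes first.             */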

Privileged instructions in Intel x86-64 [duplicate]

According to this source (levels 3 - 5), specific CPU rings cannot do certain things; for example, ring 1, 2 and 3 code cannot set up the GDT, since otherwise the OS kernel could be crashed.
While it is obvious that Ring 0 can execute all instructions, I am wondering which instructions cannot be issued in rings 1, 2 and 3?
I could not find anything on either Wikipedia or osdev and similar sources stating which instructions cannot be issued in a specific ring.
The following instructions cannot be executed in Ring 3:
LGDT
LLDT
LTR
LIDT
MOV (to and from control registers only)
MOV (to and from debug registers only)
LMSW
CLTS
INVD
WBINVD
INVLPG
HLT
RDMSR
WRMSR
RDPMC (unless CR4.PCE is set to allow user-mode access)
RDTSC (only when CR4.TSD is set; by default RDTSC is allowed in ring 3)
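You can observe this from user space on Linux: executing one of these in ring 3 raises a general-protection fault (#GP), which the kernel delivers to the process as SIGSEGV. A small sketch (hlt is used here just because it's the simplest to write):
/* gcc try_hlt.c -o try_hlt && ./try_hlt   -> killed by SIGSEGV (the #GP) */
#include <stdio.h>

int main(void)
{
    puts("about to execute hlt in ring 3...");
    asm volatile ("hlt");       /* privileged: requires CPL 0 */
    puts("never reached");
    return 0;
}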

What are Ring 0 and Ring 3 in the context of operating systems?

I've been learning the basics of driver development in Windows, and I keep finding the terms Ring 0 and Ring 3. What do these refer to? Are they the same thing as kernel mode and user mode?
Linux x86 ring usage overview
Understanding how rings are used in Linux will give you a good idea of what they are designed for.
In x86 protected mode, the CPU is always in one of 4 rings. The Linux kernel only uses 0 and 3:
0 for kernel
3 for users
This is the most hard and fast definition of kernel vs userland.
Why Linux does not use rings 1 and 2: CPU Privilege Rings: Why rings 1 and 2 aren't used?
How is the current ring determined?
The current ring is selected by a combination of:
global descriptor table: an in-memory table of GDT entries, where each entry has a privilege field (Privl/DPL) which encodes the ring.
The LGDT instruction sets the address of the current descriptor table.
See also: http://wiki.osdev.org/Global_Descriptor_Table
the segment registers CS, DS, etc., which hold the index of an entry in the GDT.
For example, CS = 0x8 means the first usable entry of the GDT (after the mandatory null descriptor) is currently active for the executing code, and the low two bits of CS give the current privilege level.
What can each ring do?
The CPU chip is physically built so that:
ring 0 can do anything
ring 3 cannot run several instructions and write to several registers, most notably:
cannot change its own ring! Otherwise, it could set itself to ring 0 and rings would be useless.
In other words, cannot modify the current segment descriptor, which determines the current ring.
cannot modify the page tables: How does x86 paging work?
In other words, cannot modify the CR3 register, and paging itself prevents modification of the page tables.
This prevents one process from seeing the memory of other processes for security / ease of programming reasons.
cannot register interrupt handlers. Those are configured by writing to memory locations, which is also prevented by paging.
Handlers run in ring 0, and would break the security model.
In other words, cannot use the LGDT and LIDT instructions.
cannot execute IO instructions like in and out, and thus cannot make arbitrary hardware accesses.
Otherwise, for example, file permissions would be useless if any program could directly read from disk.
More precisely thanks to Michael Petch: it is actually possible for the OS to allow IO instructions on ring 3, this is actually controlled by the Task state segment.
What is not possible is for ring 3 to give itself permission to do so if it didn't have it in the first place.
Linux always disallows it. See also: Why doesn't Linux use the hardware context switch via the TSS?
How do programs and operating systems transition between rings?
when the CPU is turned on, it starts running the initial program in ring 0 (well, kind of, but it is a good approximation). You can think of this initial program as being the kernel (but it is normally a bootloader that then calls the kernel, still in ring 0).
when a userland process wants the kernel to do something for it like write to a file, it uses an instruction that generates an interrupt such as int 0x80 or syscall to signal the kernel. x86-64 Linux syscall hello world example:
.data
hello_world:
.ascii "hello world\n"
hello_world_len = . - hello_world
.text
.global _start
_start:
/* write */
mov $1, %rax
mov $1, %rdi
mov $hello_world, %rsi
mov $hello_world_len, %rdx
syscall
/* exit */
mov $60, %rax
mov $0, %rdi
syscall
compile and run:
as -o hello_world.o hello_world.S
ld -o hello_world.out hello_world.o
./hello_world.out
GitHub upstream.
When this happens, the CPU calls an interrupt callback handler which the kernel registered at boot time. Here is a concrete baremetal example that registers a handler and uses it.
This handler runs in ring 0; it decides if the kernel will allow the action, performs the action, and restarts the userland program in ring 3.
when the exec system call is used (or when the kernel will start /init), the kernel prepares the registers and memory of the new userland process, then it jumps to the entry point and switches the CPU to ring 3
If the program tries to do something naughty like write to a forbidden register or memory address (because of paging), the CPU also calls some kernel callback handler in ring 0.
But since the userland was naughty, the kernel might kill the process this time, or give it a warning with a signal.
When the kernel boots, it sets up a hardware clock with some fixed frequency, which generates interrupts periodically.
This hardware clock generates interrupts that run in ring 0 and allow the kernel to decide which userland processes to wake up.
This way, scheduling can happen even if the processes are not making any system calls.
What is the point of having multiple rings?
There are two major advantages of separating kernel and userland:
it is easier to make programs as you are more certain one won't interfere with the other. E.g., one userland process does not have to worry about overwriting the memory of another program because of paging, nor about putting hardware in an invalid state for another process.
it is more secure. E.g. file permissions and memory separation could prevent a hacking app from reading your bank data. This supposes, of course, that you trust the kernel.
How to play around with it?
I've created a bare metal setup that should be a good way to manipulate rings directly: https://github.com/cirosantilli/x86-bare-metal-examples
I didn't have the patience to make a userland example unfortunately, but I did go as far as paging setup, so userland should be feasible. I'd love to see a pull request.
Alternatively, Linux kernel modules run in ring 0, so you can use them to try out privileged operations, e.g. read the control registers: How to access the control registers cr0,cr2,cr3 from a program? Getting segmentation fault
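A minimal sketch of such a module (assuming an x86-64 kernel build environment; error handling and the Makefile are omitted):
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

static int __init cr0_init(void)
{
    unsigned long cr0;
    /* Privileged read: only works because module code runs in ring 0. */
    asm volatile ("mov %%cr0, %0" : "=r" (cr0));
    pr_info("cr0 = 0x%lx\n", cr0);
    return 0;
}

static void __exit cr0_exit(void) { }

module_init(cr0_init);
module_exit(cr0_exit);
MODULE_LICENSE("GPL");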
Here is a convenient QEMU + Buildroot setup to try it out without killing your host.
The downside of kernel modules is that other kthreads are running and could interfere with your experiments. But in theory you can take over all interrupt handlers with your kernel module and own the system, that would be an interesting project actually.
Negative rings
While negative rings are not actually referenced in the Intel manual, there are CPU modes which have further capabilities than ring 0 itself, and so are a good fit for the "negative ring" name.
One example is the hypervisor mode used in virtualization.
For further details see:
https://security.stackexchange.com/questions/129098/what-is-protection-ring-1
https://security.stackexchange.com/questions/216527/ring-3-exploits-and-existence-of-other-rings
ARM
In ARM, the rings are called Exception Levels instead, but the main ideas remain the same.
There exist 4 exception levels in ARMv8, commonly used as:
EL0: userland
EL1: kernel ("supervisor" in ARM terminology).
Entered with the svc instruction (SuperVisor Call), previously known as swi before unified assembly, which is the instruction used to make Linux system calls. Hello world ARMv8 example:
hello.S
.text
.global _start
_start:
/* write */
mov x0, 1
ldr x1, =msg
ldr x2, =len
mov x8, 64
svc 0
/* exit */
mov x0, 0
mov x8, 93
svc 0
msg:
.ascii "hello syscall v8\n"
len = . - msg
GitHub upstream.
Test it out with QEMU on Ubuntu 16.04:
sudo apt-get install qemu-user gcc-aarch64-linux-gnu
aarch64-linux-gnu-as -o hello.o hello.S
aarch64-linux-gnu-ld -o hello hello.o
qemu-aarch64 hello
Here is a concrete baremetal example that registers an SVC handler and does an SVC call.
EL2: hypervisors, for example Xen.
Entered with the hvc instruction (HyperVisor Call).
A hypervisor is to an OS, what an OS is to userland.
For example, Xen allows you to run multiple OSes such as Linux or Windows on the same system at the same time, and it isolates the OSes from one another for security and ease of debug, just like Linux does for userland programs.
Hypervisors are a key part of today's cloud infrastructure: they allow multiple servers to run on a single hardware, keeping hardware usage always close to 100% and saving a lot of money.
AWS for example used Xen until 2017 when its move to KVM made the news.
EL3: yet another level. TODO example.
Entered with the smc instruction (Secure Mode Call)
The ARMv8 Architecture Reference Manual DDI 0487C.a - Chapter D1 - The AArch64 System Level Programmer's Model - Figure D1-1 illustrates this beautifully.
The ARM situation changed a bit with the advent of ARMv8.1 Virtualization Host Extensions (VHE). This extension allows the kernel to run in EL2 efficiently:
VHE was created because in-Linux-kernel virtualization solutions such as KVM have gained ground over Xen (see e.g. AWS' move to KVM mentioned above), because most clients only need Linux VMs, and as you can imagine, being all in a single project, KVM is simpler and potentially more efficient than Xen. So now the host Linux kernel acts as the hypervisor in those cases.
From the image we can see that when the bit E2H of register HCR_EL2 equals 1, then VHE is enabled, and:
the Linux kernel runs in EL2 instead of EL1
when HCR_EL2.TGE == 1, we are a regular host userland program. Using sudo can destroy the host as usual.
when HCR_EL2.TGE == 0, we are a guest OS (e.g. when you run an Ubuntu OS inside QEMU KVM inside the host Ubuntu). Doing sudo cannot destroy the host unless there's a QEMU/host kernel bug.
Note how ARM, maybe due to the benefit of hindsight, has a better naming convention for the privilege levels than x86, without the need for negative levels: 0 being the lowest and 3 the highest. Higher levels tend to be created more often than lower ones.
The current EL can be queried with the MRS instruction: what is the current execution mode/exception level, etc?
ARM does not require all exception levels to be present to allow for implementations that don't need the feature to save chip area. ARMv8 "Exception levels" says:
An implementation might not include all of the Exception levels. All implementations must include EL0 and EL1.
EL2 and EL3 are optional.
QEMU for example defaults to EL1, but EL2 and EL3 can be enabled with command line options: qemu-system-aarch64 entering el1 when emulating a53 power up
Code snippets tested on Ubuntu 18.10.
Intel processors (x86 and others) give applications only limited powers. To restrict (protect) critical resources like IO, memory, ports, etc., the CPU, in liaison with the OS (Windows in this case), provides privilege levels (0 being the most privileged, 3 the least) that map to kernel mode and user mode respectively.
So, the OS runs kernel code in ring 0 - the highest privilege level provided by the CPU - and user code in ring 3.
For more details, see http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection/

Why Does the BIOS INT 0x19 Load Bootloader at "0x7C00"?

As we know, the BIOS interrupt (INT) 0x19 searches for a boot signature (0xAA55), and loads and executes our bootloader at 0x7C00 if one is found.
My question: why 0x7C00? What is the reason? How can it be derived?
Maybe because the MBR is loaded into memory (by the BIOS) at address 0x7c00; int 0x19 then looks for the MBR signature 0xAA55 in the sector it loaded at 0x7c00.
about 0xAA55:
Not a checksum, but more of a signature. It does provide some simple
evidence that some MBR is present.
0xAA55 is also an alternating bit pattern: 1010101001010101
It's often used to help determine if you are on a little-endian or
big-endian system, because it will read as either AA55 or 55AA. I
suspect that is part of why it is put on the end of the MBR.
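A quick way to see that byte-order effect in C (a small sketch, unrelated to the BIOS itself):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t sig = 0xAA55;
    unsigned char *b = (unsigned char *)&sig;
    /* On little-endian x86 the low byte 0x55 comes first in memory, which is
       why the last two bytes of an MBR on disk read as 0x55 0xAA. */
    printf("bytes in memory: %02x %02x\n", b[0], b[1]);
    return 0;
}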
about 0x7c00:
Check this website out (it might help you find the answer): https://www.glamenv-septzen.net/en/view/6
This is probably dead, but I'm going to answer anyway.
At the start of any bootloader, when you set the origin of the segment to 0x7c00, the addresses your code jumps to are based on that origin as well. So, ideally, check out some online resources that tell you how to use the int 0x19 call; they will guide you on how to jump to another address.
To fix this you would, ideally, reset the stack at the start of every jump to a new address.
(It seems like this duplicates the following questions:
What is significance of memory at 0000:7c00 to booting sequence?
Does the BIOS copy the 512-byte bootloader to 0x7c00 )
Inspired by an answer to the former, I would quote two sources:
[Notes] Why is the MBR loaded to 0x7C00 on x86? (Full Edition)
So, first answer the question about the 16KB model from David Bradley's reply:
It had to boot on a 32KB machine. DOS 1.0 required a minimum of 32KB,
so we weren't concerned about attempting a boot in 16KB.
To execute DOS 1.0, at least 32KB is required, so the 16KB model has
not been considered.
Followed by the answer to "Why 32KB-1024B?":
We wanted to leave as much room as possible for the OS to load itself
within the 32KB. The 808x Intel architecture used up the first portion
of the memory range for software interrupts, and the BIOS data area
was after it. So we put the bootstrap load at 0x7C00 (32KB-1KB) to
leave all the room in between for the OS to load. The boot sector was
512 bytes, and when it executes it'll need some room for data and a
stack, so that's the other 512 bytes. So the memory map looks like
this after INT 19H executes:
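Roughly, the map being described is:
0x00000 - 0x003FF   real-mode interrupt vector table
0x00400 - 0x004FF   BIOS data area
0x00500 - 0x07BFF   free for the OS to load itself into
0x07C00 - 0x07DFF   boot sector loaded by INT 19h (512 bytes)
0x07E00 - 0x07FFF   boot sector's data and stack (512 bytes)
0x08000             end of the minimum 32 KiB configuration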
Why BIOS loads MBR into 0x7C00 in x86 ?
No, that case was out of consideration. One of IBM PC 5150 ROM BIOS
Developer Team Members, Dr. David Bradley says:
"DOS 1.0 required a minimum of 32KB, so we weren't concerned about
attempting a boot in 16KB."
(Note: did DOS 1.0 require a minimum of 16 KiB or 32 KiB? I couldn't find out
which is correct. But, at least, during the early BIOS development in 1981, they
assumed that 32 KiB was DOS's minimum requirement.)
BIOS developer team decided 0x7C00 because:
They wanted to leave as much room as possible for the OS to load
itself within the 32KiB.
8086/8088 used 0x0 - 0x3FF for the interrupt vector table, and the BIOS data area was after it.
The boot sector was 512 bytes, and the stack/data area for the boot program needed another 512 bytes.
So, 0x7C00, the last 1024B of 32KiB was chosen.
Once the OS has loaded and started, the boot sector is never used again until the next power reset.
So, the OS and applications can use the last 1024 B of the 32 KiB freely.
I hope this answer is well-sourced enough to show why and how it happened this way.