How does virtualization technology shut down the OS? - virtualization

I searched for information on this question, looking at KVM, ACPI, etc.
My guess is that the OS implements some interface (like ACPI?) through which it can receive some kind of signal or command and shut itself down.
And the host, through virtualization technology, sends that signal or command to the OS of the instance.
Is my understanding right? Can someone give me a direction? Thanks.

It happens approximately like this.
The VMM (virtual machine monitor) supplies a guest BIOS/UEFI image which, when run inside a virtual machine, populates in-memory ACPI descriptions.
The guest OS reads these ACPI tables and, among other things, finds a description of a button device that corresponds to a power button. It reads which resources are assigned to that button, in particular how the button is supposed to signal its state. Most commonly, it will be an interrupt with a certain number, plus the addresses of register(s) used to tell multiple buttons apart.
When a VMM user/admin decides to press a virtual power button, they use a VMM-specific interface (GUI, command line, script, etc.) to command the VMM to do that. The VMM then sets up the registers and injects the previously negotiated interrupt number into the guest OS.
It is now the guest OS's responsibility to react to the signal. From its standpoint, it looks as if a real power button was pressed. The guest OS then shuts down individual processes, flushes disk caches, and finally uses platform-specific device access to command the virtual hardware to shut down the power (or, alternatively, go to hibernate, the S5 state, a HLT state, reboot, etc.). The guest OS may well ignore the button press if no guest software is installed to propagate it to the OS, e.g. Ubuntu without the acpi-tools package.
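To make the guest side concrete, here is a minimal, hedged sketch in C of how a guest could detect the (virtual) power-button event via the ACPI PM1a status register. The port number is invented for illustration (the real one comes from the FADT's PM1a_EVT_BLK field), and a real kernel is notified via the SCI interrupt rather than by polling:

    /* Minimal sketch: polling the ACPI PM1a status register for a
     * (virtual) power-button press. Runs on Linux/x86 as root after
     * ioperm(); port 0x600 is a made-up example. */
    #include <stdint.h>
    #include <sys/io.h>   /* inw/outw; requires ioperm() and root */

    #define PM1A_EVT_BLK 0x600      /* hypothetical; really read from the FADT */
    #define PWRBTN_STS   (1 << 8)   /* power-button status bit, per the ACPI spec */

    int power_button_pressed(void)
    {
        uint16_t status = inw(PM1A_EVT_BLK);
        if (status & PWRBTN_STS) {
            outw(PWRBTN_STS, PM1A_EVT_BLK);  /* status bits are write-1-to-clear */
            return 1;                        /* begin an orderly shutdown */
        }
        return 0;
    }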

Related

What prevents a user application from being able to "hijack" into kernel mode?

From my understanding, kernel mode is a hardware feature, e.g. it can be set via a register (value1 -> kernel mode, value2 -> user mode).
When the kernel loads and runs a user application, the application has to communicate with the kernel via system calls to perform privileged actions; an interrupt happens, execution switches to kernel mode, and the privileged action is performed.
My question is:
What is the mechanism that prevents a malicious user application from setting that "mode" register and entering kernel mode (e.g. on x86)?
It makes sense that only the kernel can set this register; I would like to know more details about how this is enforced.
I don't know how this is enforced in the hardware itself, and it also depends on the architecture. In software on x86, it depends, because there are several entry points. When the CPU boots, it is in kernel mode: it can execute every instruction and do whatever it pleases with main memory.
The kernel will thus take advantage of this to set up the page tables and the interrupt handlers during boot before starting any user mode processes.
On x86, kernel mode vs. user mode is enforced by the page tables. If a user mode process attempts to access a page which is marked as kernel mode, it will trigger a fault and call an interrupt handler in kernel mode. The kernel will then kill the process.
Interrupts are not meant to be an entry point into the kernel. They can still be one if a fault happens, but then the user mode process won't know, and the kernel will sometimes kill the process (if it decides it should).
On x86, the real entry point into the kernel is the LSTAR MSR. This register can be set from kernel mode only. It is used alongside the syscall instruction in assembly to jump to the address specified in the register. User mode processes cannot jump into the kernel unless they use the syscall instruction. This allows the kernel to set up services for user mode, which are called system calls.
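As a hedged illustration of that enforcement, here is roughly how a kernel might install its syscall entry point in the LSTAR MSR. The MSR number is architectural; syscall_entry is a hypothetical kernel label. The key point is that WRMSR is itself privileged, so user mode cannot retarget the entry point:

    /* Sketch: installing the x86-64 syscall entry point. */
    #include <stdint.h>

    #define MSR_LSTAR 0xC0000082u   /* target RIP for the syscall instruction */

    static inline void wrmsr(uint32_t msr, uint64_t value)
    {
        /* WRMSR is privileged: executing it at CPL 3 raises a
         * general-protection fault, which is exactly the enforcement
         * the question asks about. */
        asm volatile("wrmsr" :: "c"(msr), "a"((uint32_t)value),
                                "d"((uint32_t)(value >> 32)));
    }

    extern void syscall_entry(void);   /* hypothetical kernel entry stub */

    void install_syscall_handler(void)
    {
        wrmsr(MSR_LSTAR, (uint64_t)(uintptr_t)syscall_entry);
    }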

What is the difference between user mode and kernel mode in terms of the total number of machine instructions available?

I read this paragraph from "Modern Operating Systems" by Tanenbaum:
Most computers have two modes of operation: kernel mode and user mode. The operating system is the most fundamental piece of software and runs in kernel mode (also called supervisor mode). In this mode it has complete access to all the hardware and can execute any instruction the machine is capable of executing. The rest of the software runs in user mode, in which only a subset of the machine instructions is available.
I am unable to understand how they can describe the difference between these two modes on the basis of the machine instructions available. At the user end, software seems perfectly capable of making changes at the hardware level: we have software that can affect the functioning of the CPU, or play with registry details. So how can we say that in user mode only a subset of the machine instructions is available?
The instructions that are available only in kernel mode tend to be very few. They are the instructions needed only to manage the system.
For example, most processors have a HALT instruction that stops the CPU, which is used for system shutdown. Obviously you would not want any user to be able to execute HALT and stop the computer for everyone, so such instructions are made accessible only in kernel mode.
Processors use a table of handlers for interrupts and exceptions. The operating system creates such a table listing the handlers for these events, then loads the register(s) giving the location (and size) of the table. The instructions for loading these register(s) are kernel-mode only; otherwise, any application could create total havoc on the system.
Instructions of this nature will trigger an exception if executed in user mode.
Such instructions tend to be few in number.
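You can demonstrate this from user space. A small Linux/x86 program (assuming the kernel delivers the resulting general-protection fault as SIGSEGV, as Linux does) shows that executing the privileged HLT instruction does not stop the machine; it merely faults:

    /* HLT in user mode: the CPU refuses and raises a fault, which the
     * kernel turns into a signal for this process. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void on_fault(int sig)
    {
        printf("HLT faulted in user mode (signal %d): privilege enforced\n", sig);
        exit(0);
    }

    int main(void)
    {
        signal(SIGSEGV, on_fault);
        asm volatile("hlt");        /* privileged: traps at CPL 3 */
        printf("unreachable\n");    /* never runs */
        return 0;
    }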
Well, in user mode there is definitely only a subset of instructions available. This is the reason we have system calls.
Example:
A user wants to create a new process in C. They cannot do that without entering kernel mode, because a certain set of instructions is available only in kernel mode. So they use the system call fork, which executes the instructions for creating a new process (not available in user mode). A system call is thus a mechanism for requesting a service from the kernel of the OS, asking it to do something the user cannot write code for directly.
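For completeness, here is that example as a runnable C program; the fork() and waitpid() library wrappers each enter the kernel through the system-call mechanism described above:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();          /* enters the kernel via a system call */
        if (pid == 0) {
            printf("child: created entirely by kernel-mode code\n");
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);   /* another system call */
            printf("parent: child %d has exited\n", (int)pid);
        } else {
            perror("fork");
        }
        return 0;
    }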
The following excerpt from the link above sums it up best:
A program is usually limited to its own address space so that it cannot access or modify other running programs or the operating system itself, and is usually prevented from directly manipulating hardware devices (e.g. the frame buffer or network devices).
However, many normal applications obviously need access to these components, so system calls are made available by the operating system to provide well defined, safe implementations for such operations. The operating system executes at the highest level of privilege, and allows applications to request services via system calls, which are often initiated via interrupts. An interrupt automatically puts the CPU into some elevated privilege level, and then passes control to the kernel, which determines whether the calling program should be granted the requested service. If the service is granted, the kernel executes a specific set of instructions over which the calling program has no direct control, returns the privilege level to that of the calling program, and then returns control to the calling program.

How does a program control hardware?

In order to be executed by the cpu, a program must be loaded into RAM. A program is just a sequence of machine instructions (like the x86 instruction set) that a processor can understand (because it physically implements their semantic through logic gates).
I can more or less understand how a local instruction (one executed inside the CPU) such as 'ADD R1, R2, R3' works. Even how the CPU interfaces with the RAM through the northbridge chipset, using the data bus and the address bus, is clear enough to me.
What I am struggling with is the big picture.
For example, how can a file be saved onto a hard disk?
Let's say that the motherboard uses a SATA interface to communicate with the HDD.
Does this mean that this SATA interface has an instruction set, which the CPU can use by preparing SATA instructions written in the correct format?
Does the same apply with the PCI interface, the AGP interface and so on?
Is all hardware communication basically accomplished by determining a standard interface for some task and implementing it (by the companies that create hardware chipsets) with an instruction set that any other hardware component can query?
Is my high level understanding of hardware and software interaction correct?
Nearly. It's actually more general than an instruction.
A lot of these details are architecture-specific, so I will stick to a high-level, general overview of how this can be done.
The CPU can read and write to RAM without a problem, correct? You can issue instructions that read and write to any memory address. So rather than trying to extend the CPU to understand every possible hardware interface out there, hardware manufacturers simply map sections of the address space (where RAM normally would be) to hardware.
Say, for example, you want to save a file to a hard drive. Here is a possible sequence of commands that could occur:
1. The command register of the hard drive controller is at address 0xF00, an address that is outside of RAM but accessible to the CPU.
2. Write the instruction to the command register that indicates we want to write to the hard drive.
3. There could conceivably be an address register at 0xF01 that tells the hard drive controller where to save the data.
4. Tell the hard drive controller that the data I want to write is at some address in RAM, and initiate the write sequence.
There are a myriad of other ways this could conceivably be done, but the important thing to note is that the CPU is simply using the instructions it already has for accessing RAM.
All of this can be done by the CPU without any special instructions on the CPU side: just reads and writes to an address. You can imagine this being extended: there is a special place in the address space for the USB controller that contains a list of USB devices, there is a special place for the PCI device list, and each PCI device has several registers that can be read and written to instruct it to do things.
Essentially the role of a device driver is to know how these special registers are to be read and written, what kind of commands devices can accept, etc. Often, as is the case with many graphics cards, what these registers do is not documented to the public and so we rely on their drivers to run the cards correctly.
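Here is a bare-metal-style sketch of the hypothetical disk controller described above. Every address and command value is invented (the prose's 0xF01 becomes 0xF04 here so the 32-bit register is naturally aligned), and on a hosted OS these stores would simply fault, so treat it purely as an illustration of "device control is just memory access":

    #include <stdint.h>

    #define DISK_CMD_REG   ((volatile uint8_t  *)0xF00) /* hypothetical */
    #define DISK_ADDR_REG  ((volatile uint32_t *)0xF04) /* hypothetical */
    #define DISK_CMD_WRITE 0x01u                        /* hypothetical */

    /* Ask the controller to write the buffer at 'dma_addr' to disk.
     * These are ordinary CPU stores; the address-decoding hardware
     * routes them to the device instead of to RAM. */
    void disk_write(uint32_t dma_addr)
    {
        *DISK_ADDR_REG = dma_addr;       /* where the data lives in RAM */
        *DISK_CMD_REG  = DISK_CMD_WRITE; /* kick off the write sequence */
    }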

Is the kernel a special program that is always executing? And why are there CPU modes?

I am new to this OS stuff. Since the kernel controls the execution of all other programs and the resources they need, I think it should also be executed by the CPU. If so, where does it get executed? And if what the CPU executes is controlled by the kernel, then how does the kernel control the CPU when the CPU is executing the kernel itself?
It seems like a paradox to me... please explain. And by the way, I didn't get these CPU modes at all. If the kernel is controlling all the processes, why are there CPU modes? And if they exist, are they implemented by the software (the OS) or by the hardware itself?
Thank you.
A quick answer. On platforms like x86, the kernel has full control of the CPU's interrupt and context-switching abilities. So, although the kernel is not running most of the time, every so often it gets a chance to decide which program the CPU will switch to and to let that program run for a while. This part of the kernel is called the scheduler. Other than that, the kernel gets a chance to execute every time a program makes a system call (such as a request to access some hardware, e.g. a disk drive).
P.S. The fact that the kernel can stop a running program, seize control of the CPU and schedule a different program is called preemptive multitasking.
UPDATE: About CPU modes, I assume you mean the x86-style rings? These are permission levels on the CPU for the currently executing code, allowing the CPU to decide whether the program that is currently running is "the kernel" and can do whatever it wants, or is a lower-permission-level program that cannot do certain things (such as force a context switch or fiddle with virtual memory).
There is no paradox:
The kernel is a "program" that runs on the machine it controls. It is loaded by the boot loader at the startup of the machine.
Its task is to provide services to applications and control applications.
To do so, it must control the machine that it is running on.
For details, read here: http://en.wikipedia.org/wiki/Operating_System

Full emulation vs. full virtualization

In full emulation the I/O devices, CPU, main memory are virtualized. The guest operating system would access virtual devices not physical devices. But what exactly is full virtualization? Is it the same as full emulation or something totally different?
Emulation and virtualization are related but not the same.
Emulation is using software to provide a different execution environment or architecture. For example, you might have an Android emulator run on a Windows box. The Windows box doesn't have the same processor that an Android device does so the emulator actually executes the Android application through software.
Virtualization is more about creating virtual barriers between multiple virtual environments running in the same physical environment. The big difference is that the virtualized environment is the same architecture. A virtualized application may provide virtualized devices that then get translated to physical devices and the virtualization host has control over which virtual machine has access to each device or portion of a device. The actual execution is most often still executed natively though, not through software. Therefore virtualization performance is usually much better than emulation.
There's also a separate concept of a Virtual Machine such as those that run Java, .NET, or Flash code. They can vary from one implementation to the next and may include aspects of either emulation or virtualization or both. For example, the JVM provides a mechanism to execute Java byte codes. However, the JVM spec doesn't dictate that the byte codes must be executed by software or that they must be compiled to native code. Each JVM can do its own thing, and in fact most JVMs do a combination of both, using emulation where appropriate and a JIT where appropriate (I believe the JIT in Sun/Oracle's JVM is called HotSpot).
A Hypervisor is a supervisor of supervisors, i.e. it is the kernel that controls kernels.
Type 1 vs Type 2 vs Hybrid Hypervisors
A Type 1 hypervisor is an OS designed to run VMs. It is installed directly on the disk to be executed from the boot sector like any OS; it is an OS purpose built to manage and run VMs and that's all you can do on it (and like an OS, it can be monolithic or microkernelised). All OSs installed on it run as guests.
A Type 2 hypervisor is a hypervisor that runs on top of an OS that's designed to run applications. It runs either as a plain application (full emulation), or by modifying the kernel with a driver that gives the OS the functionality to run VMs (virtualisation). The driver installs itself below/alongside the host OS invisibly, and the host OS continues to run (in ring 0 in the case of software virtualisation, and in non-VMX-mode ring 0 in the case of a driver supporting hardware virtualisation). The hypervisor hooked below it manages the guests in VMX non-root mode (in the case of hardware virtualisation) or ring 1 (in the case of software virtualisation), while passing control back to the host OS where appropriate, and calling into the host OS and its drivers to access hardware (which is why it is often pictured as sitting above the host OS). The GUI program on the host OS communicates with the driver; there is a subprocess per VM and a thread per vCPU.
A Hybrid hypervisor is an OS that's designed to run both applications and VMs. It can run in regular host OS mode, but it also has a hypervisor mode; when booted into that mode, it loads the host OS as a guest on top of a hypervisor and can load other guests. The hypervisor is typically microkernelised, meaning the hardware drivers are implemented in the host OS (called the parent partition) rather than in the hypervisor (on Hyper-V, the Integration Services Components drivers can be installed on other guests to communicate with host OS drivers via the VMBus system that the host OS sets up). The host OS runs in VMX non-root mode with a VMCS. Theoretically you could have a paravirtualised hybrid hypervisor, but KVM and Hyper-V only support hardware virtualisation. You could also have a monolithic hybrid hypervisor, but it doesn't make much sense: because of the presence of the host OS, it only needs to be microkernelised. A hybrid hypervisor is essentially a type 1 hypervisor that can boot either into type 1 hypervisor mode or into host OS mode. A microkernelised hypervisor is typically hybrid because the host OS used is the one already installed (and the microkernelised hypervisor functionality is already part of it; it's available as a feature install on Windows Server).
Fully emulated Type 2 Hypervisors
A full emulator emulates all registers of the target ISA as variables, and the CPU is completely emulated. This may be because you want to emulate a guest whose ISA is not the same as the host's (or indeed it can be the same, if you run an x86 emulator, e.g. Bochs, on an x86 system; it doesn't matter). As Peter says, the emulator does not need privileged access (a ring 0 driver helper), because all interpretation and emulation is done locally in the process, and the process calls regular host I/O functions. This works because none of the guest code needs to run natively; if you want it to run natively, you have to bring this functionality into ring 0 via a driver.
Full emulation is an emulation of everything: the CPU, the chipset, the BIOS, devices, interrupts, page-walk hardware, TLBs. The emulator process runs in ring 3, but this is not visible to the guest, which sees emulated/virtual rings (0 and 3) monitored by the interpreter; the emulator mimics interrupts by assigning values to the register variables on a violation, based on the instruction it is interpreting, doing in software what the CPU would do at each stage. The emulator reads an instruction from an address and analyses it, and every time a register, e.g. EDX, comes up, it reads the EDX variable (the emulated EDX). It mimics the operation of the CPU, which is slow because there are multiple software operations for each single operation that is usually handled transparently by the CPU.
If the guest attempts to access a virtual address, the dynamic recompiler takes this guest virtual address and traverses the guest page table (mimicking a TLB-miss page walker) using the vCR3, then reads directly from each physical address produced by the vCR3 plus the guest-virtual-address part, using the emulator process's own page table, whose CR3 it has no control over: as far as the host OS is concerned, the "guest physical address" is just a virtual address in the process (guest physical maps to host virtual by adding an offset and then acting like a host virtual address, i.e. an implicit P2M table). If the dynamic recompiler detects an invalid bit on a guest PTE as it traverses using the vCR3, it simulates a page fault to the guest, putting the address in the vCR2.
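A toy interpreter loop makes the "registers as variables" point concrete. The two-opcode guest ISA below is invented purely for illustration; a real emulator like Bochs is vastly more elaborate, but the shape of the fetch/decode/execute loop is the same:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { OP_HALT = 0x00, OP_ADD = 0x01 };   /* hypothetical guest ISA */

    typedef struct {
        uint32_t regs[8];   /* emulated general-purpose registers */
        uint32_t pc;        /* emulated program counter           */
        uint8_t  mem[256];  /* emulated guest "physical" memory   */
    } Cpu;

    static void run(Cpu *cpu)
    {
        for (;;) {
            uint8_t op = cpu->mem[cpu->pc++];         /* fetch + decode */
            switch (op) {
            case OP_ADD: {                            /* ADD rd, rs */
                uint8_t rd = cpu->mem[cpu->pc++] & 7;
                uint8_t rs = cpu->mem[cpu->pc++] & 7;
                cpu->regs[rd] += cpu->regs[rs];       /* execute in software */
                break;
            }
            case OP_HALT:
                return;
            default:                                  /* would inject a fault */
                return;
            }
        }
    }

    int main(void)
    {
        Cpu cpu = {0};
        uint8_t prog[] = { OP_ADD, 1, 2, OP_HALT };
        cpu.regs[1] = 2;
        cpu.regs[2] = 3;
        memcpy(cpu.mem, prog, sizeof prog);
        run(&cpu);
        printf("emulated R1 = %u\n", cpu.regs[1]);    /* prints 5 */
        return 0;
    }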
Software Virtualised Type 2 Hypervisors
Full virtualisation, which is a type 1 hypervisor scheme, can actually be used in type 2 hypervisors as well; it is a step up in performance from full emulation and can only be used if the guest ISA is the same as the host ISA. Full virtualisation cannot be achieved on x86, because:
There are certain flaws in the implementation of ring 1 in the x86 architecture that were never fixed. Certain instructions that should trap in ring 1 do not. This affects, for example, the LGDT/SGDT, LIDT/SIDT, or POPF/PUSHF instruction pairs. Whereas the "load" operation is privileged and can therefore be trapped, the "store" instruction always succeeds. If the guest is allowed to execute these, it will see the true state of the CPU, not the virtualized state. The CPUID instruction also has the same problem.
Actually, this applies to ring 3 too; it's not just a glitch with ring 1. SGDT etc. is not a privileged instruction, but allowing the VM to execute it contradicts the Popek and Goldberg requirements, because the VM can read the real state of the CPU and get the address of the real GDT rather than the virtual one. Before UMIP, software full virtualisation was not possible on x86, and before Intel VT, x86 CPUs didn't inherently conform to Popek and Goldberg's requirements, so paravirtualisation had to be used. Paravirtualisation still does not conform to Popek and Goldberg (because only kernel-mode code is patched, so SGDT can still be used), but at least it works, whereas full virtualisation doesn't work at all, because SGDT will read a bogus value (the host GDT address) in guest kernel mode, meaning guest kernel code using SGDT will not work as desired unless it is patched. SGDT being available in user mode at least doesn't compromise the host OS, whereas LGDT definitely would.
VirtualBox uses ring 1 full virtualisation, but paravirtualises the problematic instructions that act as if they were executing in ring 0 despite being in ring 1, and it requires the help of a ring 0 driver; the driver functions as the hypervisor. Surprisingly, there is very little information on how type 2 hypervisors are implemented. The following is my best guess on the matter -- how I would implement a type 2 hypervisor given the hardware and the way the host OS operates.
On Windows, I'd imagine that when the driver starts, it initialises symbolic links and waits for the user-mode VirtualBox software to issue IOCTLs using DeviceIoControl to start a virtual machine instance. The handler performs the following process: the driver injects a handler into the IDT for the general protection fault. It can do this by putting a wrapper around KiInterruptDispatch, replacing KiInterruptTemplate in the IDT with the wrapper. On Windows, it could inject a wrapper into all IDT entries, including bug-check entries, but this means hooking into the IDT write routines for new interrupts. What it probably does to achieve this is read the virtual address in the IDTR and write-protect the region, so that host updates to the IDT trap into the hypervisor's GPF wrapper, which then installs a wrapper at the IDT entry being written.
However, a 64-bit Windows guest on a 64-bit Windows host needs to be able to have its own kernel space, but the problem is that it will be at exactly the same location as the host kernel structures. Therefore, the driver needs to wipe the whole kernel view of the VirtualBox process; this cannot be mapped in or visible to the guest. It does this by removing the entries from the cr3 page of the VirtualBox process.
The GDT and IDT used by the VirtualBox process and other host processes need to be the same, but in order to avoid reserving guest virtual addresses, when the guest writes to the IDTR, the hypervisor could use this as the actual IDTR value but virtually map it in the SPT to the same physical 4 KiB IDT frame that the host uses. This means that the hypervisor driver needs to change the IDTR when switching between the guest and host threads. Because the guest virtual page that maps the IDT is write-protected, any writes to this range by the guest will be logged by the hypervisor in a guest IDT that it builds, provided the cr3 belongs to one of its guests' processes. The issue is that when the ISR is handled, it will jump to a hypervisor RIP that is not mapped into the process, because the driver lies in the host kernel; therefore, the RIP of this wrapper needs to be mapped in the SPT.
This means you can't get away with reserving no virtual memory in the guest, and for that reason you could probably get away with reserving the 4 KiB address range the host uses for its IDT and silently redirecting guest accesses to a different host physical page, and then not having to change the IDTR on a task switch. All reserved memory for the handlers in the host IDT would also have to be silently redirected to different host physical pages (because they will be supervisor pages, they will fault anyway, and the hypervisor just redirects the reads and writes to a different host physical page; this won't happen after an interrupt, because the CPU will then be in ring 0, so the jump through the IDT will use the real host physical page mapped there, as it doesn't GPF and the hypervisor can't redirect), so the guest is unaware that that region is reserved.
There will be a different wrapper for each IDT entry, each of which calls a main handler, which also needs to be mapped, passing an IDT entry code. The wrapper will pass the cr3 in a register, change the cr3 to that of a dummy process that maps the host kernel, and then call the main handler. The main handler checks whether the cr3 is a guest's shadow cr3 or the host cr3 and performs the appropriate action.
The driver will also have to inject itself into the clock interrupt in the same way: if the clock interrupt fires, the guest state or host state (which includes the current cr3) is pushed, and the hypervisor handler pushes the address of the guest IDT's clock-interrupt entry onto the kernel stacks of all vCPU threads it manages (emulating what the CPU would do) in a new trap frame, if one isn't already present, and then calls the original host handler after changing the cr3 to one that maps the host kernel. This ensures a context switch in the guest every time it is scheduled in on the host, so the guest clock interval roughly matches the host clock interval.
Full virtualisation is often referred to as 'trap and emulate', but it is not full emulation, because all ring 3 code actually runs on the host CPU (as opposed to full emulation, where the code that runs is the interpreter, which fetches guest instructions to read). Also, the TLBs and page-walk hardware are used directly, whereas in an emulator every memory access requires a software walk if it is not present in an emulated TLB array. Only the privileged instructions and registers, interrupts, devices and BIOS are emulated for the guest -- partial emulation -- emulation still occurs, but when any amount of the code runs natively, it becomes referred to as virtualisation (full, para or hardware-assisted).
When the guest traps into the guest OS, it will use either INT 0x2e or syscall. The hypervisor has obviously injected a wrapper at 0x2e for INT, and it will insert a handler at SYSENTER_CS_MSR:SYSENTER_EIP_MSR for sysenter, or at the IA32_LSTAR MSR for syscall. The handler in the MSR needs to be mapped in the SPT and will check whether the cr3 is the shadow of one of the guest processes; if it isn't, it doesn't need to change the cr3, as the current one already contains the host kernel, and it jumps to the host handler. If it is the cr3 of a guest process, it changes the cr3 to that of a dummy process (probably a VirtualBox host process specifically for I/O tasks that maps the host kernel) and jumps to a main handler, passing the RIP in the guest IDT that it has built to the recompiler/patcher. The recompiler walks through and paravirtualises certain instructions that aren't guaranteed to trap, replacing them permanently with jumps to hypervisor memory where it places better code (which will cause protection faults, as those pages are ring 0 in the SPT), until it reaches an IRET or sysexit etc.; it then changes the cr3 back to that of the guest and executes an IRET, after putting a ring 1 privilege on the stack, to the RIP in the guest IDT it has built, and then the actual guest ISR executes.
When a trap occurs due to executing a ring 0 instruction in ring 1, or an inserted paravirtualised trap occurs, the ISR injected at the general protection fault entry (the hypervisor ISR) makes sure that the cr3 belongs to a guest process, and if so it claims and handles the issue; if it doesn't, the cr3 doesn't need to be changed to one that includes the host kernel before passing control to the host handler, because execution is in the context of a non-guest process. One instance where this could occur is the guest writing to cr3 for a guest context switch. This needs to be emulated: the guest must not be able to execute this instruction and modify the cr3, because it would change the cr3 of the host process on the host OS. The hypervisor needs to intercept the write and install a new shadow cr3, not the cr3 the guest wants. When the guest reads cr3, the same mechanism prevents the guest from reading the real cr3: the hypervisor inserts the value of the guest-written cr3 (not the shadow one) into the requested register, puts the next instruction address onto the stack, and resumes execution with an iret to the ring it was in.
Guest I/O will be targeted at a guest physical address space that maps onto virtual buffers and registers of emulated devices defined in the hypervisor. These emulated registers (e.g. doorbell registers) are checked in a host context at regular intervals (from the clock-interrupt hook, for instance), in exactly the way a device would react to changes to hardware registers. The handler decides whether an interrupt needs to be emulated (pushing an interrupt onto the kernel stack of the thread representing the selected vCPU, based on the MSI vector assigned by the guest in the emulated configuration space), or whether, due to an emulated register write, an I/O operation needs to be constructed using native Windows API functions against the guest-specified buffer (translating GVA->HPA and letting real hardware write to the physical page backing the guest buffer).
As for paging on a paravirtualised 64-bit type 2 hypervisor, it is a tricky one. The hardware uses a shadow page table (SPT), which is a mapping of GVAs to HPAs. My best guess is that the hypervisor driver selects a shadow cr3 page from the locked pages of the VirtualBox process for every GP fault (executing a ring 0 instruction in ring 1) in which it sees a new guest-assigned cr3 address being written to cr3. It pairs this guest-chosen address with the address of the hypervisor-chosen shadow cr3 page, and changes the VirtualBox process cr3 to the shadow cr3 rather than the guest one that was attempted to be written.
The shadow cr3 page is write-protected by the kernel driver (done via the read/write bit in the recursive PML4 entry pointing to itself). (You'll see it written everywhere that the guest page tables are write-protected, but that has to be wrong, because it is the shadow page tables that run on the CPU and therefore the only ones that can cause protection faults; the shadow cr3 is used, not the guest cr3.) The cr3 page at the GPA the guest attempts to use is translated to its associated HPA by the hypervisor, the entries in the cr3 page are copied to the shadow cr3, and the GPA addresses in the PML4Es are translated to HPAs using the P2M table.
Every time the guest goes to write to its cr3 page by virtual address, that virtual address will actually be the shadow cr3 page, not the guest cr3 page, and the write will fault because of the write-protect bit and because the guest is in ring 1. The handler injected at the general protection fault will then see a shadow cr3 belonging to one of its guest processes, and it performs the write that was conceptually attempted on the guest PTE in the SPT at the same location (where it actually faulted), inserting the host physical address instead of the guest physical address the guest tried to write (translated using the P2M TLB or the P2M; I think the P2M is filled when you start the VM, because the VirtualBox process uses VirtualLock to lock the specified amount of RAM for the virtual machine). The hypervisor can maintain virtual TLBs for the P2M (guest frame to host frame mappings) and for the guest page tables (guest virtual page to guest frame mappings), which it can check before performing software page walks, whereas the hardware maintains the TLBs for the SPT. The hypervisor then checks the virtual TLBs for a quick translation of the CR2 GVA to a GPA; if it is not present, it traces the guest page table (by accessing the guest cr3 via its HVA, translating GPA->HPA using the P2M and then HPA->HVA using a kernel function) and writes to the entry as the guest wanted, with the attempted guest GPA.
When a page fault occurs, the handler checks that the shadow cr3 belongs to one of its guest processes, checks the SPT (getting the virtual address of the entry associated with the faulting GVA using a Windows kernel function, as if it were a regular process), and then walks the guest page table using the guest cr3 associated with the current cr3, parsing the SPT virtual address that faulted (translating GPA -> HPA -> HVA). If the shadow PTE is invalid, then it is a shadow page fault. If the guest PTE is invalid as well, it emulates an interrupt using the RIP of the address in the page-fault entry of the guest IDT, pushing it on the stack; before doing so, it patches the code in the recompiler as described before (when the guest reads from its page table during the interrupt, it would actually be reading the SPT, so the SPT needs to be read-protected with a supervisor bit so the access can be intercepted and the guest page table entry read instead, from the address in the faulting memory access). For any other interrupt that occurs, i.e. from a host device, the interrupt is not meant for the guest; if the handler sees that the current cr3 belongs to a process of one of its guests, it changes the cr3 to that of a dummy process that contains the host kernel mapping and calls the original KiInterruptTemplate for the host handler; after the host handler has finished, it restores the cr3.
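A toy sketch of the shadow-paging bookkeeping described above, under heavy simplifying assumptions (single-level 16-entry tables; the p2m[] array stands in for the real P2M): the CPU walks only the shadow table, and the hypervisor's write-protection fault handler keeps it in sync with the guest's intended writes:

    #include <stdint.h>

    #define ENTRIES 16

    static uint64_t guest_pt[ENTRIES];  /* guest's view: virtual page -> guest frame */
    static uint64_t shadow_pt[ENTRIES]; /* what the MMU walks: virtual page -> host frame */
    static uint64_t p2m[ENTRIES];       /* hypervisor's guest-frame -> host-frame map */

    /* Called from the write-protection fault handler when the guest
     * tries to store 'guest_frame' into entry 'vpn' of its page table. */
    void on_guest_pte_write(unsigned vpn, uint64_t guest_frame)
    {
        guest_pt[vpn]  = guest_frame;        /* the write the guest intended */
        shadow_pt[vpn] = p2m[guest_frame];   /* the translation the CPU sees */
    }

    int main(void)
    {
        p2m[5] = 42;               /* guest frame 5 lives in host frame 42 */
        on_guest_pte_write(1, 5);  /* guest maps its virtual page 1 -> frame 5 */
        /* the MMU now resolves virtual page 1 to host frame 42 */
        return 0;
    }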
Hardware assisted Type 2 Hypervisors
Hardware-assisted type 2 is a further step up in performance and makes the situation a lot less convoluted: it unifies everything into a single interface and automates a lot of the makeshift cr3 juggling and administrative tasks that previously had to be improvised, making it much cleaner. The kernel driver only needs to execute vmxon, wait for guests to register with the driver, and then all VM exit events are handled by a unified handler at a RIP and CR3 that it inserts into the VMCS host state (meaning the handler stub does not need to be mapped into the guest kernel virtual address space). The hardware is specifically designed for this, unlike ring 1, which means the recompiler (the Code Scanning and Analysis Manager (CSAM) and the Patch Manager (PATM)) is not required. It also has things like TSC-scaling and TSC-offset fields, which can be used for guests that employ the TSC, for fairer scheduling. The hypervisor still hooks the clock interrupt to perform I/O updates, and if the currently executing thread is the thread for one of its vCPUs, it will need to vmxoff (which causes a VM exit) and push the address of a reinitialisation sequence in host kernel memory that will vmxon and vmresume the VMCS tied to the vCPU with the guest's saved state in it (but with an emulated clock interrupt in place ready to execute, whose code will use RDTSC, which will VM exit; the offsets in the VMCS can be used by the hypervisor to report a value accounting for the time the guest wasn't scheduled in on the host, i.e. to subtract host time away from it to make the host invisible). It doesn't need to change the cr3, because vmxoff does that automatically, so it can then pass control to the host handler to perform the clock-interrupt handling procedure for the host OS.
If EPT is supported, then the guest-chosen physical addresses (cr3, IDTR, etc.) and page tables run on the actual hardware in VMX non-root mode. GVAs are translated to HPAs as follows: the guest CR3 is used to produce the GPA of the next paging structure, which is then run through the whole EPT using the guest's EPTP to eventually get that structure's HPA, and so on (it's the same process as software virtualisation with the guest page table and the P2M, except the walk is done by actual page-walk hardware, which is faster). When there is a page fault, a VM exit does not occur; the guest-chosen IDTR is in place, so the interrupt is handled in non-root ring 0 using the guest IDT. The guest can update its mapping and the hypervisor doesn't need to intervene. When the access is reattempted, an EPT fault will cause a VM exit with the faulting GPA (the EPT equivalent of cr2) and a pointer to the hypervisor's EPTP for the guest. The hypervisor then updates its mapping and VMRESUMEs to the RIP of the faulting instruction.
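The two-stage translation can be sketched in a few lines of C. The single-level 16-entry "page tables" are an invented simplification; the point is only that every guest frame produced by the guest's own table is itself translated through the hypervisor-owned second-stage table:

    #include <stdint.h>
    #include <stdio.h>

    #define ENTRIES    16
    #define PAGE_SHIFT 12

    static uint64_t guest_pt[ENTRIES]; /* guest-managed: virtual page -> guest frame */
    static uint64_t ept[ENTRIES];      /* hypervisor-managed: guest frame -> host frame */

    /* Translate a guest-virtual address to a host-physical address. */
    static uint64_t translate(uint64_t gva)
    {
        uint64_t guest_frame = guest_pt[(gva >> PAGE_SHIFT) % ENTRIES]; /* stage 1 */
        uint64_t host_frame  = ept[guest_frame % ENTRIES];              /* stage 2 */
        return (host_frame << PAGE_SHIFT) | (gva & 0xFFF);
    }

    int main(void)
    {
        guest_pt[1] = 5;   /* guest maps its virtual page 1 to guest frame 5 */
        ept[5]      = 42;  /* hypervisor maps guest frame 5 to host frame 42 */
        printf("GVA 0x1234 -> HPA 0x%llx\n",
               (unsigned long long)translate(0x1234));   /* prints 0x2a234 */
        return 0;
    }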
In full emulation the I/O devices, CPU, main memory are virtualized.
No, they are emulated in software. Emulated means that their behavior is completely replicated in software.
But what exactly is full virtualization?
With virtualization, you try to run as much code as you can directly on the hardware, to speed up the process.
This is especially a problem with code that had to be run in kernel mode, as that could potentially change the global state of the host (machine the Hypervisor or VMM is running on) and thereby affect other virtual machines.
Without either emulation or virtualization, code runs directly on the hardware. Its instructions are executed natively by the CPU, and its I/O accesses directly access the hardware.
Virtualization is when the guest code runs natively at least some of the time, and only traps to host code running outside the virtual-machine (e.g. a hypervisor) for privileged operations or I/O accesses.
To handle these traps (aka VM exits), the VM may actually emulate what the guest was trying to do. E.g. the guest might be running a device driver for a simple network card, but the NIC is implemented purely in software in the VM. If the VM used a pass-through to send the guest's I/O accesses to a real network card on the host, that would be virtualization of that hardware. (Especially if it did it in a way that let multiple guests use it at once; otherwise it's really just giving it to one guest, not virtualizing it.)
Hardware support for virtualization (like Intel's and AMD's separate x86 virtualization extensions) can let the guest do things that would normally affect the whole machine, like modify the memory mappings in a page table. So instead of triggering a VM exit and making the VM figure out what the guest was doing and then modifying things from the outside to achieve the result, the CPU just has an extra translation layer built in. (See the linked wiki article for a much better but longer description of software-based virtualization vs. hardware-assisted virtualization.)
Pure emulation means that guest code never runs natively, and never sees the "real" hardware of the host. An emulator doesn't need privileged access to the host. (Some might want privileged access to the host for device pass-through, or for raw network sockets to let a guest look like it's really attached to the same network as the host).
An ARM emulator running on an x86 host always has to work this way, because the host hardware can't run ARM instructions in the first place.
But you can still emulate an x86 guest on an x86 host, for example. The fact that the guest and host architectures match doesn't mean the emulator has to take advantage of that fact.
For example, BOCHS is an x86 PC emulator written in portable C++. One of its main uses is for debugging bootloaders and OSes.
BOCHS doesn't care if it's running on an x86 host or not. It's just a C++ program that reads binary files (disk images) and draws in a window (contents of guest video memory). As far as the host is concerned, it's not particularly different from a JPG viewer or a game.
Some emulators use binary translation to JIT-compile the guest code into host code, but this is still emulation, not virtualization. See http://wiki.osdev.org/Emulator_Comparison.
BOCHS is relatively slow, since it reads and decodes guest instructions directly, without doing binary translation. But it tries to do this as efficiently as possible. See How Bochs Works Under the Hood for some of the tricks it uses to efficiently keep track of the guest state. Since emulation is the only option for running x86 software on non-x86 hardware, it's useful to have a high-performance emulator. BOCHS has some very smart and experienced emulator developers working on it, notably Darek Mihocka, who has some interesting articles about optimizing emulation on his site.
This is an attempt to answer my own question.
System Virtualization: Understanding I/O virtualization and the role of the hypervisor
Virtualization
Virtualization as a concept enables multiple/diverse applications to co-exist on the same underlying hardware without being aware of each other.
As an example, full-blown operating systems such as Windows, Linux, Symbian, etc., along with their applications, can coexist on the same platform. All computing resources are virtualized.
What this means is that none of the aforesaid machines has access to the physical resources. The only entity having access to the physical resources is a program known as the Virtual Machine Monitor (aka hypervisor).
Now this is important. Please read and re-read carefully.
The hypervisor provides a virtualized environment to each of the machines above. Since these machines access NOT the physical hardware BUT virtualized hardware, they are known as Virtual Machines.
As an example, the Windows kernel may want to start a physical timer (a system resource). Assume that the timer is memory-mapped I/O. The Windows kernel issues a series of load/store instructions on the timer addresses. In a non-virtualized environment, these loads/stores would have programmed the timer hardware.
However, in a virtualized environment, these load/store accesses of physical resources result in a trap/fault. The trap is handled by the hypervisor. The hypervisor knows that Windows tried to program the timer. The hypervisor maintains timer data structures for each of the virtual machines. In this case, the hypervisor updates the timer data structure it has created for Windows. It then programs the real timer. Any interrupt generated by the timer is handled by the hypervisor first; the data structures of the virtual machines are updated and the latter's interrupt service routines are called.
To cut a long story short, Windows did everything that it would have done in a non-virtualized environment. In this case, its actions resulted NOT in the real system resource being updated, but in the virtual resources (the data structures above) being updated.
Thus all virtual machines think they are accessing the underlying hardware; in reality, unknown to them, all accesses to physical hardware are mediated by the hypervisor.
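As a hedged sketch of the trap-and-emulate path for the timer example (the MMIO address, the structures and the dispatch are all invented for illustration):

    #include <stdint.h>

    #define TIMER_MMIO_BASE 0xFED00000u   /* hypothetical guest timer address */

    struct vm {
        uint32_t timer_reload;   /* the hypervisor's per-VM virtual timer state */
        int      timer_running;
    };

    /* Called from the hypervisor's fault handler when a guest store to
     * an unmapped device address traps. */
    void handle_mmio_write(struct vm *vm, uint32_t gpa, uint32_t value)
    {
        if (gpa == TIMER_MMIO_BASE) {
            vm->timer_reload  = value;   /* update the virtual timer only... */
            vm->timer_running = 1;
            /* ...then the hypervisor programs the one real timer on
             * behalf of whichever VM must fire soonest. */
        }
    }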
Everything described above is full/classic virtualization. Most modern CPUs are unfit for classic virtualization: the trap/fault does not apply to all instructions, so the hypervisor is easily bypassed on modern devices.
This is where para-virtualization comes in. The sensitive instructions in the source code of the virtual machines are replaced by a call to the hypervisor. The load/store snippet above might be replaced by a call such as:
Hypervisor_Service(TIMER_START, GUEST_WINDOWS, 10 /* ms */);
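As a hedged sketch of what such a hypercall might look like from inside a paravirtualised guest kernel (the service number and register convention are invented, not any real hypervisor's ABI, and VMCALL is only legal when actually running under a hypervisor):

    #include <stdint.h>

    enum hv_service { HV_TIMER_START = 1 };   /* hypothetical service ID */

    static inline long hypervisor_service(long service, long arg0, long arg1)
    {
        long ret;
        /* VMCALL transfers control to the hypervisor; which registers
         * carry the arguments is a per-hypervisor convention, assumed here. */
        asm volatile("vmcall"
                     : "=a"(ret)
                     : "0"(service), "b"(arg0), "c"(arg1)
                     : "memory");
        return ret;
    }

    void start_guest_timer(void)
    {
        hypervisor_service(HV_TIMER_START, /*vm_id=*/0, /*interval_ms=*/10);
    }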
EMULATION
Emulation is a topic related to virtualization. Imagine a scenario where a program originally compiled for ARM is made to run on an ATMEL CPU. The ATMEL CPU runs an emulator program which interprets each ARM instruction and emulates the necessary actions on the ATMEL platform. Thus the emulator provides a virtualized environment.
In this case, virtualization of system resources is NOT performed via the trap-and-emulate model.
A more recent response:
From my research, I can say that this is a better way to understand how these concepts came about.
The first concept of emulation actually dates back to the first computer, the Colossus. It was used by the British government in 1941 to mimic the functions of the Nazi Enigma code machine. Emulation theory was developed in 1962 and was conceived by three IBM engineers working from three different angles.
Emulation means to mimic the behavior of a target, which can be hardware, like the emu8086 emulator, or software, like the emulation of a service on some network port.
You want to imitate the set of functions provided by the target, and maybe you are not interested in the internal mechanism.
Why would you want that? To control those functions. Why control them? For many reasons, which are too large a subject to discuss here. But keep in mind that you want to be behind the scenes.
But such a process is costly in performance: for one guest instruction, many other instructions are executed. Maybe you are interested in controlling only some of those instructions, so we would like to permit some of the instructions to execute natively.
So what happens when all of this instruction execution becomes native? Then you have ideal virtualization. You can virtualize any software, and the trend today is to pass from virtualization of operating systems to virtualization of applications. I say "ideal" because software executes differently on each piece of hardware, so some instructions will still need to be emulated. It is important to understand that most virtualization technologies today are not only about virtualization, but also about emulation.
Also notice that in the transition from emulation to virtualization, the input to the system is reduced, because virtualization accepts only software as input. The controller of this flow of instructions is called the hypervisor.
Virtualization may happen at different layers of a computer architecture, which are (from higher to lower): 1: Application, 2: Library, 3: Operating System, 4: Hardware Abstraction Layer (HAL), 5: Instruction Set Architecture (ISA). Below the last layer there is the hardware.
Typically a certain layer utilizes services from a lower layer by using the instructions the lower layer exposes in its interface.
Note that the usage of services is not strictly tied to the layering, in the sense that certain layers can skip the layer immediately below and use instructions from lower layers. As an example, an application may issue certain instructions directly to the HAL layer, skipping the Library and OS layers.
To "emulate an instruction" means to intercept an instruction intended for a certain layer of a (virtual) computer architecture and map it into a sequence of one or more instructions for the same layer of a different (non-virtual) computer architecture.
It is possible to place the virtualization layer at different layers of a computer architecture. This point may introduce confusion.
As an example, when virtualizing at the level of the Hardware Abstraction Layer (e.g. VMware, VirtualBox), a virtual layer is placed between the HAL layer and the Operating System layer. The operating system uses instructions of the virtual HAL layer, and certain virtual ISA instructions are then mapped by the hypervisor to ISA instructions for the physical system. When ALL the instructions are emulated, we talk about full emulation, which is a special case of virtualization. In virtualization we typically try to make a layer execute the instructions of the non-virtual layer directly as much as possible, for performance reasons.
In another example, the virtualization layer is placed over the Operating System (virtualization at the Operating System level): in this case a virtual machine is called a Container (e.g. Docker). It includes the levels from the Application down to the OS (inclusive).
To conclude, emulation relates to single instructions, while "full emulation" happens when we intercept and map ALL the instructions of a certain layer.
Typically, the term "full emulation" is used when the virtualization layer is placed at the ISA level (the lowest level possible). In this case a virtual machine includes all the levels from the Application down to the ISA, and ALL the ISA instructions are intercepted and mapped. This is typically used to virtualize niche products, such as Cisco routers (e.g. with QEMU) or 90's video game consoles, which have a completely different architecture from commonly available computers. Note, however, that there may be "full emulation" at other levels too, which is typically not necessary.
Virtualization and emulation are pretty much the same thing: there is one underlying concept that these two words hint at. That is, the two words are aspects of one thing. This is demonstrated by QEMU, a "Quick Emulator" that performs hardware virtualization.
You can think of that one thing as Simulation. Simulation can also be a confusing word though.
First we can define the common meaning of the words.
Simulation: Making one thing do what another thing does.
Emulation: Making one system replicate another system exactly.
Virtualization: Allow for running of a system within another system.
Now we show that the words all mean pretty much the same thing. For example, in simulation you are creating a replica of one system with another system; that is the common meaning of emulation. In virtualization, you want your virtualized system to act like the real system: ideally it acts like a replica, even though it may be implemented differently and may not "emulate" the hardware exactly. That is pretty much the same as simulation. In an emulation, you simulate another system, etc.
So we can see that the words are somewhat interchangeable. The underlying concept is simulation.
In virtualization, such as operating system virtualization ("virtual machines"), we are creating a system which acts like the operating system. It might use tricks from the underlying hardware, or hypervisors, or other things, for performance and security. But in the end it is just a simulation of an operating system. Typically when the word "virtual machine" is used, it is not an exact replica of the machine (as in an emulator). It just does enough to allow programs to run as you would expect on the real operating system.
In emulation, it is typically meant that the simulation is "exact". In hardware emulation, you replicate all of the features of the hardware system. This means that you have created a simulation of the hardware. You could say that you created a virtualization of the hardware, but here is where virtualization slightly differs. Virtualization implies creating an isolated environment, which emulation doesn't necessarily imply. So a hardware emulator might provide the same interface to the hardware as the hardware itself, but the implementation of the emulator might rely on global memory, so if you try to run two emulators at the same time, they would interfere with each other. This is what virtualization solves, it isolates the simulations.
Hope that helps.
I think it's a common misconception to oppose Virtualization to Emulation when they're not comparable.
What people have in mind when they talk about Virtualization is mostly what type 2 hypervisors do.
According to Wikipedia, virtualization is:
Virtualization or virtualisation (sometimes abbreviated v12n, a numeronym) is the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices, and computer network resources.
This definition suits both emulation and type 2 hypervisors. Therefore, an emulator is one subtype of virtualization, and a type 2 hypervisor is another subtype. Both let you run a virtual machine, but the way they work and the way they're used often differ. Many virtual machines actually rely on both techniques to achieve their goal.
Moreover, emulation doesn't always replicate the original hardware 1:1 (by design, not for lack of documentation). DOSBox, for instance, simulates a kind of PC that never really existed, as do high-level emulators (like the old Ultra HLE). This makes the emulator more efficient (but with the risk of breaking compatibility with some software). Other emulators do this for a different purpose: to expand the capabilities of the original hardware, such as Dolphin letting you run games in 4K, PS1 emulators letting you dramatically improve the quality of the 3D, or, more recently, an SNES emulator with a modified PPU that can output 16:9 graphics, used for a modded Super Mario World patched to run in widescreen.
Some emulators can also use hardware resources like a video card. An example of this is Connectix VirtualPC, an old PC emulator for PowerPC-based Macs. Back then, Macs and PCs both had PCI slots, and Connectix VirtualPC gave you the possibility of using a video card that was physically in your Mac (and which also existed on PC).
I hope this clarifies things.