CPU, Disk, RAM, Ethernet Data Flow [closed] - operating-system

I'm trying to get an understanding of data flow on a modern consumer desktop computer.
Looking first at a SATA port: if you want to load some bytes into RAM, does the CPU send that request to the memory controller, which handles the request to the SATA device? And is that data then moved into RAM by the memory controller, or do the CPU's cache or registers get involved with the data at all?
I assume the OS typically blocks the thread until an I/O request is completed. Does the memory controller send an interrupt to let the OS know it can schedule that thread into the queue again?
Ethernet: assuming the above steps are complete and some bytes of a file have been loaded into RAM, does the memory get moved to the Ethernet controller by the memory controller, or does the CPU get involved in holding any of this data?
What if you use a socket with localhost? Do we just do a roundabout with the memory controller, or do we involve the Ethernet controller at all?
Is a SATA-to-SATA storage transfer buffered anywhere?
I know that is a lot of questions; if you can comment on any of them I would appreciate it! I am really trying to understand the fundamentals here. I have a hard time moving on to higher levels of abstraction without these details...

The memory controller doesn't create requests itself (it has no DMA or bus mastering capabilities).
The memory controller is mostly about routing requests to the right places. For example, if the CPU (or a device) asks to read 4 bytes from physical address 0x12345678, then the memory controller uses the physical address to figure out whether to route that request to a PCI bus, or to a different NUMA node/different memory controller (e.g. using QuickPath, HyperTransport or Omni-Path links to other chips/memory controllers), or to its locally attached RAM chips. If the memory controller forwards a request to its locally attached RAM chips, then the memory controller also handles the "which memory channel" and timing part; and may also handle encryption and ECC (both checking/correcting errors, and reporting them to the OS).
Most devices support bus mastering themselves.
Because most operating systems use paging "contiguous in virtual memory" often doesn't imply "contiguous in physical memory". Because devices only deal with physical addresses and most transfers are not contiguous in physical memory; most devices support the use of "lists of extents". For example, if a disk controller driver wants to read 8 KiB from a disk, then the driver may tell the disk controller "get the first 2 KiB from physical address 0x11111800, then the next 4 KiB from physical address 0x22222000, then the last 2 KiB from physical address 0x33333000"; and the disk controller will follow this list to transfer the pieces of an 8 KiB transfer to the desired addresses.
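As a concrete illustration of such a "list of extents" (often called a scatter-gather list), here is a minimal C++ sketch; the struct and field names are invented for illustration, and real controllers (e.g. AHCI's PRDT) define their own binary layout:

    #include <cstdint>
    #include <vector>

    // One physically contiguous piece of a transfer (hypothetical layout).
    struct Extent {
        uint64_t phys_addr;   // physical (or IOMMU "device") address
        uint32_t byte_count;  // length of this contiguous piece
    };

    // The 8 KiB example from the text, split across three physical ranges:
    std::vector<Extent> transfer = {
        {0x11111800, 2 * 1024},  // first 2 KiB
        {0x22222000, 4 * 1024},  // next 4 KiB
        {0x33333000, 2 * 1024},  // last 2 KiB
    };

The driver would translate a list like this into the controller's native descriptor format and hand it to the device, whose DMA engine then walks the list piece by piece.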
Because devices use physical addresses and almost all software (including kernel, drivers) typically primarily use virtual addresses, something (kernel) will have to convert the virtual addresses (e.g. from a "read()" function call) into the "list of extents" that the device driver/s (probably) need. When an IOMMU is being used (for security or virtualization) this conversion may include configuring the IOMMU to suit the transfer; and in that case its better to think of them as "device addresses" that the IOMMU may convert/translate into physical addresses (where the device uses "device addresses" and not actual physical addresses).
For some (relatively rare, mostly high-end server) cases; the motherboard/chipset may also include some kind of DMA engine (e.g. "Intel QuickData technology"). Depending on the DMA engine; this may be able to inject data directly into CPU caches (rather than only being able to transfer to/from RAM like most devices), and may be able to handle direct "from one device to another device" transfers (rather than having to use RAM as a buffer). However; in general (because device drivers need to work when there's no "relatively rare" DMA engine) it's likely that any DMA engine provided by the motherboard/chipset won't be supported by the OS or drivers well (and likely that the DMA engine won't be supported or used at all).

Looking first at a SATA port: if you want to load some bytes into RAM, does the CPU send that request to the memory controller, which handles the request to the SATA device? And is that data then moved into RAM by the memory controller, or do the CPU's cache or registers get involved with the data at all?
The CPU is not involved during the operation. The transfer is made by the AHCI, which is a PCI DMA device.
The AHCI from Intel (https://www.intel.ca/content/www/ca/en/io/serial-ata/serial-ata-ahci-spec-rev1-3-1.html) is used for SATA disks. Meanwhile, for more modern NVME disks an NVME PCI controller is used (https://wiki.osdev.org/NVMe).
The OS basically writes to RAM addresses to write to the registers of the AHCI, since those registers are memory-mapped. This way, it can tell the AHCI to do things, and tell it to write to RAM at certain positions (probably into a buffer provided/allocated by the user-mode process which asked for the data on disk). The operation is DMA, so the CPU is not really involved.
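As a rough sketch of what "writing to the registers" looks like: the registers are accessed through volatile pointers once the region is mapped. The struct below is a simplified subset for illustration only; the authoritative layout is in the AHCI specification linked below:

    #include <cstdint>

    // Simplified sketch of an AHCI port's memory-mapped registers.
    struct HbaPort {
        volatile uint32_t clb;   // command list base address (low 32 bits)
        volatile uint32_t clbu;  // command list base address (high 32 bits)
        volatile uint32_t fb;    // FIS base address (low 32 bits)
        volatile uint32_t fbu;   // FIS base address (high 32 bits)
        volatile uint32_t is;    // interrupt status
        volatile uint32_t ie;    // interrupt enable
        volatile uint32_t cmd;   // command and status
        // ... the real layout continues
    };

    void start_port(HbaPort* port, uint64_t cmd_list_phys) {
        // Tell the HBA where the command list lives in physical memory;
        // from then on the HBA DMAs the data without CPU involvement.
        port->clb  = static_cast<uint32_t>(cmd_list_phys);
        port->clbu = static_cast<uint32_t>(cmd_list_phys >> 32);
        port->cmd |= 1;  // "start" bit, simplified for illustration
    }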
From C++, you ask for data probably using fstream or direct API calls made to the OS in the library provided by the OS. For example, on Windows, you can use the WriteFile() function (https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile) to write to some files.
Underneath, the library is a thin wrapper which makes system calls (it is the same for the standard C++ library). You can browse my answer at Who sets the RIP register when you call the clone syscall? for more information on system calls and how they work on modern systems.
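For illustration, here is the user-mode end of that chain with fstream (the file name is made up); each call below funnels through the library wrapper into a system call:

    #include <fstream>
    #include <iostream>
    #include <vector>

    int main() {
        std::ifstream file("data.bin", std::ios::binary);  // illustrative path
        if (!file) return 1;

        // read() is a thin wrapper: the library traps into the kernel, the
        // driver programs the disk controller, the controller DMAs into RAM,
        // and this thread sleeps until the completion interrupt arrives.
        std::vector<char> buffer(4096);
        file.read(buffer.data(), buffer.size());
        std::cout << "read " << file.gcount() << " bytes\n";
    }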
The memory controller is not really involved, except insofar as writes to the registers of the AHCI pass through it; Brendan's answer is probably more useful for that matter, because I'm not really aware of the details.
I assume the OS typically blocks the thread until an I/O request is completed. Does the memory controller send an interrupt to let the OS know it can schedule that thread into the queue again?
Yes, the OS blocks the thread, and yes, the AHCI triggers an MSI interrupt on command completion.
The OS will put the process on the queue of processes which are waiting for I/O. The AHCI triggers interrupts using the MSI capability of PCI devices. MSI is a special capability which allows a device to bypass the I/O APIC of modern x86 processors and send an interrupt message directly to a local APIC. The local APIC then looks in the IDT, as you would expect, for the vector to trigger, and makes the processor jump to the associated handler.
It is the vector number which differentiates between the devices that triggered the interrupt, which makes it easy for the OS to place a proper handler for the device (a driver) on that interrupt number.
The OS has a driver model which accounts for the different types of devices used on modern computers. The driver model often takes the form of a virtual filesystem, which presents every device to the upper layers of the OS as a file. The upper layers make open, read, write and ioctl calls on the file. Underneath, the driver does complex things like triggering read/write cycles by writing the registers of the AHCI, and putting the process on queues to wait for I/O.
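A minimal sketch of that "everything is a file" dispatch table; the names are invented for illustration and only loosely mirror real designs such as Linux's file_operations:

    #include <cstddef>

    // Hypothetical operations table a driver registers with the VFS.
    struct FileOps {
        int  (*open) (const char* path, int flags);
        long (*read) (int handle, void* buf, size_t len);
        long (*write)(int handle, const void* buf, size_t len);
        int  (*ioctl)(int handle, unsigned long request, void* arg);
    };

    // A disk driver's read/write would build command lists, program the
    // AHCI registers, and block the caller until the completion interrupt.
    long ahci_read (int, void*, size_t)       { /* program HBA, sleep */ return -1; }
    long ahci_write(int, const void*, size_t) { /* program HBA, sleep */ return -1; }

    FileOps ahci_ops = { nullptr, ahci_read, ahci_write, nullptr };

The upper layers never touch hardware directly; they call through a table like this, and only the driver behind it knows about the controller's registers.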
From user mode (like when calling fstream's open method), it is really a syscall which is made. The syscall handler will check all permissions and make sure that everything is okay with the request before returning a handle to the file.
Ethernet: assuming the above steps are complete and some bytes of a file have been loaded into RAM, does the memory get moved to the Ethernet controller by the memory controller, or does the CPU get involved in holding any of this data?
The Ethernet controller is also a PCI DMA device. It reads and writes in RAM directly.
The Ethernet controller is a PCI DMA device. I never wrote an Ethernet driver, but I can tell you it reads and writes RAM directly. For network communication you have sockets, which act similarly to files but are not part of the virtual filesystem: network cards (including Ethernet controllers) are not presented to the upper layers as files. Instead, you use sockets to communicate with the controller. Sockets are not implemented in the C++ standard library, but they are present on all widespread platforms as OS-provided libraries that can be used from C++ or C, and socket operations are system calls of their own.
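A minimal POSIX-sockets sketch in C++ (the Winsock API is very similar); the port number is made up. This is also the code path exercised in the localhost question below:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main() {
        // socket() and connect() are system calls, not C++ library features.
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return 1;

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(8080);                    // illustrative port
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);  // localhost

        // For 127.0.0.1 the kernel routes the data through the loopback
        // interface: it is copied within RAM, no Ethernet hardware involved.
        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) == 0) {
            const char msg[] = "hello";
            write(fd, msg, sizeof msg - 1);
        }
        close(fd);
    }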
What if you use a socket with localhost? Do we just do a roundabout with the memory controller, or do we involve the Ethernet controller at all?
The Ethernet controller is probably not involved.
If you use sockets with localhost, the OS will simply send the data to the loopback interface. The Wikipedia article is quite direct here (https://en.wikipedia.org/wiki/Localhost):
In computer networking, localhost is a hostname that refers to the current computer used to access it. It is used to access the network services that are running on the host via the loopback network interface. Using the loopback interface bypasses any local network interface hardware.
Is a SATA-to-SATA storage transfer buffered anywhere?
It is transferred to RAM from the first device, then transferred to the second device afterwards.
On the link I provided earlier for the AHCI it is stated that:
This specification defines the functional behavior and software interface of the Advanced Host Controller Interface (AHCI), which is a hardware mechanism that allows software to communicate with Serial ATA devices. AHCI is a PCI class device that acts as a data movement engine between system memory and Serial ATA devices.
The AHCI isn't for moving data from SATA to SATA; it is for moving data from SATA to RAM or from RAM to SATA. A SATA-to-SATA operation thus involves bringing the data into RAM, then moving the data from RAM to the other SATA device.
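Seen from user mode, a disk-to-disk copy makes that RAM staging explicit: each chunk is DMA'd from the source disk into a buffer in RAM, then DMA'd from that buffer out to the destination disk. The paths below are illustrative:

    #include <fstream>
    #include <vector>

    int main() {
        // Source and destination may live on two different SATA disks.
        std::ifstream src("/mnt/disk1/file.bin", std::ios::binary);
        std::ofstream dst("/mnt/disk2/file.bin", std::ios::binary);

        std::vector<char> buffer(1 << 20);  // 1 MiB staging buffer in RAM
        while (src) {
            src.read(buffer.data(), buffer.size());  // disk 1 -> RAM
            dst.write(buffer.data(), src.gcount());  // RAM -> disk 2
        }
    }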

Related

How does the kernel initialize and access the rest of the hardware on x86/64?

I'm interested in the details of how operating systems work and perhaps writing my own.
From what I've gathered, the BIOS/UEFI is supposed to handle setting up the hardware, and do things like memory-mapping (or IO ports for) the graphics card and other IO devices like audio and ethernet.
My question is, how does the kernel know how to access and (re)configure these devices when it's passed control from the bootloader? Are there just conventions like 'the graphics card is always memory mapped from X to Y address space'? Are you at the mercy of a hardware manufacturer writing a driver for an operating system which knows how the hardware will be initialized?
That seems wrong, so maybe the kernel code includes instructions which somehow iterate through all the bus-connected devices. But what instructions can accomplish that? Is the PCI(e) controller also a memory-mapped device? How do you begin querying and setting up the system?
My primary reference has been the Intel 64 Architectures Software Developer's Manual, which has excellent documentation on how the CPU works, but doesn't describe how the system is setup.
I never wrote firmware, so I don't really know how that works in detail. There is probably some memory detection done (an actual iteration over memory) and some interrogation of the PCI devices, which are memory-mapped. The developer's manuals probably also describe how to get information about memory and similar resources.
In the end, the kernel doesn't need to bother about that because the firmware takes care to do all that and to provide temporary drivers before the kernel is set up completely.
The firmware passes information to the kernel using the ACPI tables, so that is the convention you are looking for. UEFI firmware automatically launches the /efi/boot/bootx64.efi EFI app from the hard disk, calling the app's main function; that app is often called the bootloader. When you write that application, often with frameworks such as EDK2 or GNU-EFI, you can use the firmware's temporary drivers to get information like the location of the RSDP, which points to all the other ACPI tables.
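As a rough sketch of that step (assuming GNU-EFI's efi.h/efilib.h helpers; the structure mirrors common GNU-EFI examples), a bootloader can find the RSDP by scanning the UEFI configuration table for the ACPI 2.0 GUID:

    #include <efi.h>
    #include <efilib.h>

    EFI_STATUS EFIAPI efi_main(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE* SystemTable) {
        InitializeLib(ImageHandle, SystemTable);

        // Walk the configuration table; the entry tagged with the ACPI 2.0
        // GUID points at the RSDP, which the kernel later uses to find the
        // other ACPI tables.
        void* rsdp = NULL;
        for (UINTN i = 0; i < SystemTable->NumberOfTableEntries; i++) {
            EFI_CONFIGURATION_TABLE* entry = &SystemTable->ConfigurationTable[i];
            if (CompareGuid(&entry->VendorGuid, &Acpi20TableGuid) == 0) {
                rsdp = entry->VendorTable;
                break;
            }
        }
        Print(L"RSDP at %lx\n", (UINT64)rsdp);
        return EFI_SUCCESS;
    }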
The ACPI standard specifies a language, AML, which your kernel can interpret to learn all about the hardware. You thus have all the information required to load drivers and such.
PCI (which covers nearly everything nowadays) is another thing. It works with memory-mapped I/O, and the ACPI tables (the MCFG in particular) are helpful for finding the beginning of the configuration space of PCI devices, which takes the form of memory-mapped registers.
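The MCFG gives the physical base of the PCIe "enhanced configuration access mechanism" (ECAM) region, and each function's 4 KiB configuration window sits at a fixed offset from it. A sketch (assuming the ECAM region has been mapped into the kernel's address space):

    #include <cstdint>

    // ECAM: base + (bus << 20) + (device << 15) + (function << 12) + offset
    uint64_t ecam_address(uint64_t ecam_base, uint8_t bus, uint8_t device,
                          uint8_t function, uint16_t offset) {
        return ecam_base
             + (static_cast<uint64_t>(bus)      << 20)
             + (static_cast<uint64_t>(device)   << 15)
             + (static_cast<uint64_t>(function) << 12)
             + offset;
    }

    // Reading a device's vendor ID (config-space offset 0):
    uint16_t read_vendor_id(uint64_t ecam_base, uint8_t bus, uint8_t dev) {
        auto* reg = reinterpret_cast<volatile uint16_t*>(
            ecam_address(ecam_base, bus, dev, 0, 0));
        return *reg;  // 0xFFFF conventionally means "no device here"
    }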
As for graphics cards, you probably don't want to start with those. They are complex and, at first, you should probably stick to the framebuffer returned by UEFI, and instead write a driver for xHCI, the PCI host controller responsible for interacting with USB devices, including keyboards and mice.

How do operating systems call BIOS functions?

Since we know that the operating system works in protected mode and the BIOS works in real mode (16-bit): when an interrupt is called from either the operating system or an application program, does the CPU switch modes back and forth every time?
In general; hardware is capable of doing lots of things at once (playing sounds while generating 3D graphics while sending data to the network while transferring data to multiple disks while waiting for user input while all CPUs are busy doing actual processing); and BIOS functions are not capable of allowing more than one thing to happen at a time (e.g. a BIOS call will waste 100% of CPU time waiting for a hard disk controller to transfer data, while the CPU does nothing and nothing else can use any other BIOS service for anything).
For this reason alone, BIOS services are not usable and not used by any modern OS (except briefly during boot).
Of course it's not the only reason - no IO prioritisation, no support for hot-plug of any kind, no support for power management, no support for system management (e.g. https://en.wikipedia.org/wiki/S.M.A.R.T. ), no support for GPU, no support for any sound card (other than "PC speaker beep"), no support for networking (excluding PXE), no support for IO APICs, etc. Ironically, out of all of the problems, "BIOS services are designed for real mode" is the least important problem because it's the only problem that an OS can work around.
Instead, each OS has native drivers that don't have any of these limitations.
Note: this is partly why it's relatively easy for modern operating systems to support UEFI (where none of the BIOS services exist at all) - for most, they replace the boot loader/boot code and don't need to change much else.

How does a program control hardware?

In order to be executed by the cpu, a program must be loaded into RAM. A program is just a sequence of machine instructions (like the x86 instruction set) that a processor can understand (because it physically implements their semantic through logic gates).
I can more or less understand how a local instruction (an instruction executed inside the cpu chipset) such as 'ADD R1, R2, R3' works. Even how the cpu interfaces with the ram through the northbridge chipset using the data bus and the address bus is clear enough to me.
What I am struggling with is the big picture.
For example how can a file be saved into an hard disk?
Let's say that the motherboard uses a SATA interface to communicate with the HDD.
Does this mean that this SATA interface has an instruction set which can be used by the cpu by preparing SATA instructions written in the correct format?
Does the same apply with the PCI interface, the AGP interface and so on?
Is all hardware communication basically accomplished by determining a standard interface for some task and implementing it (by the companies that create hardware chipsets) with an instruction set that any other hardware component can query?
Is my high level understanding of hardware and software interaction correct?
Nearly. It's actually more general than an instruction.
A lot of these details are architecture-specific, so I will stick to a high-level, general overview of how this can be done.
The CPU can read and write to RAM without a problem, correct? You can issue instructions that read and write to any memory address. So rather than trying to extend the CPU to understand every possible hardware interface out there, hardware manufacturers simply map sections of the address space (where RAM normally would be) to hardware.
Say, for example, you want to save a file to a hard drive. This is a possible sequence of commands that would occur:
1. The command register of the hard drive controller is at address 0xF00, an address that is outside of RAM but accessible to the CPU.
2. Write the command that indicates we want to write to the hard drive into the command register.
3. There could conceivably be an address register at 0xF01 that tells the hard drive controller where on the disk to save the data.
4. Tell the hard drive controller that the data to write is at some address in RAM, and initiate the write sequence.
There are a myriad of other ways this could conceivably be done, but the important thing to note is that it simply uses the instructions the CPU already has for reading and writing RAM, as the sketch below shows.
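In code, the hypothetical sequence above is nothing more exotic than stores to addresses; the 0xF00/0xF01 registers and the command values are the answer's invented example, not a real controller interface:

    #include <cstdint>

    // Invented register addresses from the example; widths are glossed over.
    auto* cmd_reg  = reinterpret_cast<volatile uint8_t*>(0xF00);
    auto* addr_reg = reinterpret_cast<volatile uint8_t*>(0xF01);

    void save_to_disk(uint8_t disk_block, uint8_t write_command) {
        *addr_reg = disk_block;     // tell the controller where on disk
        *cmd_reg  = write_command;  // issue "write from RAM"; the controller
                                    // then fetches the data by DMA itself
    }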
All of this can be done by the CPU without any special instructions on the CPU side, just read and write to an address. You can imagine this being extended, there is a special place in the address space for the USB controller that contains a list of USB devices, there is a special place for the PCI device list, each PCI devices has several registers that can be read and written to instruct them to do things.
Essentially the role of a device driver is to know how these special registers are to be read and written, what kind of commands devices can accept, etc. Often, as is the case with many graphics cards, what these registers do is not documented to the public and so we rely on their drivers to run the cards correctly.

Address of Video memory

Regarding the address of video memory (0xB8000): who maps video memory to this address?
And the routine which copies the data from that address and puts it on the screen: is it a function built into the processor (does this driver come with the processor)?
What happens when you write to the address:
That area of the address space is not mapped to RAM, instead it gets sent across the system bus to your VGA card. The BIOS set this up with your VGA card at boot time (Lots of address ranges are memory mapped to various devices). No code is being executed on the CPU to plot the pixels when you write to this area of the address space. The VGA card receives this information instead of your RAM and does this itself.
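The classic illustration of this is writing to the VGA text buffer yourself; this only works in an environment (e.g. your own kernel) where physical address 0xB8000 is mapped in, not in a normal user process:

    #include <cstdint>

    // VGA text mode: each cell is 2 bytes, character then colour attribute.
    void print_at(int row, int col, char c, uint8_t attr = 0x0F) {
        auto* vga = reinterpret_cast<volatile uint16_t*>(0xB8000);
        vga[row * 80 + col] =
            static_cast<uint16_t>((attr << 8) | static_cast<unsigned char>(c));
        // The store travels across the bus to the video card, not to RAM;
        // the card draws the glyph with no further CPU involvement.
    }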
If you wanted, you could look at the BIOS function calls and have the BIOS reconfigure the hardware so you could plot pixels instead of placing characters at the video address. You could even probe it to see if it supports VESA and switch to a nice 1280*768 32bpp resolution. The BIOS would then map an area of the address space of your choosing to the VGA card for you.
More about the BIOS:
The BIOS is a program that comes with your motherboard that your CPU executes when it first powers up. It sets up all the hardware, maps all the memory mapped devices, creates various useful tables, assigns IO ports, hooks interrupts up to a bunch of routines it leaves in memory. It then loads your bootsector from a device and jumps to your OS code.
The left behind routines and data structures enable you to get your OS off the ground. You can load sectors off a disk, write text to the screen, get information about the system (Memory maps, ACPI tables, MP Tables .etc). Without these routines and data structures, it would be a lot harder if not impossible to make an acceptable bootsector and have all the information about the system to build a functional kernel.
However, the routines are dated, slow and have very restrictive limitations. For one, the routines left in memory are 16-bit real mode code, so as soon as you switch to 32-bit protected mode you have to constantly switch back, or use VM86 mode, to access them (they are completely inaccessible in 64-bit mode, although emulating the instructions with a modified Linux x86emu library is apparently an option). The routines are also generally very slow. So you will need to write your own drivers from scratch if you move away from real mode programming.
In most cases, the PC monitor is driven by a VGA-compatible device, which by standard includes a text mode whose buffer (32 KB in size) is accessed through MMIO starting at address 0xB8000.

Full emulation vs. full virtualization

In full emulation the I/O devices, CPU, main memory are virtualized. The guest operating system would access virtual devices not physical devices. But what exactly is full virtualization? Is it the same as full emulation or something totally different?
Emulation and virtualization are related but not the same.
Emulation is using software to provide a different execution environment or architecture. For example, you might have an Android emulator run on a Windows box. The Windows box doesn't have the same processor that an Android device does so the emulator actually executes the Android application through software.
Virtualization is more about creating virtual barriers between multiple virtual environments running in the same physical environment. The big difference is that the virtualized environment is the same architecture. A virtualized application may provide virtualized devices that then get translated to physical devices and the virtualization host has control over which virtual machine has access to each device or portion of a device. The actual execution is most often still executed natively though, not through software. Therefore virtualization performance is usually much better than emulation.
There's also a separate concept of a Virtual Machine, such as those that run Java, .NET, or Flash code. They can vary from one implementation to the next and may include aspects of either emulation or virtualization or both. For example, the JVM provides a mechanism to execute Java byte codes. However, the JVM spec doesn't dictate that the byte codes must be executed by software or that they must be compiled to native code. Each JVM can do its own thing, and in fact most JVMs do a combination of both, using emulation where appropriate and a JIT where appropriate (the JIT in Sun/Oracle's JVM is called HotSpot, I think).
A Hypervisor is a supervisor of supervisors i.e. it's the kernel that controls kernels.
Type 1 vs Type 2 vs Hybrid Hypervisors
A Type 1 hypervisor is an OS designed to run VMs. It is installed directly on the disk to be executed from the boot sector like any OS; it is an OS purpose built to manage and run VMs and that's all you can do on it (and like an OS, it can be monolithic or microkernelised). All OSs installed on it run as guests.
A Type 2 hypervisor is a hypervisor that runs on top of an OS that's designed to run applications. It runs either as an application (full emulation), or by modifying the kernel with a driver that gives the OS the functionality to run VMs (virtualisation). The driver installs itself below/alongside the host OS invisibly, and the host OS continues to run (in ring 0 in the case of software virtualisation, and in non-VMX-mode ring 0 in the case of a driver supporting hardware virtualisation). The hypervisor hooked below it manages the guests in VMX non-root mode (in the case of hardware virtualisation) or in ring 1 (in the case of software virtualisation), passing off to the host OS where appropriate and calling into the host OS and its drivers to access hardware (which is why it is often pictured as sitting above the host OS). The GUI program on the host OS communicates with the driver, and there is a subprocess per VM and a thread per vCPU.
A hybrid hypervisor is an OS that's designed to run both applications and VMs. It can run in regular host-OS mode, but it also has a hypervisor mode which, when booted into, loads the host OS as a guest on top of a hypervisor and can load other guests. The hypervisor is typically microkernelised, meaning the hardware drivers are implemented in the host OS (called the parent partition) rather than in the hypervisor (on Hyper-V, the Integration Services Components drivers can be installed on other guests to communicate with host OS drivers via the VMBus system that the host OS sets up). The host OS runs in VMX non-root mode with a VMCS. Theoretically you could have a paravirtualised hybrid hypervisor, but KVM and Hyper-V only support hardware virtualisation; you could also have a monolithic hybrid hypervisor, but that doesn't make much sense, because the presence of the host OS means it only needs to be microkernelised. A hybrid hypervisor is essentially a type 1 hypervisor that can boot into type 1 hypervisor mode and host-OS mode separately. A microkernelised hypervisor is typically hybrid because the host OS used is the one already installed (and the microkernelised hypervisor functionality is already part of it -- it's available as a feature install on Windows Server).
Fully emulated Type 2 Hypervisors
A full emulator emulates all registers of the target ISA as variables, and the CPU is completely emulated. This can be because the guest ISA is not the same as the host's (or it can be the same, as when you run an x86 emulator such as Bochs on an x86 system; it doesn't matter). As Peter says, the emulator does not need privileged access (a ring 0 driver helper), because all interpretation and emulation is done locally in the process, and the process calls regular host I/O functions. This works because none of the guest code needs to run natively; if you want it to run natively, you have to bring this functionality into ring 0 via a driver. Full emulation is an emulation of everything: the CPU, the chipset, the BIOS, devices, interrupts, the page-walk hardware, the TLBs.

The emulator process runs in ring 3, but this is not visible to the guest, which sees emulated/virtual rings (0 and 3) monitored by the interpreter; the interpreter emulates interrupts by assigning values to the register variables on a violation, based on the instruction it is interpreting, mimicking what the CPU would do at each stage, but in software. The emulator reads an instruction from an address, analyses it, and every time a register, e.g. EDX, comes up, it reads the EDX variable (the emulated EDX). It mimics the operation of the CPU, which is slow, because there are multiple software operations for a single operation that is usually handled transparently by the CPU.

If the guest attempts to access a virtual address, the dynamic recompiler takes this guest virtual address and traverses the guest page table (mimicking a TLB-miss page walker) using the vCR3, then reads directly from each physical address the walk produces. It has no control over the emulator process's own page table and CR3: as far as the host OS is concerned, a guest "physical" address is just a virtual address in the process (guest physical maps to host virtual by adding an offset and then behaving like a host virtual address; an implicit P2M table). If the dynamic recompiler detects an invalid bit in a guest PTE as it traverses using the vCR3, it simulates a page fault to the guest, putting the address in the vCR2.
Software Virtualised Type 2 Hypervisors
Full virtualisation, which is a type 1 hypervisor scheme, can actually be used by type 2 hypervisors as well; it is a step up in performance from full emulation, and can only be used if the guest ISA is the same as the host ISA. Classic full virtualisation cannot be achieved on x86 because:
There are certain flaws in the implementation of ring 1 in the x86 architecture that were never fixed. Certain instructions that should trap in ring 1 do not. This affects, for example, the LGDT/SGDT, LIDT/SIDT, or POPF/PUSHF instruction pairs. Whereas the "load" operation is privileged and can therefore be trapped, the "store" instruction always succeed. If the guest is allowed to execute these, it will see the true state of the CPU, not the virtualized state. The CPUID instruction also has the same problem.
Actually, this applies to ring 3 too; it's not just a glitch with ring 1. SGDT etc. are not privileged instructions, but allowing the VM to execute them contradicts the Popek and Goldberg requirements, because the VM can read the real state of the CPU and get the address of the real GDT rather than the virtual one. Before UMIP, software full virtualisation was not possible on x86, and before Intel VT, x86 CPUs didn't inherently conform to Popek and Goldberg's requirements, so paravirtualisation had to be used. Paravirtualisation still does not conform to Popek and Goldberg (because only kernel-mode code is patched, so SGDT can still be used from user mode), but at least it works, whereas full virtualisation doesn't work at all: SGDT would read a bogus value (the host GDT address) in guest kernel mode, meaning guest kernel code using SGDT would not work as desired unless patched. SGDT being available in user mode at least doesn't compromise the host OS, whereas LGDT definitely would.
VirtualBox uses ring 1 full virtualisation, but paravirtualises the problematic instructions that act like they are executing in ring 0 despite being in ring 1, and requires the help of a ring 0 driver; the driver functions as the hypervisor. Surprisingly, there is very little information on how type 2 hypervisors are implemented. The following will be my best guess on the matter -- how I would implement a type 2 hypervisor given the hardware and host OS operation.
On Windows, I'd imagine that when the driver starts it initialises symbolic links and waits for the user-mode VirtualBox software to issue IOCTLs using DeviceIoControl to start a virtual machine instance. The handler then performs the following process: the driver injects a handler into the IDT for the general protection fault. It can do this by putting a wrapper around KiInterruptDispatch, replacing KiInterruptTemplate in the IDT with the wrapper. On Windows, it could inject a wrapper into all IDT entries, including the bug-check entries, but this means hooking into the IDT write routines for new interrupts. What it probably does to achieve this is read the virtual address in the IDTR and write-protect that region, so that host updates to the IDT trap into the hypervisor's GPF wrapper, which then installs a wrapper at the IDT entry being written.
However, a 64 bit windows guest on a 64 bit windows host needs to be able to have its own kernel space, but the problem is, it will be at exactly the same location as the host kernel structures. Therefore, the driver needs to wipe the whole kernel view of the virtualbox process. This cannot be mapped in or visible to the guest. It does this by removing the entries from the cr3 page of the virtualbox process. The GDT and IDT used by the virtualbox process and other host processes needs to be the same, but in order to avoid reserving guest virtual addresses, when the guest writes to the IDTR, the hypervisor could use this as the actual IDTR value, but virtually map it in the SPT to the same physical 4KiB IDT frame that the host uses. This means that the hypervisor driver needs to change the IDTR when switching between the guest and host threads. Because the guest virtual page that maps the IDT is write protected, any writes to this range by the guest will be logged by the hypervisor in a guest IDT that it builds if the cr3 is of one of its guests' processes. The issue is that when the ISR is handled, it will jump to a hypervisor RIP that is not mapped into the process because the driver lies in the host kernel; therefore, the RIP of this wrapper needs to be mapped into the SPT. This means you can't get away with reserving no virtual memory in the guest, and for that reason, you could probably get away with reserving the 4KiB address range the host uses for its IDT and silently redirecting guest accesses to a different host physical page and then not having to change the IDTR on a task switch. All reserved memory for the handlers in the host IDT would also have to be redirected silently to different host physical pages (because they will be supervisor pages so they will fault anyway and the hypervisor just redirects the reads and writes to a different host physical page, which won't happen after an interrupt because it will be in ring 0, so the jump in the IDT will be in the real host physical page mapped to it as it doesn't GPF so the hypervisor can't redirect), so the guest is unaware that that region is reserved. There will be a different wrapper for each IDT entry which will call a main handler which also needs to be mapped and pass an IDT entry code. The handler will pass the cr3 in a register, change the cr3 to a dummy process that maps the host kernel and then it will call the main handler. The handler checks the cr3 and if it is a guests shadow cr3 or host cr3 and perform the appropriate action.
The driver will also have to inject itself into the clock interrupt in the same way -- if the clock interrupt fires, the guest state or host state (which includes current cr3) is pushed and the hypervisor handler will push the address of the guest IDT clock interrupt onto the kernel stacks of all vCPU threads it manages (emulating what the CPU would do) in a new trap frame if there isn't one already present and then call the original host handler after changing the cr3 to one that maps the host kernel. This would ensure a context switch in the guest every time it is scheduled in on the host and therefore guest clock interval would roughly match up to host clock interval.
Full virtualisation would be referred to as 'trap and emulate', but it is not full emulation because all ring 3 code actually runs on the host CPU (as opposed to full emulation where the code that runs is the interpreter which fetches lines to read). Also, the TLBs and page walk hardware are actually used directly whereas on the emulator, every memory access requires a walk in software if not present in an emulated TLB array in software. Only the privileged instructions and registers, interrupts, devices and BIOS are emulated to the guest -- partial emulation -- emulation still occurs, but when any amount of the code runs natively, it becomes referred to as a virtualisation (full, para or hardware assisted).
When the guest traps into the guest OS it will either use INT 0x2e or syscall. The hypervisor obviously has injected a wrapper at 0x2e for INT and it will insert a handler at the SYSENTER_CS_MSR:SYSENTER_EIP_MSR for sysenter or IA32_LSTAR MSR for syscall. The handler in the MSR needs to be mapped into the SPT and will check to see if the cr3 is the shadow of one of the guest processes and if it isn't it doesn't need to change cr3 as the current will contain the host kernel and jumps to the host handler. If it is a cr3 of a guest process, it changes the cr3 to a dummy process (probably a virtualbox host process specifically for IO tasks that maps the host kernel) and jumps to a main handler, passing RIP in guest IDT that it has built to the recompiler/patcher which walks through and paravirtualises certain instructions that aren't guaranteed to trap, replacing them permanently with jumps to hypervisor memory where it places better code (which will cause protection faults as they're ring 0 in the SPT) until it reaches a IRET or sysexit etc and then it changes back the cr3 to that of the guest and executes an IRET after putting a ring 1 privilege on the stack to the RIP in the guest IDT it has built and then the actual guest ISR executes. When a trap occurs due to executing a ring 0 instruction in ring 1 or an inserted paravirtualised trap occurs, the ISR injected at the general protection fault entry / hypervisor ISR will make sure that the cr3 is of a guest process and it will claim and handle the issue, if it isn't then the cr3 doesn't need to be changed to one that includes host kernel in order to pass control to the host handler because it will be in the context of a non guest process. One instance where this could occur is the guest writing to cr3 for a guest context switch. This needs to be emulated as the guest must not be able to execute this instruction and modify the cr3 because it would change the cr3 of the host process on the host OS; the hypervisor needs to incept the write and write a new shadow cr3 and not the cr3 the guest wants. When the guest reads cr3, this mechanism prevents the guest from reading the real cr3 and the hypervisor inserts the value of the guest inserted cr3 (not the shadow one) into the requested register, inserts next instruction address onto the stack and resumes execution with an iret to the ring it was in.
Guest I/O will be targeted at a guest physical address space that maps onto virtual buffers and registers of emulated devices defined in the hypervisor. These emulated registers (e.g. doorbell registers) will be checked in a host context at regular intervals (clock interrupt hook for instance) in the exact same way a device would react to changes to hardware registers and the handler will decide whether an interrupt needs to be emulated (pushing an interrupt onto the kernel stack of the thread representing the selected vCPU to interrupt based on MSI vector assigned by guest in the emulated configuration space) or, due to an emulated register write, an I/O operation needs to be constructed using the Native windows API functions to the guest specified buffer (translating GVA->HPA and allowing real hardware to write to the physical page that the guest buffer will use).
As for paging on a paravirtualised 64 bit type 2 hypervisor, it is a tricky one. The hardware uses a shadow page table (SPT) which is a mapping of GVAs to HPAs. My best guess is that the hypervisor driver selects a shadow cr3 page from the locked pages of the virtualbox process for every GP fault (executing ring 0 instruction in ring 1) that it sees a new guest assigned cr3 address being written to cr3. It pairs this guest chosen address with the address of the hypervisor chosen shadow cr3 page and changes the virtualbox process cr3 to that of the shadow cr3 rather than the guest one that was attempted to be written. The shadow cr3 page (you'll see written everywhere that the guest page tables are write protected but it just has to be wrong because it is the shadow page tables that run on the CPU and therefore are the only ones that can cause protection faults; the shadow cr3 is used not the guest cr3) is write protected by the kernel driver (which is done by read/write bit in the recursive PML4 entry to itself). The cr3 page of certain GPA that the guest attempts to use will be translated to its associated HPA by the hypervisor and the entries in the cr3 page will be copied to the shadow cr3 and GPA addresses in the PML4Es will be translated to the HPAs using the P2M table. Every time the guest goes to write this to the guest cr3 page by virtual address, this virtual address will always be of the shadow cr3 page, not the guest cr3 page, and it will fault because of the write protect bit and being in ring 1. The handler injected at the general protection fault will then see a shadow cr3 of one of its guests processes and it will perform the write that was conceptually attempted in guest PTE, in the SPT at the same location (where it actually faulted), and it inserts the host physical address instead of the guest physical address that it tried to write (which it translates using the P2M TLB or P2M; I think the P2M is filled when you start the VM, because the VirtualBox process uses VirtualLock to lock the specified amount of RAM for the virtual machine) (the hypervisor can maintain virtual TLBs for the P2M (guest frame to host frame mappings) and guest page tables (guest virtual page to guest frame mappings), which it can check before performing software page walks, whereas the hardware maintains the TLBs for the SPT). Then the hypervisor will check the virtual TLBs for a quick translation of the CR2 GVA to a GPA; if not present it will trace the guest page table (by accessing the guest cr3 via its HVA (translates GPA->HPA using P2M and then HPA->HVA using a kernel function)) and write to the entry as the guest wanted with the attempted guest GPA. When a page fault occurs, the handler checks the shadow cr3 is one of its guest processes and checks the SPT (gets virtual address of entry associated with faulting GVA using Windows kernel function, as if it were a regular process) and then walks the guest page table using the guest cr3 associated with the current cr3, parsing the SPT virtual address that faulted (translates GPA -> HPA -> HVA). If the shadow PTE is invalid then it is a shadow page fault. 
If the guest PTE is invalid as well then it emulates an interrupt using the RIP of the address in the page fault entry of the guest IDT pushing it on the stack; before it does this it patches the code in the recompiler as described before (when guest reads from its page table during the interrupt, it will actually be reading the SPT and therefore the SPT needs to be read protected with a supervisor bit so it can be intercepted and the guest page table entry can be read instead from the address in the faulting memory access). For any other interrupt that occurs i.e. a host device, it is not meant for the guest and therefore if the handler sees the current cr3 belongs to a process of one of its guests it will change the cr3 to a dummy process that contains the host kernel mapping and calls the original KiInterruptTemplate for the host handler; after the host handler has finished, it will replace the cr3.
Hardware assisted Type 2 Hypervisors
Hardware assisted type 2 is a further step up in performance and makes the situation a lot less convoluted: it unifies everything into a single interface and automates lots of the makeshift cr3 juggling and administrative tasks that previously had to be improvised, making it a lot cleaner. The kernel driver only needs to execute vmxon, wait for guests to register with the driver, and then all VM exit events will be handled by a unified handler at a RIP and CR3 it inserts into the VMCS host state (meaning the handler stub does not need to be mapped in the guest kernel virtual address space). It is specifically designed for this, unlike ring 1, which means the recompiler (Code Scanning and Analysis Manager (CSAM) and Patch Manager (PATM)) is not required. It also has things like TSC scaling and TSC offset fields, which can be used for guests which employ the TSC, for fairer scheduling. The hypervisor still hooks the clock interrupt to perform I/O updates, and if the currently executing thread is the thread for one of its vCPUs, it will need to vmxoff (which will cause a VM exit) and push the address of a reinitialisation sequence in host kernel memory that will vmxon and vmresume the VMCS tied to the vCPU with the guest's saved state in it (but with an emulated clock interrupt in place ready to execute, whose code will use RDTSC, which will VM exit; the offset fields in the VMCS can be used by the hypervisor to report a value accounting for time the guest wasn't scheduled in on the host, i.e. to subtract host time away from it to make the host invisible). It doesn't need to change the cr3, because the vmxoff does that automatically, so now it can pass control to the host handler to perform the clock interrupt handling procedure for the host OS.
If EPT is supported, then the guest-chosen physical addresses (cr3, IDTR etc.) and the guest page tables run on the actual hardware in VMX non-root mode. GVAs are translated to HPAs as follows: the guest CR3 gives the GPA of the top-level guest page table, which is then run through the whole EPT using the guest's EPTP to eventually get the HPA of that table, and so on for each level (it's the same process as software virtualisation with the guest page table and the P2M, except the page walk is done by the actual page-walk hardware, which is faster). When there is an ordinary page fault, a VM exit does not occur: the guest-chosen IDTR is in effect, so the interrupt gets handled in non-root ring 0 using the guest IDT. The guest can update its own mappings and the hypervisor doesn't need to intervene. When the access is reattempted, an EPT violation will cause a VM exit with the faulting guest-physical address (the EPT equivalent of cr2) and a pointer to the hypervisor's EPTP for the guest. The hypervisor then updates its mapping and VMRESUMEs to the RIP of the faulting instruction.
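Schematically, the two-stage walk looks like this; the walk functions below are identity stubs standing in for what the page-walk hardware does in VMX non-root mode, not a real interface:

    #include <cstdint>

    using GVA = uint64_t;  // guest virtual address
    using GPA = uint64_t;  // guest physical address
    using HPA = uint64_t;  // host physical address

    // Stubs standing in for 4-level radix-tree walks done by hardware.
    GPA walk_guest_tables(HPA /*guest_root*/, GVA va) { return va; }  // guest PTEs
    HPA walk_ept(HPA /*eptp*/, GPA pa) { return pa; }                 // EPT PTEs

    HPA translate(HPA eptp, GPA guest_cr3, GVA va) {
        HPA root = walk_ept(eptp, guest_cr3);    // locate the guest's tables
        GPA gpa  = walk_guest_tables(root, va);  // stage 1: GVA -> GPA
        return walk_ept(eptp, gpa);              // stage 2: GPA -> HPA
    }

In real hardware, every guest-physical access during stage 1 (each table entry the walker touches) is itself translated through the EPT, which is why nested page walks touch many more entries than native walks.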
In full emulation the I/O devices , CPU , main memory are virtualized.
No, they are emulated in software. Emulated means that their behavior is completely replicated in software.
But what exactly is full virtualization?
With virtualization, you try to run as much code as you can directly on the hardware, to speed up the process.
This is especially a problem with code that has to run in kernel mode, as that could potentially change the global state of the host (the machine the hypervisor or VMM is running on) and thereby affect other virtual machines.
Without either emulation or virtualization, code runs directly on the hardware. Its instructions are executed natively by the CPU, and its I/O accesses directly access the hardware.
Virtualization is when the guest code runs natively at least some of the time, and only traps to host code running outside the virtual-machine (e.g. a hypervisor) for privileged operations or I/O accesses.
To handle these traps (aka VM exits), the VM may actually emulate what the guest was trying to do. E.g. the guest might be running a device driver for a simple network card, but the NIC is implemented purely in software in the VM. If the VM used a pass-through to send the guest's I/O accesses to a real network card on the host, that would be virtualization of that hardware. (Especially if it did it in a way that let multiple guests use it at once; otherwise it's really just giving it to one guest, not virtualizing it.)
Hardware support for virtualization (like Intel's and AMD's separate x86 virtualization extensions) can let the guest do things that would normally affect the whole machine, like modify the memory mappings in a page table. So instead of triggering a VM exit and making the VM figure out what the guest was doing and then modifying things from the outside to achieve the result, the CPU just has an extra translation layer built in. (See the linked wiki article for a much better but longer description of software-based virtualization vs. hardware-assisted virtualization.)
Pure emulation means that guest code never runs natively, and never sees the "real" hardware of the host. An emulator doesn't need privileged access to the host. (Some might want privileged access to the host for device pass-through, or for raw network sockets to let a guest look like it's really attached to the same network as the host).
An ARM emulator running on an x86 host always has to work this way, because the host hardware can't run ARM instructions in the first place.
But you can still emulate an x86 guest on an x86 host, for example. The fact that the guest and host architectures match doesn't mean the emulator has to take advantage of that fact.
For example, BOCHS is an x86 PC emulator written in portable C++. One of its main uses is for debugging bootloaders and OSes.
BOCHS doesn't care if it's running on an x86 host or not. It's just a C++ program that reads binary files (disk images) and draws in a window (contents of guest video memory). As far as the host is concerned, it's not particularly different from a JPG viewer or a game.
Some emulators use binary translation to JIT-compile the guest code into host code, but this is still emulation, not virtualization. See http://wiki.osdev.org/Emulator_Comparison.
BOCHS is relatively slow, since it reads and decodes guest instructions directly, without doing binary translation. But it tries to do this as efficiently as possible. See How Bochs Works Under the Hood for some of the tricks it uses to efficiently keep track of the guest state. Since emulation is the only option for running x86 software on non-x86 hardware, it's useful to have a high-performance emulator. BOCHS has some very smart and experienced emulator developers working on it, notably Darek Mihocka, who has some interesting articles about optimizing emulation on his site.
This is an attempt to answer my own question.
System Virtualization : Understanding IO virtualization and role of hypervisor
Virtualization
Virtualization as a concept enables multiple/diverse applications to co-exist on the same underlying hardware without being aware of each other.
As an example, full blown operating systems such as Windows, Linux, Symbian etc along with their applications can coexist on the same platform. All computing resources are virtualized.
What this means is none of the aforesaid machines have access to physical resources. The only entity having access to physical resources is a program known as Virtual Machine Monitor (aka Hypervisor).
Now this is important. Please read and re-read carefully.
The hypervisor provides a virtualized environment to each of the machines above. Since these machines access NOT the physical hardware BUT virtualized hardware, they are known as Virtual Machines.
As an example, the Windows kernel may want to start a physical timer (a system resource). Assume that the timer is memory-mapped I/O. The Windows kernel issues a series of load/store instructions on the timer's addresses. In a non-virtualized environment, these loads/stores would have resulted in the timer hardware being programmed.
However, in a virtualized environment, these load/store accesses to physical resources result in a trap/fault. The trap is handled by the hypervisor. The hypervisor knows that Windows tried to program the timer. The hypervisor maintains timer data structures for each of the virtual machines; in this case, it updates the timer data structure it has created for Windows. It then programs the real timer. Any interrupt generated by the timer is handled by the hypervisor first: the data structures of the virtual machines are updated, and the latter's interrupt service routines are called.
To cut a long story short, Windows did everything that it would have done in a non-virtualized environment. In this case, its actions resulted in NOT the real system resource being updated, but virtual resources (the data structures above) getting updated.
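A schematic of that trap-and-emulate path; every name here is invented to mirror the prose, not taken from a real hypervisor:

    #include <cstdint>

    // Per-VM bookkeeping: the hypervisor keeps a virtual timer per guest
    // instead of handing any guest the real one.
    struct VirtualTimer {
        bool     armed       = false;
        uint64_t interval_ms = 0;
    };

    struct Vm {
        VirtualTimer timer;
    };

    void program_real_timer(uint64_t interval_ms) {
        // Only the hypervisor touches the real timer hardware (stubbed here).
        (void)interval_ms;
    }

    // Invoked when a guest's store to the timer's MMIO address traps.
    void on_timer_mmio_write(Vm& guest, uint64_t value_written) {
        guest.timer.armed       = true;        // update the virtual state first
        guest.timer.interval_ms = value_written;
        program_real_timer(value_written);     // then program the real device
    }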
Thus all virtual machines think they are accessing the underlying hardware; in reality, unknown to them, all accesses to physical hardware are mediated by the hypervisor.
Everything described above is full/classic virtualization. Most modern CPUs are unfit for classic virtualization. The trap/fault does not apply to all instructions. So the hypervisor is easily bypassed on modern devices.
Here is where para-virtualization comes in. The sensitive instructions in the source code of the virtual machines are replaced by calls to the hypervisor. The load/store snippet above might be replaced by a call such as:
Hypervisor_Service(TIMER_START, WINDOWS_VM, 10 /* ms */);
EMULATION
Emulation is a topic related to virtualization. Imagine a scenario where a program originally compiled for ARM is made to run on an Atmel CPU. The Atmel CPU runs an emulator program which interprets each ARM instruction and emulates the necessary actions on the Atmel platform. Thus the emulator provides a virtualized environment.
In this case, virtualization of system resources is NOT performed via the trap-and-emulate model.
A more recent response:
From my research, I can say that this is a better way to understand how these concepts appeared.
The concept of emulation is sometimes traced back to the wartime Colossus machine, which the British government used against German cipher machines. Emulation theory proper was developed in 1962, conceived by three IBM engineers working from three different angles.
Emulation means to mimic the behavior of a target, which can be hardware (like the emu8086 emulator) or software (like the emulation of a service on some network port).
You want to imitate the set of functions provided by the target; you may not be interested in its internal mechanism.
Why would you want that? To control those functions. Why control them? For many reasons; that is too large a subject to discuss here. But keep in mind that you want to sit behind things.
Such a process is costly for performance, though: for one guest instruction, a lot of other instructions are executed. Maybe you are only interested in controlling some of those instructions, so we would like to permit the rest of the instructions to execute natively.
So what happens when all instruction execution becomes native? Then you have ideal virtualization. You can virtualize any software, and the trend today is to pass from virtualization of operating systems to virtualization of applications. I say "ideal" because the same software executes differently on each piece of hardware, so some instructions will still need to be emulated. It is important to understand that most virtualization technologies today are not only about virtualization, but also about emulation.
Also notice that in this transition from emulation to virtualization, the input of the system is reduced, because virtualization accepts only software as input. The controller of this flow of instructions is named the hypervisor.
Virtualization may happen at different layers of a computer architecture, which are (from higher to lower): 1. Application, 2. Library, 3. Operating System, 4. Hardware Abstraction Layer (HAL), 5. Instruction Set Architecture (ISA). Below the last layer is the hardware.
Typically a certain layer utilizes services from a lower layer by using the instructions the lower layer exposes in its interface.
Note that the usage of services is not strictly tied to the layering, in the sense that certain layers can skip the layer immediately below and use instructions from lower layers. As an example, an application may issue certain instructions directly to the HAL layer, skipping the Library and OS layers.
To "emulate an instruction" means to intercept an instruction intended for a certain layer of a (virtual) computer architecture and map it into a sequence of one or more instructions for the same layer of a different (non-virtual) computer architecture.
It is possible to place the virtualization layer at different layers of a Computer Architecture. This point may introduce confusion.
As an example, when virtualizing at the level of the Hardware Abstraction Layer (e.g. VMware, VirtualBox), a virtual layer is placed between the HAL and the operating system. The operating system uses instructions of the virtual HAL layer, and certain virtual ISA instructions are mapped by the hypervisor to ISA instructions for the physical system. When ALL the instructions are emulated, we talk about full emulation, which is a special case of virtualization. In virtualization we typically try to let a layer execute the instructions of the non-virtual layer directly as much as possible, for performance reasons.
In another example, the virtualization layer is placed above the operating system (virtualization at the operating-system level): in this case a virtual machine is called a container (e.g. Docker). It includes the levels from the application down to the OS (included).
To conclude, emulation relates to single instructions, while "full emulation" happens when we intercept and map ALL the instructions of a certain layer.
Typically, the term "full emulation" is used when the virtualization layer is placed at the ISA level (the lowest level possible). In this case a virtual machine includes all the levels from the application down to the ISA, and ALL ISA instructions are intercepted and mapped. This is typically used to virtualize niche products, such as Cisco routers (e.g. with QEMU) or 90s video game consoles, which have a completely different architecture from commonly available computers. Note however that there may be "full emulation" at other levels too, which is typically not necessary.
Virtualization and Emulation are pretty much the same thing. There is one underlying concept that these two words hint at. That is, these two words are aspects of one thing. This is demonstrated in QEMU, a Quick Emulator that performs hardware virtualization.
You can think of that one thing as Simulation. Simulation can also be a confusing word though.
First we can define the common meaning of the words.
Simulation: Making one thing do what another thing does.
Emulation: Making one system replicate another system exactly.
Virtualization: Allow for running of a system within another system.
Now we show that the words all mean pretty much the same thing. For example, in simulation you are creating a replica of one system with another system. That is the common meaning of emulation. In virtualization, you want to have your virtualized system act like the real system. That is, ideally it acts like a replica, even though it may be implemented differently and may not "emulate" the hardware exactly. That is the same as simulation pretty much. In an emulation, you simulate another system, etc..
So we can see that the words are somewhat interchangeable. The underlying concept is simulation.
In virtualization, such as operating system virtualization ("virtual machines"), we are creating a system which acts like the operating system. It might use tricks from the underlying hardware, or hypervisors, or other things, for performance and security. But in the end it is just a simulation of an operating system. Typically when the word "virtual machine" is used, it is not an exact replica of the machine (as in an emulator). It just does enough to allow programs to run as you would expect on the real operating system.
In emulation, it is typically meant that the simulation is "exact". In hardware emulation, you replicate all of the features of the hardware system. This means that you have created a simulation of the hardware. You could say that you created a virtualization of the hardware, but here is where virtualization slightly differs. Virtualization implies creating an isolated environment, which emulation doesn't necessarily imply. So a hardware emulator might provide the same interface to the hardware as the hardware itself, but the implementation of the emulator might rely on global memory, so if you try to run two emulators at the same time, they would interfere with each other. This is what virtualization solves, it isolates the simulations.
Hope that helps.
I think it's a common misconception to oppose Virtualization to Emulation when they're not comparable.
What people have in mind when they talk about Virtualization is mostly what type 2 hypervisors do.
According to wikipedia, virtualization is :
Virtualization or virtualisation (sometimes abbreviated v12n, a numeronym) is the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices, and computer network resources.
This definition suits both emulation and type 2 hypervisor. Therefore, an Emulator is a subtype of virtualization, and Type 2 Hypervisor is another subtype. Both let you run a virtual machine, but the way they work and the way they're used often differ. Many virtual machines actually rely on both techniques to achieve their goal.
Moreover, emulation doesn't always replicate the original hardware 1:1 (by design, not for lack of documentation). DOSBox, for instance, simulates a kind of PC that doesn't really exist, as do high-level emulators (like the old UltraHLE). This makes the emulator more efficient (but with the risk of breaking compatibility with some software). Other emulators do this for a different purpose: to expand the capabilities of the original hardware (such as Dolphin, which lets you run games in 4K; PS1 emulators that dramatically improve the quality of the 3D; or, more recently, a SNES emulator with a modified PPU that can output 16:9 graphics, used for a modded Super Mario World patched to run in widescreen).
Some emulators can also use hardware resources like a video card. An example of this is Connectix Virtual PC, an old PC emulator for PowerPC-based Macs. Back then Macs and PCs both had PCI slots, and Connectix Virtual PC gave you the possibility of using a video card that was physically in your Mac (and which also existed for PC).
I hope this clarifies things.