How does VirtualBox-like virtualization work? (some technical details required) - operating-system

First consider the situation when there is only one operating system installed. Now I run some executable. Processor reads instructions from the executable file and preforms these instructions. Even though I can put whatever instructions I want into the file, my program can't read arbitrary areas of HDD (and do many other potentially "bad" things).
It looks like magic, but I understand how this magic works. Operating system starts my program and puts the processor into some "unprivileged" state. "Unsafe" processor instructions are not allowed in this state and the only way to put the processor back to "privileged" state is give the control back to kernel. Kernel code can use all the processor's instructions, so it can do potentially unsafe things my program "asked" for if it decides it is allowed.
Now suppose we have VMWare or VirtualBox on Windows host. Guest operating system is Linux. I run a program in guest, it transfers control to guest Linux kernel. The guest Linux kernel's code is supposed to be run in processor's "privileged" mode (it must contain the "unsafe" processor instructions!). But I strongly doubt that it has an unlimited access to all the computer's resources.
I do not need too much technical details, I only want to understand how this part of magic works.

This is a great question and it really hits on some cool details regarding security and virtualization. I'll give a high-level overview of how things work on an Intel processor.
How are normal processes managed by the operating system?
An Intel processor has 4 different "protection rings" that it can be in at any time. The ring that code is currently running in determines the specific assembly instructions that may run. Ring 0 can run all of the privileged instructions whereas ring 3 cannot run any privileged instructions.
The operating system kernel always runs in ring 0. This ring allows the kernel to execute the privileged instructions it needs in order to control memory, start programs, write to the HDD, etc.
User applications run in ring 3. This ring does not permit privileged instructions (e.g. those for writing to the HDD) to run. If an application attempts to run a privileged instruction, the processor will take control from the process and raise an exception that the kernel will handle in ring 0; the kernel will likely just terminate the process.
Rings 1 and 2 tend not to be used, though they have a use.
Further reading
How does virtualization work?
Before there was hardware support for virtualization, a virtual machine monitor (such as VMWare) would need to do something called binary translation (see this paper). At a high level, this consists of the VMM inspecting the binary of the guest operating system and emulating the behavior of the privileged instructions in a safe manner.
Now there is hardware support for virtualization in Intel processors (look up Intel VT-x). In addition to the four rings mentioned above, the processor has two states, each of which contains four rings: VMX root mode and VMX non-root mode.
The host operating system and its applications, along with the VMM (such as VMWare), run in VMX root mode. The guest operating system and its applications run in VMX non-root mode. Again, both of these modes each have their own four rings, so the host OS runs in ring 0 of root mode, the host OS applications run in ring 3 of root mode, the guest OS runs in ring 0 of non-root mode, and the guest OS applications run in ring 3 of non-root mode.
When code that is running in ring 0 of non-root mode attempts to execute a privileged instruction, the processor will hand control back to the host operating system running in root mode so that the host OS can emulate the effects and prevent the guest from having direct access to privileged resources (or in some cases, the processor hardware can just emulate the effect itself without getting the host involved). Thus, the guest OS can "execute" privileged instructions without having unsafe access to hardware resources - the instructions are just intercepted and emulated. The guest cannot just do whatever it wants - only what the host and the hardware allow.
Just to clarify, code running in ring 3 of non-root mode will cause an exception to be sent to the guest OS if it attempts to execute a privileged instruction, just as an exception will be sent to the host OS if code running in ring 3 of root mode attempts to execute a privileged instruction.

Related

How can operating system detect live resize of disk capacity?

I saw the following discussion and had some questions:
live resize of a NVMe drive
If the physical capacity of the nvme device changes (e.g., from 10GB to 20GB), how the operating system detect it without rebooting?
In the above link, re-scanning pci bus is solution.
When the re-scanning be executed, does the operating system ask the nvme device to update its meta-info (e.g., capacity, etc.) ?
How does OS interact with disk specifically? (How to read changed device parameters from the disk, not the old device parameters in memory?)
This is an AWS virtual-machine probably so the disk is actually a virtual-disk. You can't resize a physical disk like you upgrade its capacity physically (you'd need to change the disk).
With that said, this machine probably runs on top of a type 1 hypervisor. What I understand about these is that the virtual machines (VMs) run as processes on a different ring on top of a minimal operating-system (hypervisor). When the VMs execute privileged instructions, it will trigger a protection fault and the hypervisor can thus inspect who actually triggered the fault (was it the guest kernel or a user mode process within the guest kernel?). If it was the guest kernel than it can execute that instruction on behalf of the guest. Otherwise, it will probably do what a real kernel would do (trigger an exception). It can tell the difference because the guest kernel runs in a different ring than the ring 3 (user mode).
With that said, the NVME device isn't PCI it is NVME. The host controller of the NVME drive is PCI. To rescan the NVME drives, you will read/write to some registers that are memory mapped in RAM and ask the NVME PCI host controller what is the size of the different disks that are found. PCI is known to be hot pluggable (similarly to USB) in some cases but mostly not on consumer motherboards. I don't think you'll get any interrupt when a PCI device is hot plugged so you are left doing a rescan of the devices.
For NVME, it depends on the host controller if you'll get an interrupt when a disk is swapped/changes size. As to virtual-disks, it probably depends on a lot of different things. You could definitely be left doing a PCI rescan here. I guess it depends on the hypervisor, on the OS and on the host controller configurations.

How does Virtualization works internally?

I was reading about virtualization and a doubt popped in my head. How does virtualization works internally at Operating System level? The topic was discussed in my class and I came across this.
A virtual box runs like a process with some extra privileges on the host operating system.
My doubt is, If VM is running as a process then who provides this extra privilege to it so that it could actually interfere with the underlying OS and hardware resources.
I read what a Hypervisor is: http://searchservervirtualization.techtarget.com/definition/hypervisor
A hypervisor is the connection between VM and host OS. Running hypervisor on a host OS means we will run it as a user process. Again my doubt is the same.
How can a user process (which is a hypervisor) control the host processor and resources? As far as I knew user processes dont have those rights.
Thanks in advance.

how does a hypervisor knows that a privileged instruction happened inside a VM?

I've started reading about VMM and wondered to myself how does the hypervisor knows a privileged instruction (for ex, cpuid) happened inside a VM and not real OS ?
let's say I've executed cpuid, a trap will occur and a VMEXIT would happen, how does the hypevisor
would know that the instruction happened inside my regular OS or inside a VM ?
First off, you are using the wrong terminology. When an OS runs on top of a hypervisor, the OS becomes the VM (virtual-machine) itself and the hypervisor is the VMM (=virtual machine monitor). A VM can also be called "guest". Thus: OS on top of hypervisor = VM = guest (these expressions mean the same thing).
Secondly, you tell the CPU that it's executing inside the VM from the moment you've executed VMLAUNCH or VMRESUME, assuming you're reading about Intel VMX. When for some reason the VM causes a hypervisor trap, we say that "a VM-exit occured" and the CPU knows it's no longer executing inside the VM. Thus:
Between VMLAUNCH/VMRESUME executions and VM-exits we are in the VM and CPUID will trap (causing a VM-exit)
Between VM-exits and VMLAUNCH/VMRESUME executions we are in the VMM (=hypervisor) and CPUID will NOT TRAP, since we already are in the hypervisor
Instructions that are privileged generate exceptions when executed in user mode. The exception is usually an undefined instruction exception. The hypervisor hooks this exception, inspects the executing instruction and then returns control to the VM. When the host OS calls the same instruction, it is in a supervisor or elevated privilege and usually no exception is generated when it executes the instruction. So generally, these issues are handled by the CPU.
However, if an instruction is not available on the processor (say floating point emulation), then the hypervisor may emulate for the VM and chain to the OS handler if not. Possibly it may even allow the OS to handle the emulation for both VMs and user tasks in the OS.
So generally, this question is unanswerable for a generic CPU. It depends on how the instruction is emulated in the VM. However, the best case is that the hypervisor does not emulate any OS instructions. Emulations will not only slow down the VMs, but the entire CPU, including user processes in the host OS.

Regd Harware assisted Virtualization

I am trying to understand hardware assisted virtualization for a project with ARM CortexA8 and using the ARM Trustzone feature. I am new to this topic therefore I started with Wiki entries to understand more.
Wikipedia explains hardware assisted virtialization and adds a line in the definitionas:
Full virtualization is used to simulate a complete hardware
environment, or virtual machine, in which an unmodified guest
operating system (using the same instruction set as the host machine)
executes in complete isolation.
The text in bold is a bit confusing. How is the same instruction set of the processor used to provide two isolated environment? Can someone explain it? ArmTrustzone manual also talk of a "virtual processor core" to provide security. Please throw some light.
thanks
The phrase "using the same instruction set as the host machine" means that the guest OS is not aware of the virtualization layer and behaves as if it is executed on a real machine (with the same instruction set). This is in contrast to the para-virtualization paradigm in which the guest OS is aware of virtualization and calls some specific VMM functions, i.e. hypercalls.
No, CPU has not additional instructions. Virtual machine instruction set is translated by a hypervisor component called VMM (virtual machine manager) to be executed on the physical CPU.
Physical CPU with assisted Virtualization introduced only a new ring 0 mode called VMX that allow the virtual machine to execute some instructions in ring 0.

Non-Hypervisor Virtualization vs Type2 Hypervisor

According to a marked answer on stackoverflow.com here and another reference here, I understand that :
Hypervisor virtualization = below the OS and a hardware virtualization where the hardware is designed to support virtualization
Non-Hypervisor virtualization = on top of the OS (like an application software), that is purely software virtualization
But we do also have Type1 and Type2 classifications for hypervisors and it seems to me that Type2 is purely Software Virtualization ... so does this mean that Non-Hypervisor Virtualization is equivalent to Type 2 Hypervisor or are there some subtle differences??
Or is it the case that these terms all are loosely defined??
Thanks in advance.
it seems to me that Type2 is purely Software Virtualization
Don't conflate "Type 1 vs Type 2" and "Hardware vs Software" Virtualization. In fact, there is actually a middle ground between hardware and software: There is Full hardware (HVM), "partial" hardware (PVM), and Pure Software (SW).
I'll try to clarify by expanding all 6 combinations:
Type 1 + Full Hardware (HVM) - This allows a hypervisor like Xen HVM to boot an unmodified guest OS. This is actually slow because the hypervisor must decode "telegraph messages" that the guest OS is trying to send to the hardware. (i.e. writing to the disk drive involves repeatedly storing bytes in location 0xblahblah.)
Type 1 + Paravirtualization (PVM) - This is when you modify the guest OS a little to call the Hypervisor directly for some tasks, like disk I/O. This is faster because the guest just says "here, write this page of bytes" and doesn't have to do (virtualized) I/O on each byte. You know you're doing PVM when you install special drivers. Of course, sometimes the OS has virtual drivers built in already. For example, any modern Linux kernel will switch to PVM mode at boot automatically when it detects it's running under Xen, KVM, UML, etc.
Type 1 + Pure Software (SW) - I'm not sure if this exists, but it wouldn't be that hard to build. Since software emulation is slow, the overhead of booting a real OS and running Type 2 isn't a big deal.
Type 2 + Full Hardware (HVM) - This allows you to boot an un-modified Windows under VirtualBox or KVM. You know it's type 2 when you can reboot all your Guests and still play MP3s in the background :)
Type 2 + Paravirtualization (PVM) - This happens any time you install guest drivers, or boot a modern Linux kernel under VirtualBox/KVM.
Type 2 + Pure Software (SW) - early versions of Bochs and Qemu. (Latter versions actually have hardware assisted modes too.) You can tell they are "pure software" because they allow you to run software that you normally can't run without it. (i.e. I've run Windows '95 under Bochs on an ARM processor, and I've booted an ARM distro on an x86 under Qemu.)
There is also another subject that is unlike the above:
Container technology. Containers like Docker/Rkt/LXD don't fit in the above table. Applications in Containers are ordinary programs calling the kernel in ordinary ways, no Hypervisor involved.
It's just that containers use the Kernel features of cgroups and namespaces to make an app "feel" like it's in a VM. Each container gets a 'partitioned' view of the system: It's own filesystem, it's own user IDs, it's own process IDs, it's own hostname + IP address, etc. But from the outside, you can see all processes in all containers with 'ps'.
In my mind, Non-Hypervisor virtualization means a virtualization layer that runs something OTHER than an OS on top of it -- most commonly virtualizing the user-level environment of some other operatoring system. For example, the WINE project is non-hypervisor virtualization -- it allows running win32 programs on a linux (or other) host. There's no attempt to run an actual Windows OS or emulate 'bare' hardware for a virtualized OS. Instead the virtual layer provides the user-level abstractions and system calls for windows directly.
Contrast this with a hypervisor which may be either type 1 (running on bare metal) or type2 (running on an OS) and which provides hardware-level abstractions and which you run an entire OS on top of.
A Hypervisor, by definition, emulates hardware. (That may or may not physically exist) - it may virtualize some as well.
Virtualization intercepts a call and redirects it elsewhere.
They are two different but interrelated topics.
Type 1 Hypervisors run on "bare metal" and sit between the hardware and your virtual operating systems (the hypervisor itself is the operating system). For example, VMWare ESX, Citrix XenServer or Microsoft Hyper-V
Type 2 Hypervisors run on top of your existing operating system and may support either hardware or software virtualization. For example both QEmu and Bochs) emulate an entire CPU, optionally even a different CPU architecture. Both are Type 2 Hypervisors but have significant overhead on performance due to the emulation required.
VMware Workstation/Server/Player/Fusion, Parallels, Virtualbox are all examples of Type 2 hypervisors that support Hardware-assisted Virtualization - this means rather than emulating the CPU instructions, the CPU instructions can pass through directly with no emulation or translation -- effectively running with no loss of performance if the processor supports it.
Next up, non-hypervisor virtualization which is (effectively) application virtualization. The hardware itself is not being emulated in any way at all, the virtualization layer is just intercepting certain system calls and virtualizing those. Examples in this category include VMWare Thinapp, Microsoft App-V and many more. Windows Vista itself virtualizes certain registry and disk writes to areas where the user doesn't have permission to write. This virtualization in Vista is critical for backwards compatibility with many legacy applications.
Finally we have pure emulators - no virtualization is happening here. In this category we have WINE and to some extent Cygwin. Also Bochs fits in this category as well as a Type 2 Hypervisor since there is no virtualization, just hardware emulation. DOSEMU is another one that fits in here.
I'm sure I've missed plenty of examples, but
(I'll post my comment to #answer-16868851 here since I miss few points to fulfill "You must have 50 reputation to comment" requirement)
BraveNewCurrency writes:
Type 1 + Pure Software (SW) - I'm not sure if this exists, but it wouldn't be that hard to build. Since software emulation is slow, the overhead of booting a real OS and running Type 2 isn't a big deal.
So far I've found only one Type 1 hypervisor capable of doing this -- it's VMware ESXi.
vSphere 5 Documentation Center | ESXi Hardware Requirements say:
■ To support 64-bit virtual machines, support for hardware virtualization (Intel VT-x or AMD RVI) must be enabled on x64 CPUs.
Hence, 32-bit guests work without VT-x in it.
As I see zero competition for it camed (either proprietary or opensource), I guess trapping sensitive CPU instructions without VT-x support (that is in Pure Software) is serious challenge in practice.
While following doesn't relate to the original question already, v5.0 (and v4.x) requires 64-bit support from CPU though:
■ ESXi 5.0 will install and run only on servers with 64-bit x86 CPUs.
■ ESXi 5.0 requires a host machine with at least two cores.
Those interested in running Type 1 + SW hypervisor on 32-bit machines (like me) may use it's earlier versions. Minimum system requirements for installing ESXi/ESX (1003661) says:
ESX 3.5.x
The hardware requirements for ESX 3.5.x are the same as those listed in the ESX 3.0.x section, with the following additions.
[...]
ESX 3.0.x
You need these hardware and system resources to install and use ESX 3:
At least two processors:
1500 MHz Intel Xeon and above, or AMD Opteron (32bit mode) for ESX
1500 MHz Intel Xeon and above, or AMD Opteron (32bit mode) for Virtual SMP
1500 MHz Intel Viiv or AMD A64 x2 dual-core processors
+ ESX 3.5 Installation Guide repeats this in following section / subsection:
ESX Server 3 Requirements
This section discusses the minimum and maximum hardware configurations supported by ESX Server 3 version 3.5.
Minimum Server Hardware Requirements
...
Hence, Pure (and 32-bit only) Software :)