Why do we need binary translation in full virtualization? - virtualization

In hardware-assisted virtualization, the guest operating system runs on Ring 0, so it can run privileged instructions directly, am I right?
So why, in full virtualization, doesn't the VMM just run guest privileged instructions on Ring 0? Why do we need to translate them?
One reason that comes to mind is different architectures (different guest and host). Is there anything more?

so it can run privileged instructions directly, am I right?
No, that is not entirely true. Privileged instructions still attempt to access privileged resources, and they cannot be allowed to see or change those resources behind the VMM's back, so they trap. That is why a classic VMM executes guests with the "trap-and-emulate" approach: the majority of guest instructions, which are non-privileged, are executed directly, while privileged ones trap and are emulated one by one. No translation, that is, no transformation of larger (more than one guest instruction) blocks of code, is required in any case.
Alternatively, a system resource can be made non-privileged, so that instructions accessing it become innocuous inside the virtualized environment.
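To make the trap-and-emulate idea concrete, here is a minimal sketch of how a VMM's trap handler might emulate a few privileged guest instructions against virtual CPU state. The names and types (vcpu_t, decode, handle_trap) are hypothetical, invented for illustration; a real VMM would decode actual guest machine code and maintain far more shadow state than shown here.

```c
#include <stdint.h>

/* Hypothetical virtual CPU state kept by the VMM on the guest's behalf. */
typedef struct {
    uint64_t cr3;        /* the guest's *view* of the page-table base */
    int      intr_flag;  /* the guest's *view* of the interrupt flag  */
} vcpu_t;

typedef enum { OP_MOV_TO_CR3, OP_CLI, OP_STI, OP_OTHER } privop_t;

/* Decode the trapping instruction at the guest's program counter.
 * A real VMM would fetch and parse guest memory here; this stub only
 * marks where that step belongs. */
static privop_t decode(const vcpu_t *vcpu, uint64_t guest_rip, uint64_t *operand)
{
    (void)vcpu; (void)guest_rip; *operand = 0;
    return OP_OTHER;
}

/* Called whenever a guest privileged instruction traps into the VMM. */
void handle_trap(vcpu_t *vcpu, uint64_t guest_rip)
{
    uint64_t operand;
    switch (decode(vcpu, guest_rip, &operand)) {
    case OP_MOV_TO_CR3:
        vcpu->cr3 = operand;   /* update the virtual CR3 only; the real CR3
                                  (and shadow page tables) stay under VMM control */
        break;
    case OP_CLI: vcpu->intr_flag = 0; break;  /* virtual, not physical, interrupts */
    case OP_STI: vcpu->intr_flag = 1; break;
    default:     /* unknown: interpret further or inject a fault into the guest */ break;
    }
    /* advance the guest past the emulated instruction and resume it */
}
```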
So why, in full virtualization, doesn't the VMM just run guest privileged instructions on Ring 0?
"Ring 0" is just a number; it does not mean much except that certain instructions receive new semantics: instead of faulting, as they would on the higher rings, they are allowed to access system resources. But inside a VMM, they are not allowed to do that.
Why do we need to translate them?
We don't; individual privileged instructions may be trapped and then emulated, or interpreted. "Translation" as a term has meaning only for blocks of instructions.
One reason that comes to mind is different architectures
That is a sort of degenerate case in which 100% of guest instructions are "privileged", i.e. they will not behave as expected on the chosen host. It does not make sense to attempt executing them directly, and interpreting each and every one of them is too slow for many applications. This is where translation, i.e. compilation of bigger blocks, starts making sense.
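As a toy illustration of what "translation of bigger blocks" means (using an invented two-instruction guest ISA, not any real architecture), the sketch below translates a whole guest basic block once into host operations and then executes the translated block repeatedly, instead of re-decoding each guest instruction on every pass:

```c
#include <stdio.h>
#include <stddef.h>

typedef struct { long regs[4]; } guest_t;     /* toy guest register file */
typedef void (*host_op)(guest_t *);

/* Host implementations of the two toy guest instructions. */
static void op_inc_r0(guest_t *g)    { g->regs[0]++; }
static void op_add_r1_r0(guest_t *g) { g->regs[1] += g->regs[0]; }

/* "Translate" a guest basic block (an array of opcode bytes) into a block
 * of host ops. A real binary translator emits host machine code instead. */
static size_t translate(const unsigned char *guest_block, size_t n, host_op *out)
{
    size_t emitted = 0;
    for (size_t i = 0; i < n; i++) {
        switch (guest_block[i]) {
        case 0x01: out[emitted++] = op_inc_r0;    break;
        case 0x02: out[emitted++] = op_add_r1_r0; break;
        }
    }
    return emitted;
}

int main(void)
{
    const unsigned char block[] = { 0x01, 0x01, 0x02 };   /* inc; inc; add */
    host_op translated[8];
    size_t n = translate(block, sizeof block, translated);

    guest_t g = { {0, 0, 0, 0} };
    for (int pass = 0; pass < 3; pass++)      /* translated once, executed many times */
        for (size_t i = 0; i < n; i++)
            translated[i](&g);

    printf("r0=%ld r1=%ld\n", g.regs[0], g.regs[1]);
    return 0;
}
```

A real translator also caches translated blocks, but the amortization idea is the same: pay the translation cost once per block, not once per executed instruction.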
Is there anything more?
For the Intel architecture, there are certain architectural idiosyncrasies that sometimes make the idea of (temporarily) disabling hardware-assisted virtualization and falling back to binary translation beneficial in terms of speed and correctness. However, I assume this topic belongs to another, more specific question, as the answer is quite involved and requires a deep understanding of Intel VT-x.

Related

Who decides which instructions are to be kept privileged? Is it the hardware manufacturer or the OS developers?

I read that there are some privileged instructions in our system that can be executed only in kernel mode. But I am unable to understand who makes these instructions privileged. Is it the hardware manufacturer that hardwires some harmful instructions as privileged with the help of a mode bit, or is it the OS designers that make instructions privileged so they work only in privileged mode?
Kernel vs. user mode, and which instructions aren't allowed in user mode, is part of the ISA. That's baked into the hardware.
CPU architects usually have a pretty good idea of what OSes need to do and want to prevent user-space from doing, so these choices at least make privilege levels possible, i.e. make it impossible for user-space to simply take over the machine.
But that's not the full picture: on some ISAs, such as x86, later ISA extensions have added control-register flag bits that let the OS choose whether some other instructions are privileged or not. On x86 that's done for instructions that could leak information about kernel ASLR, or make timing side-channels easier.
For example, rdpmc (read performance monitor counter) can only be used from user-space if specially enabled by the kernel. rdtsc (Read TimeStamp Counter) can be read from user-space by default, but the TSD (TimeStamp Disable) flag in CR4 can restrict its use to priv level 0 (kernel mode). Stopping user-space from using high-resolution timing is a brute-force way of defending against timing side-channel attacks.
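As a small illustration (x86-specific, GCC/Clang inline assembly), the sketch below reads the TSC from user space with rdtsc, which is normally unprivileged; on a kernel that sets CR4.TSD, the very same instruction would instead raise #GP and the process would typically be killed with SIGSEGV:

```c
#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));  /* EDX:EAX = timestamp counter */
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = read_tsc();
    uint64_t t1 = read_tsc();
    printf("tsc delta: %llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}
```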
Another x86 extension defends against leaking kernel addresses to make kernel ASLR more secret: CR4.UMIP (User Mode Instruction Prevention) disables instructions like sgdt that read the virtual address of the GDT. Those instructions were basically useless for user-space in the first place and, unlike rdtsc, could easily always have been privileged.
The Linux Kernel option to enable use of this extension describes it:
The User Mode Instruction Prevention (UMIP) is a security feature in newer Intel processors. If enabled, a general protection fault is issued if the SGDT, SLDT, SIDT, SMSW or STR instructions are executed in user mode. These instructions unnecessarily expose information about the hardware state.
The vast majority of applications do not use these instructions. For the very few that do, software emulation is provided in specific cases in protected and virtual-8086 modes. Emulated results are dummy.
Setting a new address for the IDT/GDT/LDT (e.g. lgdt/lidt) is of course a privileged instruction; those let you take over the machine. But until kernel ASLR was a thing, there wasn't any reason to stop user-space from reading the address. It could be in a page that had its page-table entry set to kernel only, preventing user-space from doing anything with that address. (... until Meltdown made it possible for user-space to use a speculative side-channel to read data from kernel-only pages that were hot in cache.)
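To see what UMIP actually blocks, here is a minimal x86-64/GCC sketch that executes sgdt from user space. Without UMIP it prints the GDT's linear address; with CR4.UMIP enabled it either faults with SIGSEGV or, where the kernel chooses to emulate the instruction, returns the dummy values described in the Kconfig text above:

```c
#include <stdint.h>
#include <stdio.h>

/* 64-bit sgdt stores a 10-byte descriptor-table register image. */
struct __attribute__((packed)) dtr {
    uint16_t limit;
    uint64_t base;      /* linear address of the GDT */
};

int main(void)
{
    struct dtr gdtr = { 0, 0 };
    __asm__ volatile ("sgdt %0" : "=m"(gdtr));   /* unprivileged unless UMIP is on */
    printf("GDT base=0x%016llx limit=0x%04x\n",
           (unsigned long long)gdtr.base, gdtr.limit);
    return 0;
}
```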

What is the difference between user mode and kernel mode in terms of the total number of machine instructions available?

I read this paragraph from "Modern Operating Systems" by Tanenbaum:
Most computers have two modes of operation: kernel mode and user mode. The operating system is the most fundamental piece of software and runs in kernel mode (also called supervisor mode). In this mode it has complete access to all the hardware and can execute any instruction the machine is capable of executing. The rest of the software runs in user mode, in which only a subset of the machine instructions is available.
I am unable to understand how they can describe the difference between these two modes on the basis of the machine instructions available. At the user end, software seems able to make changes at the hardware level: we have software which can affect the functioning of the CPU, or play with registry details. So how can we say that in user mode only a subset of machine instructions is available?
The instructions that are available only in kernel mode tend to be very few. They are the instructions needed to manage the system.
For example, most processors have a HALT instruction that stops the CPU; it is used for system shutdowns. Obviously you would not want any user to be able to execute HALT and stop the computer for everyone. Such instructions are therefore made accessible only in kernel mode.
Processors use a table of handlers for interrupts and exceptions. The operating system creates such a table, listing the handlers for these events, and then loads a register (or registers) giving the location (and size) of the table. The instructions for loading these registers are kernel-mode only. Otherwise, any application could create total havoc on the system.
Instructions of this nature trigger an exception if executed in user mode.
Such instructions tend to be few in number.
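A concrete way to see this on Linux/x86 (a small demo using GCC inline assembly) is to try the privileged hlt instruction from an ordinary program: the CPU raises a general-protection fault and the kernel delivers SIGSEGV, which the program catches here only to report what happened:

```c
#include <signal.h>
#include <unistd.h>

static void on_fault(int sig)
{
    (void)sig;
    static const char msg[] = "hlt faulted: not allowed in user mode\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);
    _exit(0);                     /* don't return, or hlt would be retried */
}

int main(void)
{
    signal(SIGSEGV, on_fault);    /* #GP in user mode is delivered as SIGSEGV */
    __asm__ volatile ("hlt");     /* privileged: traps instead of halting     */
    write(STDOUT_FILENO, "not reached\n", 12);
    return 0;
}
```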
Well, in user mode there is definitely only a subset of instructions available. This is the reason we have system calls.
Example:
A user wants to create a new process in C. He cannot do that without entering kernel mode, because the instructions for creating a new process are only available in kernel mode. So he uses the system call fork, which executes those instructions on his behalf. A system call is thus a mechanism for requesting a service from the kernel of the OS, to do something for the user which he or she cannot write code for directly.
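For reference, this is what that looks like from the user side on a POSIX system: fork() and waitpid() are thin wrappers that trap into the kernel, which performs the actual process creation and bookkeeping in kernel mode.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();          /* traps into the kernel, which builds the new process */
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        printf("child:  pid=%d\n", getpid());
    } else {
        waitpid(pid, NULL, 0);   /* another syscall: wait for the child to finish */
        printf("parent: pid=%d created child %d\n", getpid(), pid);
    }
    return 0;
}
```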
The following excerpt from the above link sums it up best:
A program is usually limited to its own address space so that it cannot access or modify other running programs or the operating system itself, and is usually prevented from directly manipulating hardware devices (e.g. the frame buffer or network devices).
However, many normal applications obviously need access to these components, so system calls are made available by the operating system to provide well defined, safe implementations for such operations. The operating system executes at the highest level of privilege, and allows applications to request services via system calls, which are often initiated via interrupts. An interrupt automatically puts the CPU into some elevated privilege level, and then passes control to the kernel, which determines whether the calling program should be granted the requested service. If the service is granted, the kernel executes a specific set of instructions over which the calling program has no direct control, returns the privilege level to that of the calling program, and then returns control to the calling program.
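To tie the excerpt to something concrete, here is a minimal Linux example that invokes the write system call directly through glibc's syscall() wrapper; the wrapper executes the CPU's system-call instruction, the privilege level is raised, and the kernel performs the I/O before dropping back to user mode:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "written by the kernel on our behalf\n";
    /* SYS_write: the kernel validates the fd and buffer, then does the I/O. */
    syscall(SYS_write, STDOUT_FILENO, msg, sizeof msg - 1);
    return 0;
}
```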

OS memory isolation

I am trying to write a very thin hypervisor that would have the following restrictions:
runs only one operating system at a time (i.e. no OS concurrency, no hardware sharing, no way to switch to another OS)
it should only be able to isolate some portions of RAM (doing some memory translation behind the OS's back - let's say I have 6GB of RAM; I want Linux / Windows not to use the first 100MB, to see just 5.9GB, and to use that without knowing what's behind)
I searched the Internet, but found close to nothing on this specific matter, as I want to keep as little overhead as possible (the current hypervisor implementations don't fit my needs).
What you are looking for already exists, in hardware!
It's called IOMMU[1]. Basically, like page tables, adding a translation layer between the executed instructions and the actual physical hardware.
AMD calls it IOMMU[2]; Intel calls it VT-d (please google "intel vt-d" - I cannot post more than two links yet).
[1] http://en.wikipedia.org/wiki/IOMMU
[2] http://developer.amd.com/documentation/articles/pages/892006101.aspx
Here are a few suggestions / hints, which are necessarily somewhat incomplete, as developing a from-scratch hypervisor is an involved task.
Make your hypervisor "multiboot-compliant" at first. This will enable it to reside as a typical entry in a bootloader configuration file, e.g., /boot/grub/menu.lst or /boot/grub/grub.cfg.
You want to set aside your 100MB at the top of memory, e.g., from 5.9GB up to 6GB. Since you mentioned Windows, I'm assuming you're interested in the x86 architecture. The long history of x86 means that the first few megabytes are filled with all kinds of legacy device complexities; there is plenty of material on the web about the "hole" between 640K and 1MB. Older ISA devices (many of which still survive in modern systems in "Super I/O chips") are restricted to performing DMA to the first 16 MB of physical memory. If you try to get in between Windows or Linux and its relationship with these first few MB of RAM, you will have a lot more complexity to wrestle with. Save that for later, once you've got something that boots.
As physical addresses approach 4GB (2^32, hence the physical memory limit on a basic 32-bit architecture), things get complex again, as many devices are memory-mapped into this region. For example (referencing the other answer), the IOMMU that Intel provides with its VT-d technology tends to have its configuration registers mapped to physical addresses beginning with 0xfedNNNNN.
This is doubly true for a system with multiple processors. I would suggest you start on a uniprocessor system, disable the other processors from within the BIOS, or at least manually configure your guest OS not to enable the other processors (e.g., for Linux, include 'nosmp' on the kernel command line -- e.g., in your /boot/grub/menu.lst).
Next, learn about the "e820" map. Again, there is plenty of material on the web, but perhaps the best place to start is to boot up a Linux system and look near the top of the output of 'dmesg'. This is how the BIOS communicates to the OS which portions of the physical memory space are "reserved" for devices or other platform-specific BIOS/firmware uses (e.g., to emulate a PS/2 keyboard on a system with only USB I/O ports).
One way for your hypervisor to "hide" its 100MB from the guest OS is to add an entry to the system's e820 map. A quick and dirty way to get things started is to use the Linux kernel command line option "mem=" or the Windows boot.ini / bcdedit flag "/maxmem".
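If you take the memory-map route, a sketch of the idea (assuming a Multiboot 1 style memory map as passed by GRUB; struct layout per that spec, error handling omitted) is to walk the map before handing it to the guest and shave the hypervisor's 100MB off the highest usable-RAM entry:

```c
#include <stdint.h>

#define HIDDEN_BYTES (100ULL * 1024 * 1024)   /* 100 MB reserved for the hypervisor */
#define MMAP_TYPE_AVAILABLE 1                 /* e820 "usable RAM"                  */

struct mb_mmap_entry {
    uint32_t size;        /* size of this entry, not counting this field */
    uint64_t base_addr;
    uint64_t length;
    uint32_t type;
} __attribute__((packed));

/* Shrink the highest usable-RAM entry by HIDDEN_BYTES so the guest never
 * sees that region. Returns the base of the hidden region, or 0 if none. */
uint64_t hide_top_of_ram(void *mmap_addr, uint32_t mmap_length)
{
    struct mb_mmap_entry *best = 0;
    uint8_t *p   = (uint8_t *)mmap_addr;
    uint8_t *end = p + mmap_length;

    while (p < end) {
        struct mb_mmap_entry *e = (struct mb_mmap_entry *)p;
        if (e->type == MMAP_TYPE_AVAILABLE && e->length > HIDDEN_BYTES &&
            (!best || e->base_addr > best->base_addr))
            best = e;
        p += e->size + sizeof e->size;        /* 'size' excludes itself */
    }
    if (!best)
        return 0;

    best->length -= HIDDEN_BYTES;             /* guest now sees 100MB less      */
    return best->base_addr + best->length;    /* hypervisor-private region base */
}
```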
There are a lot more details and things you are likely to encounter (e.g., x86 processors begin in 16-bit mode when first powered-up), but if you do a little homework on the ones listed here, then hopefully you will be in a better position to ask follow-up questions.

Can applications running in ring0 be secure without formal verification?

How can one ensure the security of a program that runs in ring 0 without formal verification? Could a VM be used without distinguishing user space from kernel space?
The question is slightly confusing, but I'll do my best to answer.
Running any untrusted code in a privileged mode is unlikely to be "secure" in the sense that most people understand it. As you correctly surmise, however, it is possible to use something akin to a virtual machine in order to moderate the actions which an untrusted process can take within that environment. This is the principle upon which modern "hypervisors" operate - access to the hardware (or memory) is moderated by some piece of "monitor" software or hardware.
That said, if you are taking that approach, it's likely to be the case that formal verification of the virtual machine is highly desirable. Otherwise it seems possible that a maliciously constructed program could find a way to escape from the virtual machine, or make the virtual machine behave in undesirable ways.
A reasonable modern approach to this problem is to use proof carrying code, in which a piece of untrusted code carries with it a machine-checkable proof that it behaves according to some security policy. All the host operating system needs to do at that point is to check the proof against the code (a reasonably computationally cheap operation), and then it is safe to run that code without needing to virtualise it or do any runtime checking.

Why does syscall need to switch into kernel mode?

I'm studying for my operating systems final and was wondering if someone could tell me why the OS needs to switch into kernel mode for syscalls?
A syscall is used specifically to run an operation in kernel mode, since ordinary user code is not allowed to do this for security reasons.
For example, if you wanted to allocate memory, the operating system is privileged to do it (since it knows the page tables and is allowed to access the memory of other processes), but you as a user program should not be allowed to peek at or ruin the memory of other processes.
It's a way of sandboxing you. So you send a syscall requesting the operating system to allocate memory, and that happens at the kernel level.
Edit: I see now that the Wikipedia article is surprisingly useful on this
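To make the memory-allocation example concrete on Linux, the short program below requests pages from the kernel with the mmap system call; the allocation and page-table updates all happen in kernel mode, and the process only gets back a pointer it is allowed to use:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20;  /* ask the kernel for 1 MiB */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(p, 0, len);     /* touching it triggers page faults, also handled in kernel mode */
    printf("kernel mapped 1 MiB at %p\n", p);
    munmap(p, len);
    return 0;
}
```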
Since this is tagged "homework", I won't just give the answer away but will provide a hint:
The kernel is responsible for accessing the hardware of the computer and ensuring that applications don't step on one another. What would happen if any application could access a hardware device (say, the hard drive) without the cooperation of the kernel?