I am trying to write a very thin hypervisor that would have the following restrictions:
runs only one operating system at a time (ie. no OS concurrency, no hardware sharing, no way to switch to another OS)
it should be able only to isolate some portions of RAM (do some memory translations behind the OS - let's say I have 6GB of RAM, I want Linux / Win not to use the first 100MB, see just 5.9MB and use them without knowing what's behind)
I searched the Internet, but found close to nothing on this specific matter, as I want to keep as little overhead as possible (the current hypervisor implementations don't fit my needs).
What you are looking for already exists, in hardware!
It's called IOMMU[1]. Basically, like page tables, adding a translation layer between the executed instructions and the actual physical hardware.
AMD calls it IOMMU[2], Intel calls it VT-d (please google:"intel vt-d" I cannot post more than two links yet).
[1] http://en.wikipedia.org/wiki/IOMMU
[2] http://developer.amd.com/documentation/articles/pages/892006101.aspx
Here are a few suggestions / hints, which are necessarily somewhat incomplete, as developing a from-scratch hypervisor is an involved task.
Make your hypervisor "multiboot-compliant" at first. This will enable it to reside as a typical entry in a bootloader configuration file, e.g., /boot/grub/menu.lst or /boot/grub/grub.cfg.
You want to set aside your 100MB at the top of memory, e.g., from 5.9GB up to 6GB. Since you mentioned Windows I'm assuming you're interested in the x86 architecture. The long history of x86 means that the first few megabytes are filled with all kinds of legacy device complexities. There is plenty of material on the web about the "hole" between 640K and 1MB (plenty of information on the web detailing this). Older ISA devices (many of which still survive in modern systems in "Super I/O chips") are restricted to performing DMA to the first 16 MB of physical memory. If you try to get in between Windows or Linux and its relationship with these first few MB of RAM, you will have a lot more complexity to wrestle with. Save that for later, once you've got something that boots.
As physical addresses approach 4GB (2^32, hence the physical memory limit on a basic 32-bit architecture), things get complex again, as many devices are memory-mapped into this region. For example (referencing the other answer), the IOMMU that Intel provides with its VT-d technology tends to have its configuration registers mapped to physical addresses beginning with 0xfedNNNNN.
This is doubly true for a system with multiple processors. I would suggest you start on a uniprocessor system, disable other processors from within BIOS, or at least manually configure your guest OS not to enable the other processors (e.g., for Linux, include 'nosmp'
on the kernel command line -- e.g., in your /boot/grub/menu.lst).
Next, learn about the "e820" map. Again there is plenty of material on the web, but perhaps the best place to start is to boot up a Linux system and look near the top of the output 'dmesg'. This is how the BIOS communicates to the OS which portions of physical memory space are "reserved" for devices or other platform-specific BIOS/firmware uses (e.g., to emulate a PS/2 keyboard on a system with only USB I/O ports).
One way for your hypervisor to "hide" its 100MB from the guest OS is to add an entry to the system's e820 map. A quick and dirty way to get things started is to use the Linux kernel command line option "mem=" or the Windows boot.ini / bcdedit flag "/maxmem".
There are a lot more details and things you are likely to encounter (e.g., x86 processors begin in 16-bit mode when first powered-up), but if you do a little homework on the ones listed here, then hopefully you will be in a better position to ask follow-up questions.
Related
I'm interested in the details of how operating systems work and perhaps writing my own.
From what I've gathered, the BIOS/UEFI is supposed to handle setting up the hardware, and do things like memory-mapping (or IO ports for) the graphics card and other IO devices like audio and ethernet.
My question is, how does the kernel know how to access and (re)configure these devices when it's passed control from the bootloader? Are there just conventions like 'the graphics card is always memory mapped from X to Y address space'? Are you at the mercy of a hardware manufacturer writing a driver for an operating system which knows how the hardware will be initialized?
That seems wrong, so maybe the kernel code includes instructions which somehow iterate through all the bus-connected devices. But what instructions can accomplish that? Is the PCI(e) controller also a memory-mapped device? How do you begin querying and setting up the system?
My primary reference has been the Intel 64 Architectures Software Developer's Manual, which has excellent documentation on how the CPU works, but doesn't describe how the system is setup.
I never wrote a firmware so I don't really know how that works in general. You probably have some memory detection done like an actual memory iteration that is done and some interrogation of PCI devices that are memory mapped in RAM. You also probably have some information in the Developer's manuals as to how you should get some information about memory and stuff like that.
In the end, the kernel doesn't need to bother about that because the firmware takes care to do all that and to provide temporary drivers before the kernel is set up completely.
The firmware passes information to the kernel using the ACPI tables so that is the convention you are looking for. The UEFI firmware launches the /efi/boot/bootx64.efi EFI app from the hard-disk automatically. It calls the main function of that app often called the bootloader. When you write that application, often with frameworks such as EDK2 or GNU-EFI, you can thus use the temporary drivers to get some information like the location of the RSDP which points to all other ACPI tables.
The ACPI convention specifies a language that is AML which, when your kernel interprets, tells it all about hardware. You thus have all the required information there to load drivers and such.
PCI (which is everything nowadays) is another thing. It works with memory mapped IO but the ACPI tables (the MCFG) is helpful to find the beginning of the configuration space for PCI devices that take the form of memory mapped registers.
As to graphics cards, you probably don't want to start with those. They are complex and, at first, you should probably stick to the framebuffer returned by UEFI and at least write a driver for xHCI which is the PCI host controller responsible to interact with USB including keyboards and mouses.
since we know that operating system works in protected mode
and BIOS works in a Real mode ( 16 bit). so when an interrupt is called from
either Operating System or application program, does CPU switch mode back and forth everytime?
In general; hardware is capable of doing lots of things at once (playing sounds while generating 3D graphics while sending data to network while transferring data to multiple disks while waiting for user input while all CPUs are busy doing actual processing); and BIOS functions are not capable of allowing more than one thing to happen at time (e.g. will waste 100% of CPU time waiting for a hard disk controller to transfer data while the CPU does nothing and while nothing else can use any other BIOS service for anything).
For this reason alone, BIOS services are not usable and not used by any modern OS (except briefly during boot).
Of course it's not the only reason - no IO prioritisation, no support for hot-plug of any kind, no support for power management, no support for system management (e.g. https://en.wikipedia.org/wiki/S.M.A.R.T. ), no support for GPU, no support for any sound card (other than "PC speaker beep"), no support for networking (excluding PXE), no support for IO APICs, etc. Ironically, out of all of the problems, "BIOS services are designed for real mode" is the least important problem because it's the only problem that an OS can work around.
Instead, each OS has native drivers that don't have any of these limitations.
Note: this is partly why it's relatively easy for modern operating systems to support UEFI (where none of the BIOS services exist at all) - for most, they replace the boot loader/boot code and don't need to change much else.
I imagine this would be accomplished by assigning some RAM and L3 cache to one OS, some to another, and having two hard drives and two monitors. I don't know if its possible to do that at all, and if it is, how? A wrapper OS? Are there any functional examples?
I know that most advantages of such a system can be acquired by virtualization, but that is different than what I mean.
Theoretically It is possible to have multiple operating system running on a single machine but on different cores. Like one core will be running windows and other will be running linux distro. Though It is very hard to achieve, because both OS assume that It is the only king of the island, and tries to rule on everything like memory and devices. Eventually without having any exclusive lock, both OS will confuse hardware or itself and crash.
Let's come to the point that How is it even possible theoretically? This is possible through asymmetric multiprocessing (AMP), Before executing operating system A you hide the 2nd core so that OS will assume that there is only one core present on the machine then after OS will setup environment for that core.
Once things are ready this side, You ask operating system B to load up on 2nd core by hiding first core. And yes you need a separate program except boot loader to do all of this work.
Now you have two OS running but what about memory? devices? Yes that's a major concern. One workaround that I could see is to modify kernel of OS A and OS B such that you could properly divide system resources. Like you tell the OS A to use lower 2GB memory to use and assume upper 2GB as not available, thus modify OS B to use upper 2GB memory.
Memory concern is resolved, but than It would little tricky to modify every device driver to do that.
I guess this is the only reason for not doing such kind of experiment. It isn't worthy at all.
Outside of virtualization, it would really not be possible to do this on any current processors.
When the processor receives an interrupt, what operating system handles it?
In order to be executed by the cpu, a program must be loaded into RAM. A program is just a sequence of machine instructions (like the x86 instruction set) that a processor can understand (because it physically implements their semantic through logic gates).
I can more or less understand how a local instruction (an instruction executed inside the cpu chipset) such as 'ADD R1, R2, R3' works. Even how the cpu interfaces with the ram through the northbridge chipset using the data bus and the address bus is clear enough to me.
What I am struggling with is the big picture.
For example how can a file be saved into an hard disk?
Let's say that the motherboard uses a SATA interface to communicate with the HDD.
Does this mean that this SATA interface has an instruction set which can be used by the cpu by preparing SATA instructions written in the correct format?
Does the same apply with the PCI interface, the AGP interface and so on?
Is all the hardware communication basically accomplished through determining a stardard interface for some task and implementing it (by the companies that create hardware chipsets) with an instruction set that any other hardware components can query?
Is my high level understanding of hardware and software interaction correct?
Nearly. It's actually more general than an instruction.
Alot of these details are architecture specific, so I will stick to a high level general overview of how this can be done.
The CPU can read and write to RAM without a problem correct? You can issue instructions that read and write to any memory address. So rather than try to extend the CPU to understand every possible hardware interface out there, hardware manufacturers simply map sections of the address space (where RAM normally would be) to hardware.
Say for example you want to save a file to a hard drive. This is a possible sequence of command that would occur:
The command register of the hard drive controller is address 0xF00, an address that is outside of RAM but accessible to the CPU
Write the instruction to the command register that indicates we want to write to the hard drive.
There could be conceivably an address register at 0xF01 that tells the hard drive controller where to save the data
Tell the hard drive controller that the data I want to write is at some address in RAM, and initiate the write sequence.
There are a myriad of other ways this can be conceivably be done, but the important thing to note is that it is simply using the instructions that the CPU already has for using RAM.
All of this can be done by the CPU without any special instructions on the CPU side, just read and write to an address. You can imagine this being extended, there is a special place in the address space for the USB controller that contains a list of USB devices, there is a special place for the PCI device list, each PCI devices has several registers that can be read and written to instruct them to do things.
Essentially the role of a device driver is to know how these special registers are to be read and written, what kind of commands devices can accept, etc. Often, as is the case with many graphics cards, what these registers do is not documented to the public and so we rely on their drivers to run the cards correctly.
Could you please explain what the NX flag is and how it works (please be technical)?
It marks a memory page non-executable in the virtual memory system and in the TLB (a structure used by the CPU for resolving virtual memory mappings). If any program code is going to be executed from such page, the CPU will fault and transfer control to the operating system for error handling.
Programs normally have their binary code and static data in a read-only memory section and if they ever try to write there, the CPU will fault and then the operating-system normally kills the application (this is known as segmentation fault or access violation).
For security reasons, the read/write data memory of a program is usually NX-protected by default. This prevents an attacker from supplying some application his malicious code as data, making the application write that to its data area and then having that code executed somehow, usually by a buffer overflow/underflow vulnerability in the application, overwriting the return address of a function in stack with the location of the malicious code in the data area.
Some legitimate applications (most notably high-performance emulators and JIT compilers) also need to execute their data, as they compile the code at runtime, but they specifically allocate memory with no NX flag set for that.
From Wikipedia
The NX bit, which stands for No
eXecute, is a technology used in CPUs
to segregate areas of memory for use
by either storage of processor
instructions (or code) or for storage
of data, a feature normally only found
in Harvard architecture processors.
However, the NX bit is being
increasingly used in conventional von
Neumann architecture processors, for
security reasons.
An operating system with support for
the NX bit may mark certain areas of
memory as non-executable. The
processor will then refuse to execute
any code residing in these areas of
memory. The general technique, known
as executable space protection, is
used to prevent certain types of
malicious software from taking over
computers by inserting their code into
another program's data storage area
and running their own code from within
this section; this is known as a
buffer overflow attack.
Have a look at this 'DEP' found on wikipedia which uses the NX bit. As for supplying the technical answer, sorry, I do not know enough about this but to quote:
Data Execution Prevention (DEP) is a security feature included in modern
Microsoft Windows operating systems that is intended to prevent an
application or service from executing code from a non-executable memory region.
....
DEP was introduced in Windows XP Service Pack 2 and is included in Windows XP
Tablet PC Edition 2005, Windows Server 2003 Service Pack 1 and later, Windows
Vista, and Windows Server 2008, and all newer versions of Windows.
...
Hardware-enforced DEP enables the NX bit on compatible CPUs, through the
automatic use of PAE kernel in 32-bit Windows and the native support on 64-bit
kernels.
Windows Vista DEP works by marking certain parts of memory as being intended to
hold only data, which the NX or XD bit enabled processor then understands as
non-executable.
This helps prevent buffer overflow attacks from succeeding. In Windows Vista,
the DEP status for a process, that is, whether DEP is enabled or disabled for a
particular process can be viewed on the Processes tab in the Windows Task
Manager.
See also here from the MSDN's knowledge base about DEP. There is a very detailed explanation here on how this works.
Hope this helps,
Best regards,
Tom.