Can we run a kernel as an application, if we can load the kernel program into the address space?

It sounds stupid, but is it possible to load any kernel program as an application on an already running OS (not as a virtual machine)?
Like, if we load the program into the process address space and run it.

[..] is it possible to load any kernel program as an application on already running OS [..]
No. A kernel usually contains code to manage system resources, but the host kernel is already managing these. So this either leads to catastrophic failure or - because we have memory protection and privilege levels and such - to access violations.
As a small example: probably all kernels need to configure the interrupt handling of the underlying hardware (to get a timer tick, for example).
On x86, this is done by creating an interrupt descriptor table and loading the address of that table using the lidt instruction. When issued in an application process (which the host kernel will have running in ring 3, the least privileged level), the processor will refuse to execute that instruction, because it may only be issued in ring 0, and will instead generate a general protection fault. The host kernel will be called to handle that situation (because when that kernel started, it registered an interrupt descriptor table for exactly that purpose). The only way for the host kernel to react to that situation is to abort the process that caused the access violation, because otherwise it would risk system stability and integrity.
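As a small illustration, here is a minimal sketch (assuming x86-64 Linux and GCC inline assembly) of a user-space program attempting exactly that lidt instruction; the general protection fault is delivered to the process as SIGSEGV:

    /* Sketch: execute the privileged lidt instruction from ring 3.
     * The CPU raises a general protection fault, and the host kernel
     * kills the process with SIGSEGV instead of letting it reprogram
     * the interrupt descriptor table. */
    #include <stdint.h>
    #include <stdio.h>

    struct idt_descriptor {
        uint16_t limit;
        uint64_t base;
    } __attribute__((packed));

    int main(void)
    {
        struct idt_descriptor idtr = { 0, 0 };
        printf("about to execute lidt in user mode...\n");
        __asm__ volatile ("lidt %0" : : "m"(idtr));  /* faults here */
        printf("never reached\n");
        return 0;
    }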
Similar issues arise when dealing with segmentation, paging, accessing memory mapped devices and access to peripherals in general.
That said, it is possible to create a kernel that can be run as a user-space process. An example that I personally have worked with is RODOS, which can be run as a Linux process. To make this possible it is necessary to split the hardware-dependent parts (which are a large portion) from the hardware-independent code (like scheduling, interprocess communication, ...) and provide stubs that reuse functionality of the host operating system to simulate some sort of hardware. (Of course, such a prepared kernel can only run on the host system if it is compiled for that use. You cannot use the same binary as you would use on raw hardware.)
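To give a flavour of what such a stub can look like (an illustrative sketch, not RODOS code): a user-space kernel cannot program a real timer interrupt, so it can reuse a host facility such as a POSIX interval timer, with the signal handler playing the role of the timer interrupt service routine:

    /* Illustrative stub: simulate a hardware timer tick with SIGALRM.
     * A real port would hook this into the guest kernel's scheduler. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    static volatile sig_atomic_t ticks;

    static void timer_tick(int sig)      /* stands in for the timer ISR */
    {
        (void)sig;
        ticks++;                         /* a real kernel would run its scheduler here */
    }

    int main(void)
    {
        struct sigaction sa = { 0 };
        sa.sa_handler = timer_tick;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGALRM, &sa, NULL);

        struct itimerval tv = { { 0, 10000 }, { 0, 10000 } };  /* 10 ms period */
        setitimer(ITIMER_REAL, &tv, NULL);

        while (ticks < 100)              /* the guest kernel's "idle loop" */
            pause();
        printf("handled %d simulated timer interrupts\n", (int)ticks);
        return 0;
    }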

Related

What makes the firecracker microvm "micro" vs something like qemu?

From https://firecracker-microvm.github.io/:
Firecracker is an alternative to QEMU that is purpose-built for running serverless functions and containers safely and efficiently, and nothing more. Firecracker is written in Rust, provides a minimal required device model to the guest operating system while excluding non-essential functionality (only 5 emulated devices are available: virtio-net, virtio-block, virtio-vsock, serial console, and a minimal keyboard controller used only to stop the microVM). This, along with a streamlined kernel loading process enables a < 125 ms startup time and a < 5 MiB memory footprint. The Firecracker process also provides a RESTful control API, handles resource rate limiting for microVMs, and provides a microVM metadata service to enable the sharing of configuration data between the host and guest.
So what is the main thing that makes qemu slower—primarily the device emulation?
And that startup time of 125ms + 5MB is in contrast to...what?
Yes, Firecracker boots faster and is lighter than QEMU; the numbers vary (from little to 10x) with the kernel used and the options (drivers, devices) given.
There is an older paper on that here: https://dreadl0ck.net/papers/Firebench.pdf – which finds firecracker faster but not impressively so:
In our experiments the mean kernel boot time of Firecracker microVM is 800ms in the sequential experiments, and 1000ms in the concurrent scenario. QEMU boots the Linux kernel 18% slower on average. […] It is important to note that the network stack setup during [boot] takes additional time, without initialising the network stack the machine is able to boot in 150ms-200ms. The reduced boot time of Firecracker can be explained by the fact that Firecracker only emulates five devices: virtio-net, virtio-block, virtio-vsock, serial console, and a minimal keyboard controller used only to stop the microVM.
But I would evaluate this from another perspective: Firecracker is purposefully minimal, to present fewer possibilities for configuration mishaps and, importantly, a minimal attack surface (it is usually used to run untrusted workloads). Also, full control via the REST API makes it easy to orchestrate.

How do a bare metal hypervisor and the operating system it hosts coordinate on system calls?

I have read a great deal about bare metal hypervisors, but never quite get the way they interact with an OS they are hosting.
Suppose you have Unix itself on bare metal. When in user mode, you can't touch or affect the OS internals. You get things done by a system call that gets trapped, sets the machine to kernel mode, then does the job for you. For example, in C you might malloc() a bunch, then eventually run out of the initially allocated memory. If memory serves me right, malloc - when it knows it is out of memory - must make a system call to what I believe is brk(). Once in kernel mode, your process's page table can be extended, then it returns and malloc() has the required extra memory (or something like that).
But if you have Unix on top of a bare metal hypervisor, how does this actually happen? The hypervisor, it would seem, must have the actual page tables for the whole system (across OSs, even). So Unix can't be in kernel mode when a system call to Unix gets made, otherwise it could mess with other OSs running at the same time. On the other hand, if it is running in user mode, how would the code that implements brk() ever let the hypervisor know it wants more memory without the Unix code being rewritten?
In most architectures another level is added beyond supervisor, and supervisor is somewhat degraded. The kernel believes itself to be in control of the machine, but that is an illusion crafted by the hypervisor.
In ARM, user mode is 0, system is 1, hypervisor is 2. Intel were a bit short-sighted (gasp) and had user as 3, supervisor as 0, thus hypervisor is a sort of -1. Obviously it's not -1, but that is a handy shorthand for the intensely ugly interface they constructed to handle this.
In most architectures, the hypervisor gets to install an extra set of page tables which take effect after the guest's page tables do. So your unix kernel thinks it was loaded at 1M physical but could be at any arbitrary address, and every range your unix kernel thinks is contiguous could, at page granularity, be scattered over a vast set of actual (bus) addresses.
Even if your architecture doesn't permit an extra level of page tables, it is straightforward enough for a hypervisor to "trap & emulate" the page tables constructed by the guest, and maintain the actual set in a completely transparent fashion. The continual motion towards longer pipelines, however, increases the cost of each trap, so an extra level of page tables is much appreciated.
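A toy model of the two translation stages (guest page tables, then the hypervisor's nested tables, e.g. Intel EPT or ARM Stage-2) might look like this; flat arrays stand in for the real multi-level table walks, purely to show the ordering:

    /* Toy model of two-stage address translation: the guest's tables map
     * guest-virtual to guest-"physical", and the hypervisor's nested
     * tables map guest-physical to host-physical. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define NPAGES     16

    static uint64_t guest_pt[NPAGES];   /* guest-virtual page  -> guest-physical page */
    static uint64_t nested_pt[NPAGES];  /* guest-physical page -> host-physical page  */

    static uint64_t translate(uint64_t gva)
    {
        uint64_t off = gva & ((1 << PAGE_SHIFT) - 1);
        uint64_t gpa = (guest_pt[gva >> PAGE_SHIFT] << PAGE_SHIFT) | off;   /* stage 1 */
        uint64_t hpa = (nested_pt[gpa >> PAGE_SHIFT] << PAGE_SHIFT) | off;  /* stage 2 */
        return hpa;
    }

    int main(void)
    {
        guest_pt[1] = 2;    /* the guest thinks virtual page 1 is at "physical" page 2 */
        nested_pt[2] = 9;   /* the hypervisor actually placed that page at host page 9 */
        printf("gva 0x1234 -> hpa 0x%llx\n",
               (unsigned long long)translate(0x1234));
        return 0;
    }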
So, your UNIX thinks it has all 8M of memory to itself; however, unbeknownst to it, a sneaky hypervisor may be paging that 8M to a really big floppy drive and only giving it a paltry 640K of real RAM. All the normal unix-y stuff works fine, except that it may have a pretty trippy sense of time, where time slows down and speeds up in alternating phases, as the hypervisor attempts to pretend that a 250 msec floppy disk access completed in the time of a 60 nsec DRAM access.
This is where hypervisors get hard.

Location of OS Kernel Data

I'm a beginner with operating systems, and I had a question about the OS Kernel.
I'm used to the standard notion of each user process having a virtual address space of stack, heap, data, and code. My question is that when a context switch occurs to the OS Kernel, is the code run in the kernel treated as a process with a stack, heap, data, and code?
I know there is a dedicated kernel stack, which the user program can't access. Is this located in the user program address space?
I know the OS needs to maintain some data structures in order to do its job, like the process control block. Where are these data structures located? Are they in user-program address spaces? Are they in some dedicated segment of memory for kernel data structures? Are they scattered all around physical memory wherever there is space?
Finally, I've seen some diagrams where OS code is located in the top portion of a user program's address space. Is the entire OS kernel located here? If not, where else does the OS kernel's code reside?
Thanks for your help!
Yes, the kernel has its own stack, heap, data structures, and code separate from those of each user process.
The code running in the kernel isn't treated as a "process" per se. The code is privileged meaning that it can modify any data in the kernel, set privileged bits in processor registers, send interrupts, interact with devices, execute privileged instructions, etc. It's not restricted like the code in a user process.
All of kernel memory and user process memory is stored in physical memory in the computer (or perhaps on disk if data has been swapped from memory).
The key to answering the rest of your questions is to understand the difference between physical memory and virtual memory. Remember that if you use a virtual memory address to access data, that virtual address is translated to a physical address before the data is fetched at the determined physical address.
Each process has its own virtual address space. This means that some virtual address a in one process can map to a different physical address than the same virtual address a in another process. Virtual memory has many important uses, but I'm not going to go into them here. The important point is that virtual memory enforces memory isolation. This means that process A cannot access the memory of process B. All of process A's virtual addresses map to some set of physical addresses and all of process B's virtual addresses map to a different set of physical addresses. As long as the two sets of physical addresses do not overlap, the processes cannot see or modify the memory of each other. User processes cannot access physical memory addresses directly - they can only make memory accesses with virtual addresses.
There are times when two processes may have some virtual addresses that do map to the same physical addresses, such as if they both mmap the same file, both use a shared library, etc.
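A small demonstration of that shared-mapping case (POSIX C, illustrative): after fork() the parent and child have separate address spaces, yet a MAP_SHARED mapping refers to the same physical page, so a write by the child is visible to the parent:

    /* Parent and child have distinct address spaces, but the MAP_SHARED
     * anonymous page below is the same physical memory in both. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED)
            return 1;

        if (fork() == 0) {                       /* child */
            strcpy(shared, "written by the child");
            _exit(0);
        }
        wait(NULL);                              /* parent */
        printf("parent reads: %s\n", shared);    /* sees the child's write */
        return 0;
    }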
So now to answer your question about kernel address spaces and user address spaces.
The kernel can have a separate virtual address space from each user process. This is as simple as changing the page directory pointer in the cr3 register (in an x86 processor) on each context switch. Since the kernel has a different virtual address space, no user process can access kernel memory as long as none of the kernel's virtual memory addresses map to the same physical addresses as any of the virtual addresses in any address space for a user process.
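As a sketch (x86-64, GCC inline assembly assumed), the address-space switch itself is just a privileged register write; executed in user mode it would simply fault:

    /* Kernel-side sketch: point cr3 at a different top-level page table.
     * Only ring-0 code may execute this. */
    static inline void load_address_space(unsigned long pgd_phys)
    {
        __asm__ volatile ("mov %0, %%cr3" : : "r"(pgd_phys) : "memory");
    }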
This can lead to a minor problem. If a user process makes a system call and passes a pointer as a parameter (e.g. a pointer to a buffer in the read system call), how does the kernel know which physical address corresponds to that buffer? The virtual address in the pointer maps to a different physical address in kernel space, so the kernel cannot just dereference the pointer. There are two options:
Option 1: The kernel can traverse the user process's page directory/tables to find the physical address that corresponds to the buffer. The kernel can then read/write from/to that physical address.
Option 2: The kernel can instead include all of its mappings in the user address space (at the top of the user address space, as you mentioned). Now, when the kernel receives a pointer through the system call, it can just access the pointer directly, since it is sharing the address space with the process.
Kernels generally go with the second option, since it's more convenient and more efficient. Option 1 is less efficient because each time a context switch occurs, the address space changes, so the TLB needs to be flushed and you lose all of your cached mappings. I'm simplifying things a bit here, since kernels have started doing things differently after the Meltdown vulnerability was discovered.
This leads to another problem. If the kernel includes its mappings in the user process address space, what stops the user process from accessing kernel memory? The kernel sets protection bits in the page table that cause the processor to prohibit the user process from accessing the virtual addresses that map to physical addresses that contain kernel memory.
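Concretely, on x86 this is the User/Supervisor flag (bit 2) in each page-table entry. A sketch of how such entries differ for a user page and a kernel mapping (the addresses are made up):

    /* Sketch of x86 page-table-entry flags: kernel mappings leave the
     * User/Supervisor bit clear, so any ring-3 access to them faults even
     * though the mappings are present in the process's address space. */
    #include <stdint.h>
    #include <stdio.h>

    #define PTE_PRESENT  (1ULL << 0)
    #define PTE_WRITABLE (1ULL << 1)
    #define PTE_USER     (1ULL << 2)   /* 0 = supervisor only, 1 = user accessible */

    static uint64_t make_pte(uint64_t phys, int user_ok)
    {
        uint64_t pte = (phys & ~0xFFFULL) | PTE_PRESENT | PTE_WRITABLE;
        if (user_ok)
            pte |= PTE_USER;
        return pte;
    }

    int main(void)
    {
        uint64_t user_page   = make_pte(0x00400000, 1);  /* process data page     */
        uint64_t kernel_page = make_pte(0xFFE00000, 0);  /* kernel mapping at top */
        printf("user PTE:   0x%016llx\n", (unsigned long long)user_page);
        printf("kernel PTE: 0x%016llx\n", (unsigned long long)kernel_page);
        return 0;
    }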
Take a look at these slides for more information.
I'm used to the standard notion of each user process having a virtual address space of stack, heap, data, and code. My question is that when a context switch occurs to the OS Kernel, is the code run in the kernel treated as a process with a stack, heap, data, and code?
On every modern operating system I am aware of, there is NEVER a context switch to the kernel. The kernel executes in the context of a process (some systems use the fiction of a reduced process context).
The "kernel" executes when a process enters kernel mode through an exception or an interrupt.
Each process (thread) normally has its own kernel-mode stack, used after an exception. Usually there is a single interrupt stack for each processor.
https://books.google.com/books?id=FSX5qUthRL8C&pg=PA322&lpg=PA322&dq=vax+%22interrupt+stack%22&source=bl&ots=CIaxuaGXWY&sig=S-YsXBR5_kY7hYb6F2pLGjn5pn4&hl=en&sa=X&ved=2ahUKEwjrgvyX997fAhXhdd8KHdT7B8sQ6AEwCHoECAEQAQ#v=onepage&q=vax%20%22interrupt%20stack%22&f=false
I know there is a dedicated kernel stack, which the user program can't access. Is this located in the user program address space?
Each process has its own kernel stack. It is often in the user space with protected memory but could be in the system space. The interrupt stack is always in the system space.
Where are these data structures located? Are they in user-program address spaces?
They are generally in the system space. However, some systems do put some structures in the user space in protected memory.
Are they in some dedicated segment of memory for kernel data structures?
If they are in the user space, they are generally for an access mode more privileged than user mode and less privileged than kernel mode.
Are they scattered all around physical memory wherever there is space?
Things can be spread over physical memory pretty much at random.
The data structures in question are usually regular C structures situated in the RAM allotted to the kernel by the kernel allocator.
They are not usually accessible from regular processes because of the normal mechanisms for memory protection and paging (virtual memory).
A kind of exception to this are kernel threads, which have no userspace address space, so the code they execute is always kernel code working with kernel-space data structures, hence with the isolated kernel memory.
Now for the interesting part: 64-bit Linux uses a thing called the Direct Map for memory organization, which means that the full amount of physical memory available is mapped in the kernel page tables as one contiguous chunk. This is not true for 32-bit, where HIGHMEM was used to work around the limitation of 4GB address spaces.
Since the kernel has all the physical RAM visible and available to its own allocator, the kernel data structures in question can be situated pretty randomly with respect to physical addresses.
You can google these terms to gain additional information:
PTI (page table isolation)
__copy_from_user (esp. on esoteric architectures where this function is not just a bitwise copy; see the sketch after this list)
EPT (Intel nested paging in virtual machines)
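For the second item, a hedged sketch of the copy_from_user pattern (a fragment of hypothetical driver code, not a complete module): kernel code never dereferences a user pointer directly, it asks the helper to validate and copy the bytes across the user/kernel boundary:

    /* Hypothetical driver fragment: copy a request structure from user
     * space before touching it. copy_from_user() returns the number of
     * bytes it could NOT copy; non-zero means the user pointer was bad. */
    #include <linux/uaccess.h>
    #include <linux/errno.h>

    struct demo_request {         /* hypothetical layout agreed with userspace */
        int command;
        int argument;
    };

    long demo_handle_request(const void __user *user_ptr)
    {
        struct demo_request req;

        if (copy_from_user(&req, user_ptr, sizeof(req)))
            return -EFAULT;       /* bad pointer: fail cleanly, don't crash */

        /* from here on, only the kernel-side copy 'req' is used */
        return req.command == 0 ? 0 : -EINVAL;
    }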

minimal number of privileged instructions?

Say we want to write an OS with a minimal number of privileged instructions.
I think it should be 1, only the MMU register. But what about other things, i.e. the mode bit, traps?
Well you can implement an operating system with everything in system mode, and you could argue that there are no "privileged" instructions.
As to whether you could implement an OS with privileged and non-privileged modes using N different privileged instructions:
it would depend on the functionality you aimed to implement,
it would depend on the hardware instruction set, MMU design, etcetera, and
unless you were prepared to put months / years into a theoretical analysis, it would be a matter of debate / opinion as to whether your proposed answer was indeed correct.
An operating system needs to provide security (including memory isolation between different programs) and abstraction (each program doesn't need to care how much physical memory is available).
To maintain these, you need at least one privileged instruction.
That privileged instruction is the one that sets up the Memory Management Unit registers, so that you can ensure memory is protected. There should be no I/O instructions; all I/O and interrupt access should be memory mapped.
Use the MMU to ensure that kernel memory, kernel code, the interrupt-access memory and the devices' memory-mapped I/O interfaces are not mapped into user space, so user processes cannot access them. All of these live in kernel memory.
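To illustrate the memory-mapped I/O point: with every device register mapped into the address space, a driver is just loads and stores, so no separate privileged I/O instructions are needed, and the MMU mapping alone decides who can reach the device. The register address below is a made-up placeholder, not a real board's address:

    /* Sketch: a UART driver as plain stores to a memory-mapped register.
     * Protection comes entirely from the page tables, not from the
     * instructions used. */
    #include <stdint.h>

    #define UART_TX  ((volatile uint32_t *)0x10000000u)  /* hypothetical device register */

    static void uart_putc(char c)
    {
        *UART_TX = (uint32_t)c;   /* an ordinary store */
    }

    void uart_puts(const char *s)
    {
        while (*s)
            uart_putc(*s++);
    }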

How exactly does the OS protect the kernel

My question is how exactly an operating system protects its kernel part.
From what I've found there are basically two modes, kernel and user. And there should be some bits in memory segments which tell if a memory segment is a kernel or a user-space segment. But where is the origin of those bits? Is there some "switch" in the compiler that marks programs as kernel programs? And, for example, if a driver is in kernel mode, how does the OS manage its integration into the system so that no malicious software is added as a driver?
If someone could enlighten me on this issue, I would be very grateful, thank you
The normal technique is to use a feature of the virtual memory manager present in most modern CPUs.
The way that piece of hardware works is that it keeps a list of fragments of memory in a cache, and a list of the addresses to which they correspond. When a program tries to read some memory that is not present in that cache, the MMU doesn't just go and fetch the memory from main RAM, because the addresses in the cache are only 'logical' addresses. Instead, it invokes another program that will interpret the address and fetch that memory from wherever it should be.
That program, called a pager, is supplied by the kernel, and special flags in the MMU prevent that program from being overridden.
If that program determines that the address corresponds to memory the process should get to use, it supplies the MMU with the physical address in main memory that corresponds to the logical address the user program asked for, the MMU fetches it into its cache, and resumes running the user program.
If that address is a 'special' address, like for a memory mapped file, then the kernel fetches the corresponding part of the file into the cache and lets the program run along with that.
If the address is in the range that belongs to the kernel, or the program hasn't allocated that address to itself yet, the pager raises a SEGFAULT, killing the program.
Because the addresses are logical addresses, not physical addresses, different user programs may use the same logical addresses to mean different physical addresses; the kernel's pager program and the MMU make this all transparent and automatic.
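A rough sketch of the decision the pager makes, in the terms used above (the structures and helpers are illustrative stand-ins, not taken from any real kernel):

    /* Toy pager: decide whether a fault is demand paging, a file-backed
     * fetch, or a protection violation that kills the program. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct region {                 /* one mapped range of a process's address space */
        unsigned long start, end;
        bool kernel_only;           /* belongs to the kernel part of the map         */
        bool file_backed;           /* memory-mapped file vs. ordinary anonymous RAM */
    };

    static void fetch_page_from_file(unsigned long addr) { printf("read page for %#lx from file\n", addr); }
    static void map_fresh_frame(unsigned long addr)      { printf("map zeroed frame at %#lx\n", addr); }

    /* Returns true if the fault was resolved, false if the process gets a SEGFAULT. */
    static bool handle_fault(const struct region *r, unsigned long addr, bool from_user)
    {
        if (r == NULL || addr < r->start || addr >= r->end)
            return false;               /* address was never allocated */
        if (from_user && r->kernel_only)
            return false;               /* user code touched a kernel mapping */
        if (r->file_backed)
            fetch_page_from_file(addr); /* 'special' address: bring file data in */
        else
            map_fresh_frame(addr);      /* normal demand paging */
        return true;
    }

    int main(void)
    {
        struct region heap = { 0x600000, 0x700000, false, false };
        printf("fault at 0x601000 resolved: %d\n", handle_fault(&heap, 0x601000, true));
        printf("fault at 0x900000 resolved: %d\n", handle_fault(&heap, 0x900000, true));
        return 0;
    }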
This level of protection is not available on older CPUs (like the 80286) and some very low-power devices (like ARM Cortex-M3 or ATtiny CPUs), because there is no MMU; all addresses on these systems are physical addresses, with a 1-to-1 correspondence between RAM and address space.
The “switch” is actually in the processor itself. Some instructions are only available in kernel mode (a.k.a. ring 0 on i386). Switching from kernel mode to user mode is easy. However, there are not so many ways to switch back to kernel mode. You can either:
send an interrupt to the processor
make a system call.
In either case, the operation has the side effect of transferring the control to some trusted, kernel code.
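A minimal sketch of the second route (x86-64 Linux assumed): the program places the system-call number and arguments in registers and executes the syscall instruction, which is precisely the controlled transfer into trusted kernel code described above. Here it invokes write(2), system call number 1 on x86-64:

    /* Raw system call: the syscall instruction switches to kernel mode and
     * jumps to the kernel's system-call entry point; userspace only chooses
     * the call number and arguments. */
    int main(void)
    {
        static const char msg[] = "hello from a raw system call\n";
        long ret;
        __asm__ volatile (
            "syscall"
            : "=a"(ret)
            : "a"(1L),                      /* __NR_write         */
              "D"(1L),                      /* fd = stdout        */
              "S"(msg),                     /* buffer             */
              "d"((long)(sizeof msg - 1))   /* length             */
            : "rcx", "r11", "memory");      /* clobbered by syscall */
        return ret < 0;
    }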
When a computer boots up, it starts running code from some well known location. That code ultimately ends up loading some OS kernel to memory and passing control to it. The OS kernel then sets up the CPU memory map via some CPU specific method.
And for example if driver is in kernel mode how does OS manages its integration to system so there is not malicious software added as a driver?
It actually depends on the OS architecture. I will give you two examples:
Linux kernel: driver code can be very powerful. The levels of protection are the following:
a) A driver is only allowed to access a limited number of symbols in the kernel, specified using EXPORT_SYMBOL (see the sketch after this list). The exported symbols are generally functions. But nothing prevents a driver from trashing the kernel using wild pointers, and the security provided by EXPORT_SYMBOL is nominal.
b) A driver can only be loaded by a privileged user who has root permission on the box. So as long as root privileges are not breached, the system is safe.
Microkernel like QNX: the operating system exports enough interfaces to user space that a driver can be implemented as a user-space program. Hence the driver at least cannot easily trash the system.
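As a sketch of the EXPORT_SYMBOL mechanism from point a) (a minimal, hypothetical module; the function name is made up): a module can only link against symbols that the kernel or another module has explicitly exported, and anything not exported simply cannot be resolved at load time:

    /* Minimal hypothetical module: demo_get_magic() is exported and thus
     * visible to other modules; unexported symbols would be unresolvable
     * when a module is loaded (by root, via insmod/modprobe). */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kernel.h>

    int demo_get_magic(void)            /* hypothetical helper other modules may call */
    {
        return 42;
    }
    EXPORT_SYMBOL(demo_get_magic);

    static int __init demo_init(void)
    {
        pr_info("demo module loaded, magic=%d\n", demo_get_magic());
        return 0;
    }

    static void __exit demo_exit(void)
    {
        pr_info("demo module unloaded\n");
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");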