What makes the firecracker microvm "micro" vs something like qemu? - virtualization

From https://firecracker-microvm.github.io/:
Firecracker is an alternative to QEMU that is purpose-built for running serverless functions and containers safely and efficiently, and nothing more. Firecracker is written in Rust, provides a minimal required device model to the guest operating system while excluding non-essential functionality (only 5 emulated devices are available: virtio-net, virtio-block, virtio-vsock, serial console, and a minimal keyboard controller used only to stop the microVM). This, along with a streamlined kernel loading process enables a < 125 ms startup time and a < 5 MiB memory footprint. The Firecracker process also provides a RESTful control API, handles resource rate limiting for microVMs, and provides a microVM metadata service to enable the sharing of configuration data between the host and guest.
So what is the main thing that makes qemu slower—primarily the device emulation?
And that startup time of 125ms + 5MB is in contrast to...what?

Yes, firecracker boots faster and is lighter than QEMU, the numbers vary (from little to 10x) with the kernel used and options (drivers, devices) given.
There is an older paper on that here: https://dreadl0ck.net/papers/Firebench.pdf – which finds firecracker faster but not impressively so:
In our experiments
the mean kernel boot time of Firecracker microVM is 800ms
in the sequential experiments, and 1000ms in the concurrent
scenario. QEMU boots the Linux kernel 18% slower on
average. […] It is important to note
that the network stack setup during takes additional time,
without initialising the network stack the machine is able to
boot in 150ms-200ms. The reduced boot time of Firecracker
can be explained by the fact that Firecracker only emulates five
devices: virtio-net, virtio-block, virtio-vsock, serial console,
and a minimal keyboard controller used only to stop the
microVM.
But I would evaluate this from another perspective: firecracker is purposefully minimal to present less possibility for configuration mishaps and importantly minimal attack surface (it's usually used to run untrusted workloads). Also full control by ReST-API makes it easy to orchestrate.

Related

System startup of multicore computer

I would really like to know how does a multicore CPU start when the computer starts up. I imagine there is like a "dominant core" that loads the BIOS and later on ther kernel to RAM and wakes up the rest of the cores leaving them waiting for code to run (like an infinite while loop?). But that it's only how I guess it works.
Other question is, after the kernel is loaded on memory all cores can do system calls, right?. And how does one core control the tasks of the other cores? Which instructions are used? (in x86 / x86-64)
Yes there is a boot CPU. The firmware handles that. It's usually CPU 0, but what if that one is missing or defective? Then it gets trickier.
On x86 platforms there's the ACPI tables which describe the CPU and memory layouts. The operating system starts the other CPUs with IPI (inter processor interrupts) which kick them out of idle into the interrupt handlers (which were set in memory) and then into operating system functions. Which then choose threads to run and start doing useful things.
If you really want to know how it all works read the source code for Linux or one of the BSDs.
Update: Looks like I was wrong about IPI. It is using interrupts but not the normal IPI ones. The Linux SMP boot is here: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/smpboot.c
It seems to use NMI or sets the CPU reset.

Mongodb in Docker: numactl --interleave=all explanation

I'm trying to create Dockerfile for in-memory MongoDB based on official repo at https://hub.docker.com/_/mongo/.
In dockerfile-entrypoint.sh I've encountered:
numa='numactl --interleave=all'
if $numa true &> /dev/null; then
set -- $numa "$#"
fi
Basically it prepends numactl --interleave=all to original docker command, when numactl exists.
But I don't really understand this NUMA policy thing. Can you please explain what NUMA really means, and what --interleave=all stands for?
And why do we need to use it to create MongoDB instance?
The man page mentions:
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
This isn't available for all architectures, which is why issue 14 made sure to call numa only on numa machine.
As explained in "Set default numa policy to “interleave” system wide":
It seems that most applications that recommend explicit numactl definition either make a libnuma library call or incorporate numactl in a wrapper script.
The interleave=all alleviates the kind of issue met by app like cassandra (a distributed database for managing large amounts of structured data across many commodity servers):
By default, Linux attempts to be smart about memory allocations such that data is close to the NUMA node on which it runs. For big database type of applications, this is not the best thing to do if the priority is to avoid disk I/O. In particular with Cassandra, we're heavily multi-threaded anyway and there is no particular reason to believe that one NUMA node is "better" than another.
Consequences of allocating unevenly among NUMA nodes can include excessive page cache eviction when the kernel tries to allocate memory - such as when restarting the JVM.
For more, see "The MySQL “swap insanity” problem and the effects of the NUMA architecture"
Without numa
In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward.
The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node.
With Numa:
An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:
# numactl --interleave all command
I mentioned in the comments that numa enumerates the hardware to understand the physical layout. And then divides the processors (not cores) into “nodes”.
With modern PC processors, this means one node per physical processor, regardless of the number of cores present.
That is bit of an over-simplification, as Hristo Iliev points out:
AMD Opteron CPUs with larger number of cores are actually 2-way NUMA systems on their own with two HT (HyperTransport)-interconnected dies with own memory controllers in a single physical package.
Also, Intel Haswell-EP CPUs with 10 or more cores come with two cache-coherent ring networks and two memory controllers and can be operated in a cluster-on-die mode, which presents itself as a two-way NUMA system.
It is wiser to say that a NUMA node is some cores that can reach some memory directly without going through a HT, QPI (QuickPath_Interconnect), NUMAlink or some other interconnect.

How are applications and data accessed by the CPU from RAM

I am having a bit of trouble understanding how applications and data are accessed by the CPU from RAM after the application has been loaded into RAM and a file opened (thus data for the file also stored in RAM).
By my understanding, a CPU just gets instructions from RAM as the program counter ticks or carries out tasks after an interrupt. How then does it access the application and data. Is it that it doesn't and still just gets instructions (for example to load a file on the hard drive to be opened in the application) and processes any requests made by the application which are stored in RAM as instructions thereafter (like saving a file). Or does the application and data relating to an opened file (for example) just stay in RAM and not get accessed by the CPU at all.
Similarly, after reading an article, it said that a copy of the operating system is stored in RAM. The CPU can then access the operating system. (I thought the CPU just worked with instructions from RAM). How does it then communicate with the operating system and how are interrupts sent to the CPU, from the copy of the OS in RAM or from the OS in the hard drive.
Sorry if this is really confusing, alot i didn't understand.
Root of your question: Lack of clear differentiation between Computer's Hardware and Computer's Software.
Components of a Computer System
Just so that we are clear about both of them and that we understand their nature, let me state as follows:
Hardware: It includes CPU, RAM, Disk, Register, Graphics Card, Network Card, Memory BUS and everything that you can touch and call to be the 'Computer'. It is the body.
Software: It includes Operating System, Program, CPU instruction, Compiler, Programming Language and almost everything intangible about the computer. It is the soul.
Firmware: It is that basic code which is absolutely essential for hardware's working. This is stored on a Read Only Memory installed in the hardware itself. This piece of software is vital for hardware therefore is considered in the mid of hardware and software and hence called Firmware.
We will start with understanding from the time when we say that the computer is up and running and is properly executing our instructions. But at that time you will say - How did I reach here? So I will mention a few points about the startup of the computer.
When the power button is pressed...
...the most primitive and basic input output system (therefore called BIOS), which is hard written on the computer hardware begins execution. This is written on Read Only Memory and this starts the process to get the machine to stand on its own. And it loads the software (Operating System) from one piece of hardware (disks) into another piece of hardware (RAM and CPU registers) enabling the software to work properly with hardware.
Now the body and soul are together and the individual (machine) can work.
Until now, OS is already in RAM and CPU. (Read When the power button is pressed if you doubt it.) Let's handle your question paragraph by paragraph now -
First Paragraph
I am having a bit of trouble understanding how applications and data
are accessed by the CPU from RAM after the application has been loaded
into RAM and a file opened (thus data for the file also stored in
RAM).
The explanation is as follows:
The exact issue here is your thinking that it is CPU and RAM that access the data. CPU and RAM are only executing units.
It is OS (software) that accesses the data by means of CPU and RAM (hardware). It is in the realm of OS where applications are executed.
This is why you can install Linux and Windows on same hardware but cannot execute .exe files in Linux because OS does the execution and not RAM/CPU.
Further, how do CPU and RAM and disk physically interact to bring in the data, execute it, save it back etc. is in the domain of hardware. That would require explanation which involves logic gates (AND, OR, NOT...), diodes, circuitry and a hell lot of other things which an Electronics guy can explain.
Second Paragraph
By my understanding, a CPU just gets instructions from RAM as the
program counter ticks or carries out tasks after an interrupt. How
then does it access the application and data. Is it that it doesn't
and still just gets instructions (for example to load a file on the
hard drive to be opened in the application) and processes any
requests made by the application which are stored in RAM as
instructions thereafter (like saving a file).
As you have guessed it - CPU doesn't get instructions, Operating System does it through CPU. Also, just the way brain doesn't directly instruct the hands and legs to move and instead uses nerves for interaction, the CPU doesn't tell the disks to give/take the data. CPU works with RAM and registers only. Multiple units of hardware work in conjunction to provide a path for data and instruction to travel. The important pieces of involved hardware are:
Processor (CPU and registers built in the CPU)
Cache
Memory (RAM)
Disk
Tape
I like the image provided in this answer. This image not only lists the hardware pieces but also illustrates the mammoth difference in the execution speed of these pieces.
Let's move on to the...
Third Paragraph
Similarly, after reading an article, it said that a copy of the
operating system is stored in RAM. The CPU can then access the
operating system. (I thought the CPU just worked with instructions
from RAM). How does it then communicate with the operating system and
how are interrupts sent to the CPU, from the copy of the OS in RAM or
from the OS in the hard drive.
By now you already know that indeed OS is present in RAM and CPU registers. That is where it lives. That is from where it tells the CPU how to work. If OS would be small enough (or if Registers and Caches would be big enough), the OS would live even closer to CPU.
The CPU does not communicate with the OS. It can't. It is the worker that is controlled by a boss. OS is that boss.
CPU cannot access Operating System. CPU is the body, OS is the soul. Soul tells the body what to do, not vice-versa.
CPU doesn't work with instructions from RAM. It merely executes the instructions given by the Operating System (which may be living in RAM). So even when there is an instruction to load some module of OS into the RAM, it is not RAM/CPU but OS itself that issues that instruction.
Interrupts are of two types - Hardware and Software - and your query is about the software interrupts. Since the executive part of OS is in the RAM, in simple words we can say that interrupts are sent to CPU from OS living in RAM.
Conclusions
The lack of distinction between hardware and software is the basic cause of your confusions. Take some course about Operating Systems on Coursera or Academic Earth for deeper understanding.
It is confusing indeed. Let me try to explain.
CPU and RAM
The CPU is hardwired to the RAM via the 'motherboard', and they work together. The CPU can perform many instructions, but it has to be told what to do by instructions in RAM. The CPU is basically in a loop: all it does it fetch the next instruction from RAM and execute it, over and over.
So how does this RAM get filled with instructions?
BIOS (basic input/output system)
When the computer first boots up, a portion of RAM is filled with data from a chip on the motherboard (the BIOS chip), and the CPU is turned on and starts processing. These are the factory settings.
The data from the BIOS chip that is copied to RAM consists of a library of instructions to access hardware devices (hard disks, CD/ROM, USB storage, network cards etc.),
and a program using that library to load what is called the bootsector, the first sector on the boot device, into RAM, and transfer control to it (with a jump instruction).
BOOTLOADER
The bootsector data that the BIOS program loaded from the boot device is very small - only 440 bytes - but with the help of the BIOS library, this is enough to be able to load more sectors and execute these. The bootsector and the data it loads is called the bootloader, which is in charge of loading the Operating System.
In effect, the bootloader is a more dynamic version of the BIOS: the BIOS program resides in flash memory, whereas the bootloader resides on hard disks, USB sticks, SSD drives etc., and thus can be larger and more complex.
OPERATING SYSTEM
In it's turn, The operating system (OS) is simply a more advanced version of the bootloader, as it can load and run multiple programs from multiple locations at the same time.
--
The BIOS knows about drives.
The Bootloader knows about drives and partitions.
The OS knows about drives, partitions, and file systems.
CPU,as you've noticed, reads the program from RAM, instruction by instruction. When an instruction is executed, it might refer to data stored in memory, which it either fetches explicitly to the registers (internal storage of the CPU, quite small - on x86_64 that's like several 64-bit registers + other stuff like segment registers, IP, SP etc) with a separate instruction, or the data read from the memory (we are talking about small amount of data). That's all it really does.
Loading a file from a disk would be done by asking the appropriate controller to fetch the data into a specific place in memory. CPU is connected to buses which will carry instructions to appropriate controllers.
As to interrupts these are special things - CPU has several interrupt lines which can be activated by various devices, for example your network card. When it receives such an interrupt, it is usually handled by an interrupt handler, which is just a program located in a well-known place in memory. They can be registered by, for example, operating system. Each interrupt line has its own interrupt handler. When interrupt happens, the CPU saves the current state of the program it happens to be executing, handles interrupt, restores the state and resumes the program.
You seem to be asking about addressing modes. At the risk of gross oversimplification (ignoring caching, segments, and logical memory), memory stored as a sequential array accessed by an integer address.
The CPU has a number of internal storage areas called registers. We will call them R0 to Rn. The processor assigns some registers dedicated purposes. One of those registers is the PC.
One common addressing mode is deferred. I indicate this mode as (Rn). An instruction like this:
MOV (R0), R1
uses the value contained in R0 as a memory address, fetches the value stored that memory location, and stores a copy of that value in R1.
An instruction sequence like this:
MOV (R0), R1
MOV (R2), R3
is stored in memory as data (ignoring protection), code, data, and variables all use the same type of memory. In other words, any memory location can be interpreted as code, data, or variable.
The CPU executes the next instruction located at (PC). After executing the instruction, the CPU automatically increments the PC to point to the next instruction.

can we run kernel as an application, if we can load the kernel program into the address space

It sounds stupid, but is it possible to load any kernel program as an application on already running OS (not as virtual machine).
Like if we load the program into the process address space and run it.
[..] is it possible to load any kernel program as an application on already running OS [..]
No, since a kernel usually contains code to manage system resources, but the host kernel is already managing these. So this either leads to catastrophic failure, or - because we have memory protection and privilege levels and such - to access violations:
As a small example: Probably all kernels need to configure the interrupt service configuration of the underlying hardware (to get a timer tick, for example).
On x86, this is done by creating an interrupt descriptor table and loading the address of that table using the lidt instruction. When issued in an application process (which the host kernel will have running in ring 3, the least privileged level) the processor will refuse to execute that instruction, because it may only be issued in ring 0, and instead generate an general protection fault. The host kernel will be called to handle that situation (because when that kernel started it registered an interrupt descriptor table for exactly that purpose). The only way for the host kernel to react to that situation is to abort the process that caused the access violation, because otherwise it would risk system stability and integrity.
Similar issues arise when dealing with segmentation, paging, accessing memory mapped devices and access to peripherals in general.
That said, it is possible to create a kernel that can be run as a user space process, an example that I personally have worked with is RODOS, which can be run as a Linux process. To make this possible it is necessary to split hardware dependent stuff (which is a large portion) from the independent code (like scheduling, interprocess communication, ...) and provide stubs that reuse functionality of the host operating system to simulate some sort of hardware. (Of course, a such prepared kernel can only run on the host system if it is compiled for that use. You cannot use the same binary as you would use on raw hardware.)

Advice on using hypervisor to run a Real Time OS in parallel with Windows/Linux

What are your advice/experience of using a hypervisor (e.g. RTS Real-Time Hypervisor) to run an RTOS in parallel with a non real time OS. Are there any performance implications? Are there any risks involved? (like how can you ensure that the non-real time OS will not interfere with the real time aspects of the RTOS)
From what I understand, a dual core (or hyperthreading) CPU has to be used so that you can assign each OS its own core.
no, it doesn't need dual core or hyperthreading.
no, the non-RT tasks doesn't interfere with RT ones.
The main idea is to have one RTOS, which executes tasks written specifically for this OS, using it's own API. These tasks are set in string priority levels, where a higher priority task will allways take precedence over a lower priority one. The lowest priority tasks will execute only as long as there's no other task available to run (that is, they're all waiting for some event, either a timeout or an external signal).
all this is just like a usual multitasking OS scheduler, it doesn't need multple cores or hardware threads; it's just that the timing guarantees are radically different, and the available API reflects this fact.
In those hybrid implementations, there's a single lowest-level task that runs a full non-RT OS kernel, usually Linux or some other unix-like kernel (i don't know about windows, but should work the same). Nowadays, we call this architecture a hypervisor.
so, since the whole non-RT OS is run as the lowest-priority task, it doesn't have any guarantee of getting processing time at all. any RT task can interrupt it at any time, even when accessing hardware. to keep this, usually the RT tasks have very limited access to the hardware, or there are minimal arbitrations at very low level. ie: can interrupt a disk access (possibly resulting in a access error); but not a PCI access (as long as are short-lived and time-bounded)
there's also some soft-RT extensions to the Linux scheduler for some time now; but the timing guarantees aren't so tight as some hard-RT OSes built with that in mind.