Why does syscall need to switch into kernel mode? - operating-system

I'm studying for my operating systems final and was wondering if someone could tell me why the OS needs to switch into kernel mode for syscalls?

A syscall is used specifically to run an operation in kernel mode, because ordinary user code is not allowed to do that for security reasons.
For example, if you want to allocate memory, the operating system has the privilege to do it (it knows the page tables and is allowed to access the memory of other processes), but your user program must not be allowed to peek at or corrupt the memory of other processes.
It's a way of sandboxing you. So you issue a syscall asking the operating system to allocate the memory for you, and that allocation happens at the kernel level.
Edit: I see now that the Wikipedia article is surprisingly useful on this
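
To make this concrete, here is a minimal sketch (assuming Linux on x86-64) of a user program asking the kernel for memory through the raw syscall interface; the syscall instruction is what triggers the switch into kernel mode:

    /* Minimal sketch (Linux x86-64 assumed): asking the kernel for memory
     * through the raw system-call interface instead of malloc(). The CPU
     * switches to kernel mode when the syscall executes; only then can
     * the page tables be modified on our behalf. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Request one anonymous, private page from the kernel. */
        void *page = (void *)syscall(SYS_mmap, NULL, 4096,
                                     PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) {
            perror("mmap syscall");
            return 1;
        }
        /* The kernel mapped a page into *our* address space only;
         * no other process can see or corrupt it. */
        ((char *)page)[0] = 42;
        printf("kernel gave us a page at %p\n", page);
        return 0;
    }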

Since this is tagged "homework", I won't just give the answer away but will provide a hint:
The kernel is responsible for accessing the hardware of the computer and ensuring that applications don't step on one another. What would happen if any application could access a hardware device (say, the hard drive) without the cooperation of the kernel?


How does the kernel initialize and access the rest of the hardware on x86/64?

I'm interested in the details of how operating systems work and perhaps writing my own.
From what I've gathered, the BIOS/UEFI is supposed to handle setting up the hardware, and do things like memory-mapping (or IO ports for) the graphics card and other IO devices like audio and ethernet.
My question is, how does the kernel know how to access and (re)configure these devices when it's passed control from the bootloader? Are there just conventions like 'the graphics card is always memory mapped from X to Y address space'? Are you at the mercy of a hardware manufacturer writing a driver for an operating system which knows how the hardware will be initialized?
That seems wrong, so maybe the kernel code includes instructions which somehow iterate through all the bus-connected devices. But what instructions can accomplish that? Is the PCI(e) controller also a memory-mapped device? How do you begin querying and setting up the system?
My primary reference has been the Intel 64 Architectures Software Developer's Manual, which has excellent documentation on how the CPU works, but doesn't describe how the system is set up.
I have never written firmware, so I don't know exactly how it works in general. It probably performs some memory detection (an actual iteration over memory) and some interrogation of the PCI devices whose registers are memory-mapped. The developer's manuals probably also tell you how to obtain information about memory and similar resources.
In the end, the kernel doesn't need to worry about that, because the firmware takes care of all of it and provides temporary drivers until the kernel is completely set up.
The firmware passes information to the kernel through the ACPI tables, so that is the convention you are looking for. The UEFI firmware automatically launches the /efi/boot/bootx64.efi EFI application from the hard disk and calls its main function; that application is usually called the bootloader. When you write that application, often with a framework such as EDK2 or GNU-EFI, you can use the firmware's temporary drivers to obtain information such as the location of the RSDP, which points to all the other ACPI tables.
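As a rough sketch (GNU-EFI assumed; identifiers may differ slightly from the real headers), finding the RSDP from the UEFI System Table looks roughly like this:

    /* Rough sketch (GNU-EFI assumed): walk the UEFI configuration table
     * to find the ACPI 2.0 RSDP, which points at every other ACPI table. */
    #include <efi.h>
    #include <efilib.h>

    /* ACPI 2.0 table GUID, 8868E871-E4F1-11D3-BC22-0080C73C8881 */
    static EFI_GUID AcpiGuid =
        { 0x8868e871, 0xe4f1, 0x11d3,
          { 0xbc, 0x22, 0x00, 0x80, 0xc7, 0x3c, 0x88, 0x81 } };

    EFI_STATUS EFIAPI efi_main(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
    {
        InitializeLib(ImageHandle, SystemTable);

        VOID *rsdp = NULL;
        for (UINTN i = 0; i < SystemTable->NumberOfTableEntries; i++) {
            EFI_CONFIGURATION_TABLE *t = &SystemTable->ConfigurationTable[i];
            if (CompareMem(&t->VendorGuid, &AcpiGuid, sizeof(EFI_GUID)) == 0) {
                rsdp = t->VendorTable;   /* physical pointer to the RSDP */
                break;
            }
        }
        Print(L"RSDP at %lx\n", (UINT64)rsdp);
        /* A real bootloader would save this, load the kernel, then call
         * ExitBootServices() before jumping to it. */
        return EFI_SUCCESS;
    }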
ACPI also specifies a bytecode language, AML, which your kernel interprets to learn everything about the hardware. You thus have all the information required there to load drivers and so on.
PCI (which covers nearly everything nowadays) is another matter. It works with memory-mapped IO, and the ACPI tables (specifically the MCFG) help you find the start of the configuration space of PCI devices, which takes the form of memory-mapped registers.
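For illustration, here is a small freestanding-C sketch (function and parameter names are hypothetical) of how a kernel might locate a device's configuration registers once it has the ECAM base address from the MCFG:

    /* Sketch of reading a device's PCI configuration space via ECAM,
     * assuming the base physical address came from the ACPI MCFG table
     * and is already mapped into our address space at `ecam_base`. */
    #include <stdint.h>

    static inline volatile uint32_t *
    pcie_cfg_reg(uintptr_t ecam_base, uint8_t bus, uint8_t dev,
                 uint8_t func, uint16_t offset)
    {
        /* ECAM layout: 4 KiB of config space per function, addressed as
         * base + (bus << 20) + (device << 15) + (function << 12) + offset */
        uintptr_t addr = ecam_base
                       + ((uintptr_t)bus  << 20)
                       + ((uintptr_t)dev  << 15)
                       + ((uintptr_t)func << 12)
                       + offset;
        return (volatile uint32_t *)addr;
    }

    /* Example: read vendor/device ID of bus 0, device 0, function 0.
     * A value of 0xFFFFFFFF means "no device present". */
    uint32_t read_id(uintptr_t ecam_base)
    {
        return *pcie_cfg_reg(ecam_base, 0, 0, 0, 0x00);
    }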
As for graphics cards, you probably don't want to start with those. They are complex; at first you should stick to the framebuffer returned by UEFI and write at least a driver for xHCI, the host controller (itself a PCI device) responsible for interacting with USB devices, including keyboards and mice.

Who decides which instructions are to be kept privileged? Is it the hardware manufacturer or the OS developers?

I read that there are some privileged instructions in our systems that can only be executed in kernel mode. But I am unable to understand who makes these instructions privileged. Is it the hardware manufacturer that hardwires some harmful instructions as privileged with the help of a mode bit, or is it the OS designers that mark instructions as privileged so that they only work in privileged mode?
Kernel vs. user mode, and which instructions aren't allowed in user mode, is part of the ISA. That's baked into the hardware.
CPU architects usually have a pretty good idea of what OSes need to do and want to prevent user-space from doing, so these choices at least make privilege levels possible, i.e. make it impossible for user-space to simply take over the machine.
But that's not the full picture: on some ISAs, such as x86, later ISA extensions have added control-register flag bits that let the OS choose whether some other instructions are privileged or not. On x86 that's done for instructions that could leak information about kernel ASLR, or make timing side-channels easier.
For example, rdpmc (read performance monitor counter) can only be used from user-space if specially enabled by the kernel. rdtsc (Read TimeStamp Counter) can be read from user-space by default, but the TSD (TimeStamp Disable) flag in CR4 can restrict its use to priv level 0 (kernel mode). Stopping user-space from using high-resolution timing is a brute-force way of defending against timing side-channel attacks.
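For example, reading the TSC from user space is a single instruction; a minimal sketch, assuming x86-64 with GCC/Clang inline assembly:

    /* Small sketch: reading the TSC from user space with the rdtsc
     * instruction (x86-64, GCC/Clang inline asm assumed). This works by
     * default, but if the kernel sets CR4.TSD the same code would raise
     * a general protection fault outside ring 0. */
    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));  /* result in EDX:EAX */
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t t0 = read_tsc();
        uint64_t t1 = read_tsc();
        printf("two back-to-back reads: %llu -> %llu\n",
               (unsigned long long)t0, (unsigned long long)t1);
        return 0;
    }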
Another x86 extension defends against leaking kernel addresses, to make kernel ASLR more secret: CR4.UMIP (User Mode Instruction Prevention) disables instructions like sgdt that read the virtual address of the GDT. Those instructions were basically useless for user-space in the first place and, unlike rdtsc, could easily always have been privileged.
The Linux Kernel option to enable use of this extension describes it:
The User Mode Instruction Prevention (UMIP) is a security feature in newer Intel processors. If enabled, a general protection fault is issued if the SGDT, SLDT, SIDT, SMSW or STR instructions are executed in user mode. These instructions unnecessarily expose information about the hardware state.
The vast majority of applications do not use these instructions. For the very few that do, software emulation is provided in specific cases in protected and virtual-8086 modes. Emulated results are dummy.
Setting a new address for the IDT/GDT/LDT (e.g. lgdt/lidt) is of course a privileged instruction; those let you take over the machine. But until kernel ASLR was a thing, there wasn't any reason to stop user-space from reading the address. It could be in a page that had its page-table entry set to kernel only, preventing user-space from doing anything with that address. (... until Meltdown made it possible for user-space to use a speculative side-channel to read data from kernel-only pages that were hot in cache.)
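To make the UMIP example concrete, here is a small sketch (x86-64, GCC/Clang inline assembly assumed) of the kind of unprivileged read that sgdt performs; with CR4.UMIP set it faults in user mode, or on Linux gets emulated with dummy values as quoted above:

    /* Sketch of what UMIP blocks: the unprivileged sgdt instruction
     * stores the GDT's limit and linear base address into memory. */
    #include <stdint.h>
    #include <stdio.h>

    struct __attribute__((packed)) descriptor_ptr {
        uint16_t limit;
        uint64_t base;     /* kernel virtual address of the GDT */
    };

    int main(void)
    {
        struct descriptor_ptr gdtr;
        __asm__ volatile("sgdt %0" : "=m"(gdtr));
        printf("GDT: base=0x%llx limit=%u\n",
               (unsigned long long)gdtr.base, gdtr.limit);
        return 0;
    }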

Do CPU and main memory need drivers to work?

Peripheral devices require drivers to work in a computer system (operating system).
Does a CPU need a driver to work?
Same question for main memory?
The answer is no.
The reason is that the motherboard comes with an (upgradable) BIOS, which takes care of making sure the CPU features function correctly (obviously, an AMD processor won't work on an Intel motherboard). You can upgrade the BIOS, but that is best avoided unless you have a good reason to.
Same goes for memory, it does not require a driver either.
Just so you know, if you have ever tried overclocking you will have noticed that you can alter the way the RAM functions, ganged/unganged modes and so on. My point is that there is already an interface, established in code, that allows you to make changes in real time; isn't that the very purpose of drivers, to be able to use a peripheral with a predictable outcome?
On the other hand, peripheral devices are just extensions that the motherboard does not know how to handle, hence the need for a set of instructions, i.e. drivers.
In a modern system both memory and the CPU require kernel mode code — as do devices — to function.
Memory requires management of virtual memory tables. The CPU requires maintenance of process control structures.
In the business, such code is not called a "driver".
Generally, one thinks of a device driver as being kernel mode code that responds to devices through the interrupt vector.
That said, on some systems there are "printer drivers" that do not fit that definition of driver.
In short, do memory and CPU have something called a "driver"? No.
Do they have something analogous to a driver? Yes.
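To illustrate the kind of kernel-mode bookkeeping described above, here is a toy sketch (field names are hypothetical, not taken from any real kernel) of the per-process state an OS keeps for the CPU and for memory:

    /* Toy sketch (hypothetical, not from any real kernel) of the
     * per-process state the kernel maintains so the CPU and memory
     * can be shared safely. */
    #include <stdint.h>

    struct cpu_context {          /* saved and restored on every context switch */
        uint64_t regs[16];        /* general-purpose registers */
        uint64_t rip, rsp, rflags;
    };

    struct address_space {
        uint64_t page_table_root; /* e.g. the value loaded into CR3 on x86-64 */
    };

    struct process_control_block {
        int                  pid;
        int                  state;         /* running, ready, blocked, ... */
        struct cpu_context   ctx;           /* CPU "maintenance" state */
        struct address_space mm;            /* virtual-memory "maintenance" state */
        struct process_control_block *next; /* scheduler run-queue link */
    };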

How do computers prevent programs from interfering with each other?

For example, I heard in class that global variables are just put in a specific location in memory. What is to prevent two programs from accidentally using the same memory location for different variables?
Also, do both programs use the same stack for their arguments and local variables? If so, what's to prevent the variables from interleaving with each other and messing up the indexing?
Just curious.
Most modern processors have a memory management unit (MMU) that provides the OS with the ability to create protected, separate memory sections for each process, including a separate stack for each process. With the help of the MMU the processor can restrict each process to modifying / accessing only memory that has been allocated to it. This prevents one process from writing into another process's memory space.
Most modern operating systems will use the features of the MMU to provide protection for each process.
Here are some useful links:
Memory Management Unit
Virtual Memory
This is something that modern operating systems do by loading each process in a separate virtual address space. Multiple processes may reference the same virtual address, but the operating system, helped by modern hardware, will map each one to a separate physical address, and make sure that one process cannot access physical memory allocated to another process [1].
[1] Debuggers are a notable exception: operating systems often provide special mechanisms for debuggers to attach to other processes and examine their memory space.
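A quick way to see this in practice (POSIX assumed): after a fork(), parent and child print the same virtual address for a global variable yet observe different values, because the MMU maps that address to different physical pages:

    /* Demonstration (POSIX assumed): after fork(), parent and child see
     * the same virtual address for `global`, yet modifying it in one
     * process does not affect the other. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int global = 1;

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {                    /* child */
            global = 100;
            printf("child:  &global=%p value=%d\n", (void *)&global, global);
            return 0;
        }
        wait(NULL);                        /* parent */
        printf("parent: &global=%p value=%d\n", (void *)&global, global);
        return 0;
    }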
The short answer to your question is that the operating system deals with these issues. They are very serious issues, and a significant part of an operating system's job is keeping everything in its own separate space. The operating system tracks all running programs and makes sure each one stays within the space assigned to it. This keeps the stacks separate too: each program runs on its own stack assigned by the OS. How the OS does this assignment is actually a complex task.

OS memory isolation

I am trying to write a very thin hypervisor that would have the following restrictions:
runs only one operating system at a time (i.e. no OS concurrency, no hardware sharing, no way to switch to another OS)
it should only be able to isolate some portions of RAM (do some memory translation behind the OS's back: say I have 6 GB of RAM and I want Linux / Windows not to use the first 100 MB, to see just 5.9 GB, and to use that without knowing what is hidden behind it)
I searched the Internet, but found close to nothing on this specific matter, as I want to keep as little overhead as possible (the current hypervisor implementations don't fit my needs).
What you are looking for already exists, in hardware!
It's called an IOMMU [1]. Basically, like page tables, it adds a translation layer between memory accesses and the actual physical hardware.
AMD calls it IOMMU [2], Intel calls it VT-d (please google "intel vt-d"; I cannot post more than two links yet).
[1] http://en.wikipedia.org/wiki/IOMMU
[2] http://developer.amd.com/documentation/articles/pages/892006101.aspx
Here are a few suggestions / hints, which are necessarily somewhat incomplete, as developing a from-scratch hypervisor is an involved task.
Make your hypervisor "multiboot-compliant" at first. This will enable it to reside as a typical entry in a bootloader configuration file, e.g., /boot/grub/menu.lst or /boot/grub/grub.cfg.
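As a starting point, a minimal Multiboot (version 1) header can even be written in C; this is a sketch assuming GCC and a linker script that keeps the .multiboot section within the first 8 KiB of the image:

    /* Minimal sketch of a Multiboot (v1) header. GRUB scans the start of
     * the image for the magic number and checksum below. */
    #include <stdint.h>

    #define MB_MAGIC  0x1BADB002u
    #define MB_FLAGS  0x00000003u   /* page-align modules + provide memory map */

    struct __attribute__((packed)) multiboot_header {
        uint32_t magic;
        uint32_t flags;
        uint32_t checksum;          /* magic + flags + checksum == 0 */
    };

    __attribute__((section(".multiboot"), used, aligned(4)))
    static const struct multiboot_header mb_header = {
        .magic    = MB_MAGIC,
        .flags    = MB_FLAGS,
        .checksum = (uint32_t)-(MB_MAGIC + MB_FLAGS),
    };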
You want to set aside your 100 MB at the top of memory, e.g., from 5.9 GB up to 6 GB. Since you mentioned Windows, I'm assuming you're interested in the x86 architecture. The long history of x86 means that the first few megabytes are filled with all kinds of legacy device complexities; there is plenty of material on the web about the "hole" between 640K and 1MB. Older ISA devices (many of which still survive in modern systems in "Super I/O" chips) are restricted to performing DMA to the first 16 MB of physical memory. If you try to get in between Windows or Linux and its relationship with these first few MB of RAM, you will have a lot more complexity to wrestle with. Save that for later, once you've got something that boots.
As physical addresses approach 4GB (2^32, hence the physical memory limit on a basic 32-bit architecture), things get complex again, as many devices are memory-mapped into this region. For example (referencing the other answer), the IOMMU that Intel provides with its VT-d technology tends to have its configuration registers mapped to physical addresses beginning with 0xfedNNNNN.
This is doubly true for a system with multiple processors. I would suggest you start on a uniprocessor system, disable other processors from within the BIOS, or at least manually configure your guest OS not to enable the other processors (e.g., for Linux, include 'nosmp' on the kernel command line -- e.g., in your /boot/grub/menu.lst).
Next, learn about the "e820" map. Again there is plenty of material on the web, but perhaps the best place to start is to boot up a Linux system and look near the top of the output of 'dmesg'. This is how the BIOS communicates to the OS which portions of the physical memory space are "reserved" for devices or other platform-specific BIOS/firmware uses (e.g., to emulate a PS/2 keyboard on a system with only USB I/O ports).
One way for your hypervisor to "hide" its 100MB from the guest OS is to add an entry to the system's e820 map. A quick and dirty way to get things started is to use the Linux kernel command line option "mem=" or the Windows boot.ini / bcdedit flag "/maxmem".
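To sketch what that looks like in code (illustrative only; the entry layout follows the standard e820 format, but the helper and its names are hypothetical):

    /* Sketch of the e820 memory-map entry format and of carving 100 MB
     * out of the top of a 6 GB map before handing it to the guest. */
    #include <stdint.h>

    #define E820_USABLE   1
    #define E820_RESERVED 2

    struct __attribute__((packed)) e820_entry {
        uint64_t base;    /* start physical address */
        uint64_t length;  /* size in bytes */
        uint32_t type;    /* 1 = usable RAM, 2 = reserved, ... */
    };

    /* Shrink the usable entry that covers the top of RAM so the guest
     * never sees the hypervisor's 100 MB. */
    void hide_top_of_ram(struct e820_entry *map, int count, uint64_t hide_bytes)
    {
        uint64_t top = 6ULL << 30;                 /* assumed 6 GB of RAM */
        for (int i = 0; i < count; i++) {
            if (map[i].type == E820_USABLE &&
                map[i].base + map[i].length == top) {
                map[i].length -= hide_bytes;       /* e.g. 100 * 1024 * 1024 */
                break;
            }
        }
    }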
There are a lot more details and things you are likely to encounter (e.g., x86 processors begin in 16-bit mode when first powered-up), but if you do a little homework on the ones listed here, then hopefully you will be in a better position to ask follow-up questions.