What does the kernel of an operating system do?

What exactly is the kernel of an operating system, and what does it do? I have tweaked my Linux kernel a couple of times without knowing why I was doing it or what it changed.

The kernel is the central part of the operating system. It handles the 'behind the scenes' tasks so that your user space application (end user application) may run without knowing the inner details and complexities of the hardware.
Most importantly, a kernel performs:
Memory Management
Process Management
Virtual Filesystem support
Interrupt handlers
and more.
You can read more about the inner workings of the Linux kernel in the wonderful book "Linux Kernel Development" by Robert Love.
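As a rough illustration of a user space application relying on the kernel without knowing hardware details, the sketch below uses only Python's standard os module. Each call is a thin wrapper around a kernel service; the pipe and the message are arbitrary choices for the example, not something from the answer above.

```python
import os

# Ask the kernel for a pipe: it allocates an internal buffer and hands back
# two file descriptors (one for reading, one for writing).
read_fd, write_fd = os.pipe()

# The kernel copies these bytes from our process into its pipe buffer.
written = os.write(write_fd, b"hello")

# The kernel copies the bytes back out of its buffer into our process.
data = os.read(read_fd, 5)

os.close(read_fd)
os.close(write_fd)

# Process management: the kernel tells us which process we are.
pid = os.getpid()
print(written, data, pid)
```

The program never touches memory controllers, disks, or interrupt lines itself; it only asks the kernel to do the work on its behalf.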

The kernel is the software that handles exceptions and interrupts in the system.

Related

Is kernel mode the time when OS gains full control of the system?

I have read that a computer system operates in dual mode, comprising kernel mode and user mode. Is kernel mode equivalent to the time during which the OS has full control of the computer system?

Is kernel mode the time when OS gains full control of the system?
That's not a good way of looking at things. Processors generally have multiple modes of operation (usually 2 or 4 but sometimes others) that have different levels of privilege. One of those modes is Kernel mode.
So kernel mode is the time when the processor is executing at the highest privilege level.
The operating system may not have full control when executing in kernel mode. It is possible on some systems for application code to run in kernel mode IF the process or application has sufficient privilege.
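A toy model may make the idea of privilege levels more concrete. This is only a sketch: the instruction names and the Trap exception are hypothetical, and a real processor enforces the check in hardware rather than in code like this.

```python
class Trap(Exception):
    """Raised when code attempts something its privilege level does not allow."""


# Hypothetical instruction names, chosen only for illustration.
PRIVILEGED = {"halt", "load_page_table", "disable_interrupts"}


def execute(instruction, ring):
    """Run one instruction at the given privilege ring (0 = most privileged)."""
    if instruction in PRIVILEGED and ring != 0:
        raise Trap(f"{instruction} not allowed in ring {ring}")
    return f"{instruction} executed in ring {ring}"


def try_execute(instruction, ring):
    """Like execute(), but report a trap as text instead of raising."""
    try:
        return execute(instruction, ring)
    except Trap as trap:
        return f"trap: {trap}"


print(try_execute("add", 3))
print(try_execute("halt", 3))
print(try_execute("halt", 0))
```

Note that the check depends only on the current ring, not on who wrote the code, which is why code other than the operating system can end up running in kernel mode on some systems.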

I recommend you check Section 1.4.2, "Dual-Mode and Multimode Operation", of the famous Dinosaur book: Operating System Concepts, 10th edition, by Abraham Silberschatz.
CPUs from different vendors implement kernel mode differently. For example, Intel processors have four privilege levels (rings), with ring 0 being kernel mode and ring 3 being user mode, and ARM v8 has 7 different modes. There is also a separate mode for the virtual machine manager, which has more privileges than user mode but fewer than the kernel.
Also, your question is not entirely clear to me. Hopefully this gives you a decent starting point.

How do the operating system's code and user applications' code run on the same processor?

We all know that the operating system is responsible for managing the resources needed by user applications. The OS is itself a piece of code that runs, so how does it manage other user programs?
Does the OS run on a dedicated processor and monitor user programs running on another processor?
How does the OS actually handle user applications?

It depends upon the structure of the operating system. For any modern operating system the kernel is invoked through exceptions or interrupts. The operating system "monitors" processes during interrupts. An operating system schedules timer interrupts. When the timer goes off the interrupt handler determines whether it needs to switch to a different process.
Another OS management path is through exceptions. An application invokes the operating system through exceptions. An exception handler can also cause the operating system to switch to another process. If a process invokes a read and wait system service, that exception handler will certainly switch to a new process.
In ye olde days, it was common for multi-processors to have one processor that was the dedicated master and was the only processor to handle certain tasks. Now, all normal operating systems use symmetric multi-processing where any processor can handle any task.
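The two paths described in this answer (a timer interrupt that forces a switch, and a blocking system service that gives up the CPU) can be sketched with a small simulation. This is a simplified model under stated assumptions: the step names "cpu" and "io" and the function run are made up for the example, and a blocked process is treated as ready again immediately, which a real kernel would not do.

```python
from collections import deque


def run(processes, quantum):
    """Simulate timer-driven switching on a single processor.

    processes: dict mapping a process name to a list of steps, each step
    being "cpu" (ordinary work) or "io" (a blocking system service).
    A timer interrupt fires after `quantum` consecutive steps; an "io"
    step hands the CPU back to the kernel straight away.
    Returns the order in which processes were given the CPU.
    """
    ready = deque(processes)
    remaining = {name: deque(steps) for name, steps in processes.items()}
    schedule = []
    while ready:
        name = ready.popleft()
        schedule.append(name)
        used = 0
        while remaining[name] and used < quantum:
            step = remaining[name].popleft()
            used += 1
            if step == "io":
                break  # exception path: the process asked to wait
        # Either the timer expired, the process blocked, or it finished.
        if remaining[name]:
            ready.append(name)
    return schedule


print(run({"A": ["cpu", "cpu", "cpu"], "B": ["io", "cpu"]}, quantum=2))
```

In both cases the kernel only gets to decide anything because an interrupt or exception transferred control to it; between those events the application code has the processor to itself.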

An entire book is needed to answer such a broad question.
Read Operating Systems: Three Easy Pieces (a freely downloadable book).
Does the OS run on a dedicated processor and monitor user programs running on another processor?
In general, no. The same processor (or core) is either in user mode (for user programs; read about user space, process isolation, and protection rings) or in supervisor mode (for the operating system kernel).
How does the OS actually handle user applications?
Often by providing system calls, which applications invoke in a controlled way.
Some academic OSes, e.g. Singularity, have been designed with other principles in mind (formal proof techniques for isolation).
Read also about micro-kernels, unikernels, etc.

What is left in an operating system if we remove the kernel? [duplicate]

I know that an operating system is nothing without a kernel. But in an interview I was asked:
What is (OS - kernel)? In other words, what exactly is left if we remove the kernel from an operating system?
(Please do not downvote this if it is silly; please answer in the comments and I will then delete this question.)

In addition to Sam Dunk's (see other post) statement, there is one other component that is part of the "operating system", for a given value of operating system: the boot loader.
When a PC (and presumably other architectures) boots up, the BIOS loads the boot sector. The BIOS is not part of the operating system. The boot sector (arguably) is. The boot sector (limited to 512 bytes!) loads the bootloader.
The bootloader may offer a choice between different operating systems (where multiple operating systems are installed on the same computer), and/or options for loading the operating system (e.g. "Safe mode", or different run levels for Unix). The bootloader then loads the (appropriate) kernel and runs it. As soon as control is passed to the kernel, the bootloader is discarded (until the next boot).
The above is somewhat simplified.
For further reading on how the parts fit together (in the case of Linux), see "Inside the Linux boot process" http://www.ibm.com/developerworks/library/l-linuxboot/ for example. The master boot record is referred to as "Stage 1 boot loader", and what I referred to as "the boot loader" they refer to as "Stage 2 boot loader".
Details will vary from O/S to O/S.
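To make the 512-byte limit concrete, here is a small sketch of how a classic PC BIOS boot sector is laid out. It assumes the traditional BIOS convention that the last two bytes of the sector are 0x55 0xAA, which is not mentioned in the answer above; the function names are made up for the example, and UEFI systems boot differently.

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # assumed: classic BIOS boot signature at offsets 510-511


def make_boot_sector(code: bytes) -> bytes:
    """Pad boot code to one sector and append the boot signature."""
    if len(code) > SECTOR_SIZE - len(BOOT_SIGNATURE):
        raise ValueError("boot code does not fit in one sector")
    padded = code.ljust(SECTOR_SIZE - len(BOOT_SIGNATURE), b"\x00")
    return padded + BOOT_SIGNATURE


def is_bootable(sector: bytes) -> bool:
    """Check the size and signature the way a BIOS would before jumping in."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


sector = make_boot_sector(b"\xeb\xfe")  # two bytes of x86 code: jump to self
print(len(sector), is_bootable(sector))
```

With so little room, the boot sector can do little more than locate and load the next stage, which is why the real bootloader lives elsewhere on disk.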

To add to Sam Dunk's answer, we have to think about the purpose of having an operating system. An OS does memory management, process scheduling, device management and so on, but that is not why we need an OS; it is how the OS does its job. The reason we need an OS is that it abstracts the underlying hardware for applications. Period. Nothing else. The other stuff, like the user interface and system utilities, is just sugar added on top (hey, a command-line OS is still an OS). That abstraction layer is the kernel, or the core of the OS. It provides a simplified and consistent platform for applications to execute across multiple hardware configurations.
For an analogy, think about the pipes and cables behind the walls in your house. Without them your wall sockets and water taps are practically useless. The sinks, cabinets, walls to separate rooms, are the system applications. (They usually come with the house, but they aren't absolutely necessary.)

Full emulation vs. full virtualization

In full emulation the I/O devices, CPU, main memory are virtualized. The guest operating system would access virtual devices not physical devices. But what exactly is full virtualization? Is it the same as full emulation or something totally different?

Emulation and virtualization are related but not the same.
Emulation is using software to provide a different execution environment or architecture. For example, you might have an Android emulator run on a Windows box. The Windows box doesn't have the same processor that an Android device does so the emulator actually executes the Android application through software.
Virtualization is more about creating virtual barriers between multiple virtual environments running in the same physical environment. The big difference is that the virtualized environment is the same architecture. A virtualized application may provide virtualized devices that then get translated to physical devices and the virtualization host has control over which virtual machine has access to each device or portion of a device. The actual execution is most often still executed natively though, not through software. Therefore virtualization performance is usually much better than emulation.
There's also a separate concept of a Virtual Machine such as those that run Java, .NET, or Flash code. They can vary from one implementation to the next and may include aspects of either emulation or virtualization or both. For example, the JVM provides a mechanism to execute Java bytecodes. However, the JVM spec doesn't dictate that the bytecodes must be executed by software or that they must be compiled to native code. Each JVM can do its own thing, and in fact most JVMs do a combination of both, using emulation where appropriate and using a JIT where appropriate (I think the JIT in Sun/Oracle's JVM is called HotSpot).

A Hypervisor is a supervisor of supervisors i.e. it's the kernel that controls kernels.
Type 1 vs Type 2 vs Hybrid Hypervisors
A Type 1 hypervisor is an OS designed to run VMs. It is installed directly on the disk to be executed from the boot sector like any OS; it is an OS purpose built to manage and run VMs and that's all you can do on it (and like an OS, it can be monolithic or microkernelised). All OSs installed on it run as guests.
A Type 2 hypervisor is a hypervisor that runs on top of an OS that's designed to run applications, either as an application (full emulation), or by modifying the kernel with a driver to give the OS functionality to run VMs (virtualisation), which installs itself below/alongside the host OS invisibly, and the host OS continues to run (in ring 0 in the case of software virtualisation and non-VMX mode ring 0 in the case of the driver supporting hardware virtualisation) and the hypervisor that is hooked below it manages the guests in VMX non-root mode (in the case of hardware virtualisation) or ring 1 (in the case of software virtualisation) while passing off to the host OS where appropriate and calling into and using the host OS and its drivers to access hardware (which is why it is often pictured to be above the host OS). The GUI program on the host OS communicates with the driver and there is a subprocess per VM and thread per vCPU.
A Hybrid hypervisor is an OS that's designed to run applications and VMs. It can run in regular host OS mode but it has a hypervisor mode, which when booted into loads the host OS as a guest on top of a hypervisor, and can load other guests. The hypervisor is typically a microkernelised hypervisor, meaning the hardware drivers are implemented in the host OS (called the parent partition) rather than the hypervisor (on Hyper-V, the Integration Services Components drivers can be installed on other guests to communicate with host OS drivers via the VMBUS system that the host OS sets up). The host OS runs in VMX non-root mode with a VMCS. Theoretically you could get a paravirtualised hybrid hypervisor, but KVM and Hyper-V only support hardware virtualisation. You could also have a monolithic hybrid hypervisor, but that doesn't make much sense: because of the presence of the host OS, the hypervisor only needs to be microkernelised. A hybrid hypervisor is essentially a type 1 hypervisor that can boot into type 1 hypervisor mode and host OS mode separately. A microkernelised hypervisor is typically hybrid because the host OS used is the one already installed (and which the microkernelised hypervisor functionality is already part of; it's available as a feature install on Windows Server).
Fully emulated Type 2 Hypervisors
A full emulator emulates all registers of the target ISA as variables and the CPU is completely emulated. This can be due to wanting to emulate a guest whose ISA is not the same ISA as the host (or indeed it can be the same if you run an x86 emulator, e.g. Bochs, and you happen to be running it on an x86 system; it doesn't matter). As Peter says, the emulator does not need privileged accesses (ring 0 driver helper), because all interpretation and emulation is done local to the process and the process calls regular host I/O functions. This works because none of the code needs to run natively. If you want it to run natively, you have to bring this functionality to ring 0 via a driver. Full emulation is an emulation of everything: the CPU, the chipset, the BIOS, devices, interrupts, page walk hardware, TLBs. The emulator process runs in ring 3 but this is not visible to the guest, which sees emulated/virtual rings (0 and 3). These are monitored by the interpreter, which emulates interrupts by assigning values to the register variables on violation based on the instruction it is interpreting, mimicking what the CPU would do at each stage but in software. The emulator reads an instruction from an address, analyses it, and every time a register, e.g. EDX, comes up, it reads the EDX variable (emulated EDX). It mimics the operation of the CPU, which is slow because there are multiple operations for a single operation that is usually handled transparently by the CPU.
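The "registers as variables" idea can be sketched with a tiny interpreter. The instruction set below is hypothetical (it only borrows x86-looking register and mnemonic names) and real emulators such as Bochs are far more involved; the point is just that every guest instruction becomes several host operations.

```python
def emulate(program):
    """Interpret a program for a hypothetical two-register machine."""
    # The guest's registers are ordinary variables in the emulator process.
    regs = {"eax": 0, "edx": 0, "eip": 0}
    while regs["eip"] < len(program):
        op, *args = program[regs["eip"]]  # fetch and decode
        regs["eip"] += 1
        if op == "mov":      # mov reg, immediate
            regs[args[0]] = args[1]
        elif op == "add":    # add dst, src
            regs[args[0]] = (regs[args[0]] + regs[args[1]]) & 0xFFFFFFFF
        elif op == "dec":    # dec reg
            regs[args[0]] = (regs[args[0]] - 1) & 0xFFFFFFFF
        elif op == "jnz":    # jump to target if reg is non-zero
            if regs[args[0]] != 0:
                regs["eip"] = args[1]
        elif op == "hlt":
            break
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return regs


# Sum 3 + 2 + 1 into eax using a countdown loop in edx.
program = [
    ("mov", "eax", 0),
    ("mov", "edx", 3),
    ("add", "eax", "edx"),
    ("dec", "edx"),
    ("jnz", "edx", 2),
    ("hlt",),
]
result = emulate(program)
print(result["eax"], result["edx"])
```

Nothing here needs privileged access to the host: the whole guest CPU state is a dictionary inside an ordinary process.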
If the guest attempts to access a virtual address, the dynamic recompiler takes this guest virtual address and traverses the guest page table (mimicking a tlb miss page walker) using the vCR3 and then it reads directly from each physical address produced by vCR3+guest virtual address part using the emulator process page table whose cr3 it has no control over as it is a process and as far as the host OS is concerned the physical address is just a virtual address in the process (guest physical maps to a host virtual by adding an offset and then acting like a host virtual address, so an implicit P2M table). If the dynamic recompiler detects an invalid bit on the guest PTE as it traverses using vCR3 then it simulates a page fault to the guest putting the address in the vCR2.
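A minimal sketch of such a software page walk follows. It assumes a simplified two-level, 32-bit x86-style layout (4 KiB pages, 4-byte entries, bit 0 as the present bit), models guest physical memory as a bytearray, and uses a made-up HOST_BASE constant for the "add an offset" step; none of these specifics come from the answer.

```python
import struct

PRESENT = 1
HOST_BASE = 0x7F0000000000  # hypothetical address of guest RAM in the emulator process


class GuestPageFault(Exception):
    def __init__(self, gva):
        super().__init__(hex(gva))
        self.vcr2 = gva  # faulting address the emulator would place in vCR2


def read_u32(mem, gpa):
    return struct.unpack_from("<I", mem, gpa)[0]


def write_u32(mem, gpa, value):
    struct.pack_into("<I", mem, gpa, value)


def walk(mem, vcr3, gva):
    """Translate guest virtual -> guest physical by walking the guest's tables."""
    pde = read_u32(mem, vcr3 + 4 * (gva >> 22))
    if not pde & PRESENT:
        raise GuestPageFault(gva)
    pte = read_u32(mem, (pde & ~0xFFF) + 4 * ((gva >> 12) & 0x3FF))
    if not pte & PRESENT:
        raise GuestPageFault(gva)
    return (pte & ~0xFFF) | (gva & 0xFFF)


def gpa_to_hva(gpa):
    """Guest physical -> host virtual is just an offset (an implicit P2M)."""
    return HOST_BASE + gpa


mem = bytearray(4 * 4096)                  # four frames of guest "physical" memory
vcr3 = 0x0000                              # page directory lives in frame 0
write_u32(mem, vcr3 + 4 * 1, 0x1000 | PRESENT)  # directory entry 1 -> table in frame 1
write_u32(mem, 0x1000, 0x3000 | PRESENT)        # table entry 0 -> data frame 3
gpa = walk(mem, vcr3, 0x00400ABC)
print(hex(gpa), hex(gpa_to_hva(gpa)))
```

Every guest memory access that misses the emulated TLB pays for a walk like this in software, which is a large part of why full emulation is slow.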
Software Virtualised Type 2 Hypervisors
Full virtualisation, which is a type 1 hypervisor scheme, can actually be used on type 2 hypervisors and is a step up in performance from the former and can only be used if the guest ISA is the same as the host ISA. Full virtualisation cannot be achieved on x86 because:
There are certain flaws in the implementation of ring 1 in the x86 architecture that were never fixed. Certain instructions that should trap in ring 1 do not. This affects, for example, the LGDT/SGDT, LIDT/SIDT, or POPF/PUSHF instruction pairs. Whereas the "load" operation is privileged and can therefore be trapped, the "store" instruction always succeeds. If the guest is allowed to execute these, it will see the true state of the CPU, not the virtualized state. The CPUID instruction also has the same problem.
Actually, this applies to ring 3 too. It's not just a glitch with ring 1. SGDT etc is not a privileged instruction, but allowing the VM to execute it contradicts Popek and Goldberg requirements because the VM can read the real state of the CPU and get the address of the real GDT rather than the virtual. Before UMIP, software full virtualisation was not possible on x86, and before Intel VT, x86 CPUs didn't inherently conform to Popek and Goldberg's requirements, so paravirtualisation had to be used. Paravirtualisation still does not conform to Popek and Goldberg (because only kernel mode code is patched, so SGDT can be used), but at least it works, whereas full virtualisation doesn't work at all, because SGDT will read a bogus value (the host SGDT) in guest kernel mode, meaning the guest kernel code using SGDT will not work as desired if it is not patched. SGDT being available in user mode at least doesn't compromise the host OS, whereas LGDT definitely would.
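The asymmetry between the trapping "load" and the non-trapping "store" can be shown with a toy model. The classes and addresses below are hypothetical, and the model describes the situation without UMIP; it is not how any particular hypervisor is written.

```python
class ToyCPU:
    def __init__(self, host_gdtr):
        self.gdtr = host_gdtr  # the real register, owned by the host OS


class ToyHypervisor:
    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_gdtr = None  # what the guest believes GDTR holds

    def guest_executes(self, op, operand=None):
        """Run one instruction issued by a deprivileged guest kernel."""
        if op == "lgdt":
            # Privileged: it faults outside ring 0, so the hypervisor gets
            # control and updates only the guest's virtual copy.
            self.virtual_gdtr = operand
            return None
        if op == "sgdt":
            # Not privileged: it runs directly on the CPU with no trap,
            # so the guest reads the real register.
            return self.cpu.gdtr
        raise ValueError(op)


cpu = ToyCPU(host_gdtr=0xF000)
hv = ToyHypervisor(cpu)
hv.guest_executes("lgdt", 0x1000)
seen = hv.guest_executes("sgdt")
print(hex(hv.virtual_gdtr), hex(seen))
```

The guest loaded 0x1000 but reads back 0xF000, which is exactly the mismatch that breaks unpatched guest kernel code and violates the Popek and Goldberg requirements.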
VirtualBox uses ring 1 full virtualisation, but paravirtualises the problematic instructions that act like they are executing in ring 0 despite being in ring 1, and requires the help of a ring 0 driver; the driver functions as the hypervisor. Surprisingly, there is very little information on how type 2 hypervisors are implemented. The following will be my best guess on the matter -- how I would implement a type 2 hypervisor given the hardware and host OS operation.
On Windows, I'd imagine when the driver starts it will initialise symbolic links and wait for the user mode virtualbox software to issue IOCTLs using DeviceIoControl to start a virtual machine instance. The handler will perform the following process: The driver injects a handler into the IDT for the general protection fault. It can do this by putting a wrapper around KiInterruptDispatch by replacing KiInterruptTemplate in the IDT with the wrapper. On windows, it could inject a wrapper into all IDT entries including entry bug check entries but this means hooking into the IDT write routines for new interrupts. What it probably does to achieve this is read the virtual address in IDTR and write protect the region and then host updates to the IDT will trap into the hypervisor GPF wrapper which will install a wrapper at the IDT entry written to.
However, a 64 bit windows guest on a 64 bit windows host needs to be able to have its own kernel space, but the problem is, it will be at exactly the same location as the host kernel structures. Therefore, the driver needs to wipe the whole kernel view of the virtualbox process. This cannot be mapped in or visible to the guest. It does this by removing the entries from the cr3 page of the virtualbox process. The GDT and IDT used by the virtualbox process and other host processes needs to be the same, but in order to avoid reserving guest virtual addresses, when the guest writes to the IDTR, the hypervisor could use this as the actual IDTR value, but virtually map it in the SPT to the same physical 4KiB IDT frame that the host uses. This means that the hypervisor driver needs to change the IDTR when switching between the guest and host threads. Because the guest virtual page that maps the IDT is write protected, any writes to this range by the guest will be logged by the hypervisor in a guest IDT that it builds if the cr3 is of one of its guests' processes. The issue is that when the ISR is handled, it will jump to a hypervisor RIP that is not mapped into the process because the driver lies in the host kernel; therefore, the RIP of this wrapper needs to be mapped into the SPT. This means you can't get away with reserving no virtual memory in the guest, and for that reason, you could probably get away with reserving the 4KiB address range the host uses for its IDT and silently redirecting guest accesses to a different host physical page and then not having to change the IDTR on a task switch. 
All reserved memory for the handlers in the host IDT would also have to be redirected silently to different host physical pages (because they will be supervisor pages so they will fault anyway and the hypervisor just redirects the reads and writes to a different host physical page, which won't happen after an interrupt because it will be in ring 0, so the jump in the IDT will be in the real host physical page mapped to it as it doesn't GPF so the hypervisor can't redirect), so the guest is unaware that that region is reserved. There will be a different wrapper for each IDT entry, which will call a main handler (which also needs to be mapped) and pass it an IDT entry code. The wrapper will pass the cr3 in a register, change the cr3 to a dummy process that maps the host kernel and then call the main handler. The main handler checks whether the cr3 is a guest's shadow cr3 or a host cr3 and performs the appropriate action.
The driver will also have to inject itself into the clock interrupt in the same way -- if the clock interrupt fires, the guest state or host state (which includes current cr3) is pushed and the hypervisor handler will push the address of the guest IDT clock interrupt onto the kernel stacks of all vCPU threads it manages (emulating what the CPU would do) in a new trap frame if there isn't one already present and then call the original host handler after changing the cr3 to one that maps the host kernel. This would ensure a context switch in the guest every time it is scheduled in on the host and therefore guest clock interval would roughly match up to host clock interval.
Full virtualisation would be referred to as 'trap and emulate', but it is not full emulation because all ring 3 code actually runs on the host CPU (as opposed to full emulation where the code that runs is the interpreter which fetches lines to read). Also, the TLBs and page walk hardware are actually used directly whereas on the emulator, every memory access requires a walk in software if not present in an emulated TLB array in software. Only the privileged instructions and registers, interrupts, devices and BIOS are emulated to the guest -- partial emulation -- emulation still occurs, but when any amount of the code runs natively, it becomes referred to as a virtualisation (full, para or hardware assisted).
When the guest traps into the guest OS it will either use INT 0x2e or syscall. The hypervisor obviously has injected a wrapper at 0x2e for INT and it will insert a handler at the SYSENTER_CS_MSR:SYSENTER_EIP_MSR for sysenter or IA32_LSTAR MSR for syscall. The handler in the MSR needs to be mapped into the SPT and will check to see if the cr3 is the shadow of one of the guest processes and if it isn't it doesn't need to change cr3 as the current will contain the host kernel and jumps to the host handler. If it is a cr3 of a guest process, it changes the cr3 to a dummy process (probably a virtualbox host process specifically for IO tasks that maps the host kernel) and jumps to a main handler, passing RIP in guest IDT that it has built to the recompiler/patcher which walks through and paravirtualises certain instructions that aren't guaranteed to trap, replacing them permanently with jumps to hypervisor memory where it places better code (which will cause protection faults as they're ring 0 in the SPT) until it reaches a IRET or sysexit etc and then it changes back the cr3 to that of the guest and executes an IRET after putting a ring 1 privilege on the stack to the RIP in the guest IDT it has built and then the actual guest ISR executes. When a trap occurs due to executing a ring 0 instruction in ring 1 or an inserted paravirtualised trap occurs, the ISR injected at the general protection fault entry / hypervisor ISR will make sure that the cr3 is of a guest process and it will claim and handle the issue, if it isn't then the cr3 doesn't need to be changed to one that includes host kernel in order to pass control to the host handler because it will be in the context of a non guest process. One instance where this could occur is the guest writing to cr3 for a guest context switch. 
This needs to be emulated as the guest must not be able to execute this instruction and modify the cr3 because it would change the cr3 of the host process on the host OS; the hypervisor needs to intercept the write and write a new shadow cr3 and not the cr3 the guest wants. When the guest reads cr3, this mechanism prevents the guest from reading the real cr3: the hypervisor inserts the value of the guest-written cr3 (not the shadow one) into the requested register, pushes the next instruction address onto the stack and resumes execution with an iret to the ring it was in.
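The cr3 interception described here can be sketched as a small bookkeeping class. Everything in it is hypothetical (the class name, the pool of shadow root pages starting at 0x100000, the addresses used); it only shows the pairing of a guest-chosen cr3 with a hypervisor-chosen shadow cr3.

```python
class Cr3Virtualizer:
    def __init__(self, host_cr3):
        self.hardware_cr3 = host_cr3   # what the real CPU register holds
        self.guest_cr3 = None          # what the guest thinks it holds
        self.shadow_for = {}           # guest cr3 -> shadow cr3
        self._next_shadow = 0x100000   # hypothetical pool of shadow root pages

    def on_guest_write_cr3(self, value):
        """Called from the fault handler when the guest executes mov cr3, value."""
        if value not in self.shadow_for:
            self.shadow_for[value] = self._next_shadow
            self._next_shadow += 0x1000
        self.guest_cr3 = value
        # Load the shadow root, never the guest's own value.
        self.hardware_cr3 = self.shadow_for[value]

    def on_guest_read_cr3(self):
        """Called when the guest executes mov reg, cr3."""
        return self.guest_cr3  # hide the shadow value from the guest


v = Cr3Virtualizer(host_cr3=0xAAAA000)
v.on_guest_write_cr3(0x5000)
print(hex(v.hardware_cr3), hex(v.on_guest_read_cr3()))
```

The guest always reads back the value it wrote, while the hardware only ever runs on page tables the hypervisor controls.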
Guest I/O will be targeted at a guest physical address space that maps onto virtual buffers and registers of emulated devices defined in the hypervisor. These emulated registers (e.g. doorbell registers) will be checked in a host context at regular intervals (clock interrupt hook for instance) in the exact same way a device would react to changes to hardware registers and the handler will decide whether an interrupt needs to be emulated (pushing an interrupt onto the kernel stack of the thread representing the selected vCPU to interrupt based on MSI vector assigned by guest in the emulated configuration space) or, due to an emulated register write, an I/O operation needs to be constructed using the Native windows API functions to the guest specified buffer (translating GVA->HPA and allowing real hardware to write to the physical page that the guest buffer will use).
As for paging on a paravirtualised 64 bit type 2 hypervisor, it is a tricky one. The hardware uses a shadow page table (SPT) which is a mapping of GVAs to HPAs. My best guess is that the hypervisor driver selects a shadow cr3 page from the locked pages of the virtualbox process for every GP fault (executing ring 0 instruction in ring 1) that it sees a new guest assigned cr3 address being written to cr3. It pairs this guest chosen address with the address of the hypervisor chosen shadow cr3 page and changes the virtualbox process cr3 to that of the shadow cr3 rather than the guest one that was attempted to be written. The shadow cr3 page (you'll see written everywhere that the guest page tables are write protected but it just has to be wrong because it is the shadow page tables that run on the CPU and therefore are the only ones that can cause protection faults; the shadow cr3 is used not the guest cr3) is write protected by the kernel driver (which is done by read/write bit in the recursive PML4 entry to itself). The cr3 page of certain GPA that the guest attempts to use will be translated to its associated HPA by the hypervisor and the entries in the cr3 page will be copied to the shadow cr3 and GPA addresses in the PML4Es will be translated to the HPAs using the P2M table. Every time the guest goes to write this to the guest cr3 page by virtual address, this virtual address will always be of the shadow cr3 page, not the guest cr3 page, and it will fault because of the write protect bit and being in ring 1. 
The handler injected at the general protection fault will then see a shadow cr3 of one of its guests processes and it will perform the write that was conceptually attempted in guest PTE, in the SPT at the same location (where it actually faulted), and it inserts the host physical address instead of the guest physical address that it tried to write (which it translates using the P2M TLB or P2M; I think the P2M is filled when you start the VM, because the VirtualBox process uses VirtualLock to lock the specified amount of RAM for the virtual machine) (the hypervisor can maintain virtual TLBs for the P2M (guest frame to host frame mappings) and guest page tables (guest virtual page to guest frame mappings), which it can check before performing software page walks, whereas the hardware maintains the TLBs for the SPT). Then the hypervisor will check the virtual TLBs for a quick translation of the CR2 GVA to a GPA; if not present it will trace the guest page table (by accessing the guest cr3 via its HVA (translates GPA->HPA using P2M and then HPA->HVA using a kernel function)) and write to the entry as the guest wanted with the attempted guest GPA. When a page fault occurs, the handler checks the shadow cr3 is one of its guest processes and checks the SPT (gets virtual address of entry associated with faulting GVA using Windows kernel function, as if it were a regular process) and then walks the guest page table using the guest cr3 associated with the current cr3, parsing the SPT virtual address that faulted (translates GPA -> HPA -> HVA). If the shadow PTE is invalid then it is a shadow page fault. 
If the guest PTE is invalid as well then it emulates an interrupt using the RIP of the address in the page fault entry of the guest IDT pushing it on the stack; before it does this it patches the code in the recompiler as described before (when guest reads from its page table during the interrupt, it will actually be reading the SPT and therefore the SPT needs to be read protected with a supervisor bit so it can be intercepted and the guest page table entry can be read instead from the address in the faulting memory access). For any other interrupt that occurs i.e. a host device, it is not meant for the guest and therefore if the handler sees the current cr3 belongs to a process of one of its guests it will change the cr3 to a dummy process that contains the host kernel mapping and calls the original KiInterruptTemplate for the host handler; after the host handler has finished, it will replace the cr3.
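At its core, a shadow page table is the composition of the guest's own mapping (GVA to GPA) with the P2M (GPA to HPA). The sketch below flattens both into dictionaries keyed by page number, which is a deliberate simplification of the multi-level tables discussed above; the function names and return strings are made up for the example.

```python
def build_shadow(guest_pt, p2m):
    """Compose GVA->GPA (guest table) with GPA->HPA (P2M) into GVA->HPA."""
    return {
        gva_page: p2m[gpa_page]
        for gva_page, gpa_page in guest_pt.items()
        if gpa_page in p2m
    }


def handle_fault(gva_page, guest_pt, p2m, shadow):
    """Decide what a page fault on gva_page means, filling the shadow lazily."""
    if gva_page not in guest_pt:
        # The guest's own PTE is invalid: this is a genuine guest fault.
        return "inject page fault into guest"
    # The guest PTE is valid but the shadow entry was missing: a shadow fault.
    shadow[gva_page] = p2m[guest_pt[gva_page]]
    return "shadow entry filled, retry"


guest_pt = {0x10: 0x2}   # guest virtual page 0x10 -> guest frame 0x2
p2m = {0x2: 0x7F}        # guest frame 0x2 -> host frame 0x7F
shadow = {}
print(handle_fault(0x10, guest_pt, p2m, shadow), shadow)
print(handle_fault(0x11, guest_pt, p2m, shadow))
```

Only the shadow dictionary corresponds to what the hardware actually walks; the guest table is consulted in software by the hypervisor.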
Hardware assisted Type 2 Hypervisors
Hardware assisted type 2 is a further step up in performance. It makes the situation a lot less convoluted, unifies it into a single interface and automates lots of the makeshift cr3 juggling and administrative tasks that needed to be improvised, making it a lot cleaner. The kernel driver only needs to execute vmxon and wait for guests to register with the driver; then all VM exit events will be handled by a unified handler at a RIP and CR3 it inserts into the VMCS host state (meaning the handler stub does not need to be mapped in the guest kernel virtual address space). It is specifically designed for this, unlike ring 1, which means the recompiler (Code Scanning and Analysis Manager (CSAM) and the Patch Manager (PATM)) is not required. It also has things like TSC scaling and TSC offset fields which can be used by guests which employ the TSC for fairer scheduling. The hypervisor still hooks the clock interrupt to perform I/O updates and if the currently executing thread is the thread for one of its vCPUs, it will need to vmxoff (which will cause a VM exit) and push the address of some reinitialisation sequence in host kernel memory that will vmxon and vmresume the VMCS tied to the vCPU with the guest saved state in it (but with an emulated clock interrupt in place ready to execute, whose code will use RDTSC which will VM exit and the offsets in VMCS can be used by the hypervisor to report a value accounting for time the guest wasn't scheduled in on the host, i.e. to subtract host time away from it to make the host invisible). It doesn't need to change the cr3 because the vmxoff does that automatically so now it can pass it to the host handler to perform the clock interrupt handling procedure for the host OS.
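The "subtract host time away" idea behind the TSC offset comes down to simple arithmetic: the guest sees the host counter plus an offset, and the offset is lowered by however long the guest was descheduled. The class below is a hypothetical sketch of that bookkeeping only; it ignores TSC scaling and does not touch a real VMCS.

```python
MASK64 = (1 << 64) - 1


class GuestClock:
    def __init__(self):
        self.tsc_offset = 0  # value the hypervisor would keep in the TSC offset field

    def on_descheduled(self, host_tsc_out, host_tsc_in):
        """Hide the interval during which the vCPU was not running."""
        self.tsc_offset -= host_tsc_in - host_tsc_out

    def guest_rdtsc(self, host_tsc):
        """What the guest observes when it reads the time stamp counter."""
        return (host_tsc + self.tsc_offset) & MASK64


clock = GuestClock()
before = clock.guest_rdtsc(1000)
clock.on_descheduled(host_tsc_out=1000, host_tsc_in=1600)
after = clock.guest_rdtsc(1700)
print(before, after)
```

From the guest's point of view only 100 ticks passed between the two reads, even though 700 passed on the host.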
If EPT is supported, then the guest chosen physical addresses (cr3, IDTR etc.) and page tables run on the actual hardware in vmx non-root mode. GVAs are translated to HPAs as such: the guest CR3 is used to produce a GPA of the PDPT, which is then run through the whole EPT using the EPTP of the guest to eventually get the HPA of the PDPT, and so on (it's the same process as software virtualisation with the guest page table and the P2M, except the page walk is done on actual page walk hardware, which is faster). When there is a page fault, a vm exit does not occur and the guest chosen IDTR is present so the interrupt gets handled as a non root ring 0 using the guest IDT. The guest can update this mapping and the hypervisor doesn't need to intervene. When the access is reattempted, an EPT fault will cause a VM exit with the EPTP equivalent of cr2 and a pointer to the hypervisors EPTP for the guest. It will then update its mapping and VMRESUME to the RIP of the faulting instruction.

In full emulation the I/O devices, CPU, main memory are virtualized.
No, they are emulated in software. Emulated means that their behavior is completely replicated in software.
But what exactly is full virtualization?
With virtualization, you try to run as much code as you can directly on the hardware to speed up the process.
This is especially a problem with code that has to run in kernel mode, as that could potentially change the global state of the host (the machine the hypervisor or VMM is running on) and thereby affect other virtual machines.

Without either emulation or virtualization, code runs directly on the hardware. Its instructions are executed natively by the CPU, and its I/O accesses directly access the hardware.
Virtualization is when the guest code runs natively at least some of the time, and only traps to host code running outside the virtual-machine (e.g. a hypervisor) for privileged operations or I/O accesses.
To handle these traps (aka VM exits), the VM may actually emulate what the guest was trying to do. E.g. the guest might be running a device driver for a simple network card, but the NIC is implemented purely in software in the VM. If the VM used a pass-through to send the guest's I/O accesses to a real network card on the host, that would be virtualization of that hardware. (Especially if it did it in a way that let multiple guests use it at once; otherwise it's really just giving it to one guest, not virtualizing it.)
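A minimal sketch of that trap-and-emulate pattern follows. Everything here is an illustrative assumption: the two-instruction guest "ISA", the `SoftwareNic` class and the port number `0x300` are made up, and the Python loop merely stands in for guest code that would really run natively until it traps.

```python
class SoftwareNic:
    """Hypothetical NIC implemented purely in software inside the VM."""

    TX_PORT = 0x300  # made-up I/O port for transmitting a value

    def __init__(self):
        self.sent = []

    def handle_out(self, port, value):
        # Emulate what the guest's driver was trying to do.
        if port == self.TX_PORT:
            self.sent.append(value)


def run_guest(instructions, nic):
    """Run guest code, trapping only on I/O accesses."""
    acc = 0
    vm_exits = 0
    for op, arg in instructions:
        if op == "add":      # unprivileged: would execute natively
            acc += arg
        elif op == "out":    # I/O access: traps to the VM (a VM exit)
            vm_exits += 1
            nic.handle_out(arg, acc)
        else:
            raise ValueError(f"unknown guest instruction: {op}")
    return acc, vm_exits


nic = SoftwareNic()
acc, vm_exits = run_guest(
    [("add", 2), ("add", 3), ("out", SoftwareNic.TX_PORT), ("add", 1)], nic
)
```

Only one of the four guest instructions leaves the guest; the rest never involve the VM at all, which is the point of virtualization as opposed to pure emulation.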
Hardware support for virtualization (like Intel's and AMD's separate x86 virtualization extensions) can let the guest do things that would normally affect the whole machine, like modify the memory mappings in a page table. So instead of triggering a VM exit and making the VM figure out what the guest was doing and then modifying things from the outside to achieve the result, the CPU just has an extra translation layer built in. (See the linked wiki article for a much better but longer description of software-based virtualization vs. hardware-assisted virtualization.)
Pure emulation means that guest code never runs natively, and never sees the "real" hardware of the host. An emulator doesn't need privileged access to the host. (Some might want privileged access to the host for device pass-through, or for raw network sockets to let a guest look like it's really attached to the same network as the host).
An ARM emulator running on an x86 host always has to work this way, because the host hardware can't run ARM instructions in the first place.
But you can still emulate an x86 guest on an x86 host, for example. The fact that the guest and host architectures match doesn't mean the emulator has to take advantage of that fact.
For example, BOCHS is an x86 PC emulator written in portable C++. One of its main uses is for debugging bootloaders and OSes.
BOCHS doesn't care if it's running on an x86 host or not. It's just a C++ program that reads binary files (disk images) and draws in a window (contents of guest video memory). As far as the host is concerned, it's not particularly different from a JPG viewer or a game.
Some emulators use binary translation to JIT-compile the guest code into host code, but this is still emulation, not virtualization. See http://wiki.osdev.org/Emulator_Comparison.
BOCHS is relatively slow, since it reads and decodes guest instructions directly, without doing binary translation. But it tries to do this as efficiently as possible. See How Bochs Works Under the Hood for some of the tricks it uses to efficiently keep track of the guest state. Since emulation is the only option for running x86 software on non-x86 hardware, it's useful to have a high-performance emulator. BOCHS has some very smart and experienced emulator developers working on it, notably Darek Mihocka, who has some interesting articles about optimizing emulation on his site.
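To make "reads and decodes guest instructions directly" concrete, here is a toy fetch-decode-execute loop. It is not Bochs code: the three-operand instruction tuples, the opcode names and the `run` function are hypothetical, chosen only to show the shape of an interpreter with no binary translation.

```python
def run(program, max_steps=1000):
    """Interpret a toy guest program one instruction at a time."""
    regs = [0, 0, 0, 0]  # guest register file, kept entirely in host memory
    pc = 0               # guest program counter
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, a, b = program[pc]  # fetch and decode
        pc += 1
        if op == "li":          # load immediate: regs[a] = b
            regs[a] = b & 0xFFFFFFFF
        elif op == "add":       # regs[a] += regs[b], wrapping at 32 bits
            regs[a] = (regs[a] + regs[b]) & 0xFFFFFFFF
        elif op == "jnz":       # jump to instruction b if regs[a] != 0
            if regs[a] != 0:
                pc = b
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


# Sum 5 + 4 + 3 + 2 + 1 into register 0.
program = [
    ("li", 0, 0),           # r0 = 0 (accumulator)
    ("li", 1, 5),           # r1 = 5 (counter)
    ("li", 2, 0xFFFFFFFF),  # r2 = -1 in 32-bit two's complement
    ("add", 0, 1),          # r0 += r1
    ("add", 1, 2),          # r1 -= 1
    ("jnz", 1, 3),          # loop while r1 != 0
    ("halt", 0, 0),
]
final_regs = run(program)
```

Every guest instruction costs many host instructions here (tuple unpacking, branching on the opcode, masking), which is the overhead that binary translation and virtualization try to avoid.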

This is an attempt to answer my own question.
System Virtualization : Understanding IO virtualization and role of hypervisor
Virtualization
Virtualization as a concept enables multiple/diverse applications to co-exist on the same underlying hardware without being aware of each other.
As an example, full blown operating systems such as Windows, Linux, Symbian etc along with their applications can coexist on the same platform. All computing resources are virtualized.
What this means is none of the aforesaid machines have access to physical resources. The only entity having access to physical resources is a program known as Virtual Machine Monitor (aka Hypervisor).
Now this is important. Please read and re-read carefully.
The hypervisor provides a virtualized environment to each of the machines above. Since these machines access NOT the physical hardware BUT virtualized hardware, they are known as Virtual Machines.
As an example, the Windows kernel may want to start a physical timer (System Resource). Assume that the timer is memory-mapped IO. The Windows kernel issues a series of Load/Store instructions on the Timer addresses. In a Non-Virtualized environment, these Load/Store instructions would have resulted in programming of the timer hardware.
However, in a virtualized environment, these Load/Store based accesses of physical resources will result in a trap/fault. The trap is handled by the hypervisor. The hypervisor knows that Windows tried to program the timer. The hypervisor maintains Timer data structures for each of the virtual machines. In this case, the hypervisor updates the timer data structure which it has created for Windows. It then programs the real timer. Any interrupt generated by the timer is handled by the hypervisor first. Data structures of virtual machines are updated and the latter's interrupt service routines are called.
To cut a long story short, Windows did everything that it would have done in a Non-Virtualized environment. In this case, its actions resulted in NOT the real system resource being updated, but virtual resources (The data structures above) getting updated.
Thus all virtual machines think they are accessing the underlying hardware; in reality, unknown to them, all access to physical hardware is mediated by the hypervisor.
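The timer example can be sketched as follows. This is a toy model under loud assumptions: the class and method names are hypothetical, the "real timer" is just a number stored in `hw_period_ms`, and a trapped store is represented by a direct method call.

```python
class VirtualTimer:
    """Per-VM timer state kept by the hypervisor."""

    def __init__(self):
        self.period_ms = None     # None means the guest has not armed it
        self.remaining_ms = None


class Hypervisor:
    def __init__(self, vm_names):
        self.timers = {name: VirtualTimer() for name in vm_names}
        self.hw_period_ms = None  # stands in for the real timer hardware

    def on_timer_store_trap(self, vm, period_ms):
        # The guest's store to the timer register trapped: update its
        # virtual timer, then program the real timer for the nearest deadline.
        timer = self.timers[vm]
        timer.period_ms = period_ms
        timer.remaining_ms = period_ms
        self.hw_period_ms = min(
            t.remaining_ms for t in self.timers.values() if t.period_ms is not None
        )

    def on_hw_timer_interrupt(self, elapsed_ms):
        # The hypervisor sees the interrupt first and decides which guests'
        # interrupt service routines should be called.
        fired = []
        for name, timer in self.timers.items():
            if timer.period_ms is None:
                continue
            timer.remaining_ms -= elapsed_ms
            if timer.remaining_ms <= 0:
                timer.remaining_ms = timer.period_ms
                fired.append(name)
        return fired


hv = Hypervisor(["windows", "linux"])
hv.on_timer_store_trap("windows", 10)
first = hv.on_hw_timer_interrupt(10)   # only Windows has an armed timer
hv.on_timer_store_trap("linux", 4)
second = hv.on_hw_timer_interrupt(4)   # Linux expires, Windows has 6 ms left
```

Neither guest touches `hw_period_ms` directly; each only ever changes its own `VirtualTimer`.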
Everything described above is full/classic virtualization. Most modern CPUs are unfit for classic virtualization. The trap/fault does not apply to all instructions. So the hypervisor is easily bypassed on modern devices.
Here is where para-virtualization comes into being. The sensitive instructions in the source code of virtual machines are replaced by a call to Hypervisor. The load/store snippet above may be replaced by a call such as
Hypervisor_Service(Timer Start, Windows, 10ms);
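As a rough sketch of what such a call might look like on both sides, assuming a hypothetical `Hypervisor.service` entry point and a made-up `"TIMER_START"` operation name (neither comes from a real hypervisor API):

```python
class Hypervisor:
    """Toy hypervisor exposing one explicit service call."""

    def __init__(self):
        self.timers = {}  # per-VM virtual timer periods, in milliseconds

    def service(self, operation, vm, period_ms):
        if operation != "TIMER_START":
            raise ValueError(f"unsupported operation: {operation}")
        self.timers[vm] = period_ms
        return 0  # success


def guest_start_timer(hypervisor, period_ms):
    # Paravirtualized guest kernel code: the load/store sequence on the
    # timer registers has been replaced in the source by an explicit call.
    return hypervisor.service("TIMER_START", "windows", period_ms)


hv = Hypervisor()
status = guest_start_timer(hv, 10)
```

The hypervisor no longer has to rely on a trap and then work out what the guest meant; the guest states its intent directly, at the cost of modifying the guest's source code.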
EMULATION
Emulation is a topic related to virtualization. Imagine a scenario where a program originally compiled for ARM is made to run on ATMEL CPU. The ATMEL CPU runs an Emulator program which interprets each ARM instruction and emulates necessary actions on ATMEL platform. Thus the Emulator provides a virtualized environment.
In this case, virtualization of system resources is NOT performed via trap and execute model.

A more recent response:
From my research, I can say that this is a better response for understanding how the concepts appeared:
The first concept of emulation actually dates back to the first computer, the Colossus. It was used by the British government in 1941 to mimic the functions of the Nazi Enigma code machine. Emulation theory was developed in 1962 and was conceived by three IBM engineers working from three different angles.
Emulation means to mimic the behavior of the target which can be hardware, like the emu8086 emulator, or can be software like emulation of a service from some network port.
You want to imitate the set of functions provided by the target, and you may not be interested in its internal mechanism.
Why would you want that? To control those functions. Why control them? For multiple reasons, which is too large a subject to discuss here. But keep in mind that you want to be behind things.
But such a process is costly for performance: for each emulated instruction, many other instructions are executed. Maybe you are interested in controlling only some of those instructions, so we would like to permit some instructions to be executed natively.
So what happens when all of these instructions are executed natively? Then you have ideal virtualization. You can virtualize any software, but the trend today is to move from virtualization of operating systems to that of applications. I say ideal because this software executes differently on each kind of hardware, so some instructions will still need to be emulated. It is important to understand that most of today's virtualization technologies are not only about virtualization, but also about emulation.
Also notice that in our transition from emulation to virtualization, the range of inputs to the system is reduced, because virtualization accepts only software as input. The controller of this flow of instructions is called the hypervisor.

Virtualization may happen at different layers of a computer architecture, which are (from higher to lower): 1: Application, 2: Library, 3: Operating System, 4: Hardware Abstraction (HAL), 5: Instruction Set Architecture (ISA). Below the latter layer there is the Hardware.
Typically a certain layer utilizes services from a lower layer by using the instructions the lower layer exposes in its interface.
Note that the usage of services is not strictly tied to the layering, in the sense that certain layers can skip the layer immediately below and use instructions from lower layers. As an example, an application may issue certain instructions directly to the HAL layer, skipping the Library and O.S. layers.
To "emulate an instruction" means to intercept and map an instruction intended for a certain layer of a computer architecture (virtual) into a sequence of one or more instructions for the same layer of a different computer architecture (non-virtual).
It is possible to place the virtualization layer at different layers of a Computer Architecture. This point may introduce confusion.
As an example, when virtualizing at the level of the Hardware Abstraction Layer (e.g. VMware, VirtualBox), a virtual layer is placed between the HAL layer and the Operating System layer. The Operating System utilizes instructions of the virtual HAL layer, then certain virtual ISA (Instruction Set Architecture) instructions are mapped by the hypervisor to ISA instructions for the physical system. When ALL the instructions are emulated, we talk about full emulation, which is a special case of virtualization. In virtualization we typically try to make a layer execute instructions of the non-virtual layer directly as much as possible, for performance reasons.
In another example, the virtualization layer is placed over the Operating System (virtualization at Operating System level): in this case a Virtual Machine is called a Container (e.g. Docker). It includes the levels from Application to the O.S. (included).
To conclude, emulation relates to a single instruction, while "full emulation" happens when we intercept and map ALL the instructions of a certain layer.
Typically, the term "full emulation" is used when the virtualization layer is placed at the ISA level (the lowest level possible). In this case a Virtual Machine includes all the levels from the Application to the ISA, and ALL the ISA instructions are intercepted and mapped. This is typically used to virtualize niche products, such as Cisco routers (e.g. with QEMU) or 90's video game consoles, which have a completely different architecture from commonly available computers. Note however that there may be a "full emulation" also at other levels, which is typically not necessary.

Virtualization and Emulation are pretty much the same thing. There is one underlying concept that these two words hint at. That is, these two words are aspects of one thing. This is demonstrated in QEMU, a Quick Emulator that performs hardware virtualization.
You can think of that one thing as Simulation. Simulation can also be a confusing word though.
First we can define the common meaning of the words.
Simulation: Making one thing do what another thing does.
Emulation: Making one system replicate another system exactly.
Virtualization: Allow for running of a system within another system.
Now we show that the words all mean pretty much the same thing. For example, in simulation you are creating a replica of one system with another system. That is the common meaning of emulation. In virtualization, you want to have your virtualized system act like the real system. That is, ideally it acts like a replica, even though it may be implemented differently and may not "emulate" the hardware exactly. That is pretty much the same as simulation. In an emulation, you simulate another system, and so on.
So we can see that the words are somewhat interchangeable. The underlying concept is simulation.
In virtualization, such as operating system virtualization ("virtual machines"), we are creating a system which acts like the operating system. It might use tricks from the underlying hardware, or hypervisors, or other things, for performance and security. But in the end it is just a simulation of an operating system. Typically when the word "virtual machine" is used, it is not an exact replica of the machine (as in an emulator). It just does enough to allow programs to run as you would expect on the real operating system.
In emulation, it is typically meant that the simulation is "exact". In hardware emulation, you replicate all of the features of the hardware system. This means that you have created a simulation of the hardware. You could say that you created a virtualization of the hardware, but here is where virtualization slightly differs. Virtualization implies creating an isolated environment, which emulation doesn't necessarily imply. So a hardware emulator might provide the same interface to the hardware as the hardware itself, but the implementation of the emulator might rely on global memory, so if you try to run two emulators at the same time, they would interfere with each other. This is what virtualization solves, it isolates the simulations.
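The interference problem is easy to reproduce in miniature. In this sketch (all names are hypothetical) one emulator keeps guest RAM in a module-level global, while the other keeps it per instance:

```python
_GLOBAL_RAM = bytearray(16)  # shared by every GlobalStateEmulator


class GlobalStateEmulator:
    """Replicates the interface, but all instances share one RAM."""

    def poke(self, addr, value):
        _GLOBAL_RAM[addr] = value

    def peek(self, addr):
        return _GLOBAL_RAM[addr]


class IsolatedMachine:
    """Same interface, but each instance owns its own RAM."""

    def __init__(self):
        self._ram = bytearray(16)

    def poke(self, addr, value):
        self._ram[addr] = value

    def peek(self, addr):
        return self._ram[addr]


a, b = GlobalStateEmulator(), GlobalStateEmulator()
a.poke(0, 42)
leaked = b.peek(0)      # b observes a's write: the two interfere

c, d = IsolatedMachine(), IsolatedMachine()
c.poke(0, 42)
isolated = d.peek(0)    # d is unaffected by c
```

Both classes emulate the same interface equally well; only the second one provides the isolation that the answer associates with virtualization.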
Hope that helps.

I think it's a common misconception to oppose Virtualization to Emulation when they're not comparable.
What people have in mind when they talk about Virtualization is mostly what type 2 hypervisors do.
According to Wikipedia, virtualization is:
Virtualization or virtualisation (sometimes abbreviated
v12n, a numeronym) is the act of creating a virtual (rather than
actual) version of something, including virtual computer hardware
platforms, storage devices, and computer network resources.
This definition suits both emulation and type 2 hypervisor. Therefore, an Emulator is a subtype of virtualization, and Type 2 Hypervisor is another subtype. Both let you run a virtual machine, but the way they work and the way they're used often differ. Many virtual machines actually rely on both techniques to achieve their goal.
Moreover, emulation doesn't always replicate the original hardware 1:1 (by design and not for lack of documentation). Examples are DOSBox, which simulates a kind of PC that doesn't really exist, and high-level emulators (like the old UltraHLE). This makes the emulator more efficient (but with the risk of breaking compatibility with software). Other emulators also do this for a different purpose: to expand the capabilities of the original hardware (such as Dolphin, which lets you run games in 4K, or PS1 emulators that let you dramatically improve the quality of the 3D, or more recently, a SNES emulator with a modified PPU that can output 16:9 graphics and that's used for a modded Super Mario World patched to run in widescreen).
Some emulators can also use hardware resources like a video card. An example of this is Connectix VirtualPC, an old PC emulator for PowerPC-based Macs. Back then Macs and PCs both had PCI slots, and Connectix VirtualPC gave you the possibility to use a video card that was physically in your Mac (which also existed on PC).
I hope this clarifies things.

Why does syscall need to switch into kernel mode?

I'm studying for my operating systems final and was wondering if someone could tell me why the OS needs to switch into kernel mode for syscalls?

A syscall is used specifically to run an operation in kernel mode, since ordinary user code is not allowed to do this for security reasons.
For example, if you wanted to allocate memory, the operating system is privileged to do it (since it knows the page tables and is allowed to access memory of other processes), but you as a user program should not be allowed to peek or ruin the memory of other processes.
It's a way of sandboxing you. So you send a syscall requesting the operating system to allocate memory, and that happens at the kernel level.
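For a concrete feel, here is a small user-space sketch. It assumes a typical desktop OS where Python's `os.pipe`, `os.write` and `os.read` are thin wrappers around the corresponding system calls; the function name `echo_through_kernel` is made up for illustration.

```python
import os


def echo_through_kernel(payload: bytes) -> bytes:
    """Send bytes through a kernel-managed pipe and read them back."""
    # Each os.* call below asks the kernel to act on our behalf; the pipe's
    # buffer lives in kernel memory that this process cannot touch directly.
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, payload)
        return os.read(read_fd, len(payload))
    finally:
        os.close(read_fd)
        os.close(write_fd)


result = echo_through_kernel(b"hello")
```

The program never manipulates the pipe's storage itself. It can only request operations by file descriptor, and the kernel decides whether to honor them.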
Edit: I see now that the Wikipedia article is surprisingly useful on this

Since this is tagged "homework", I won't just give the answer away but will provide a hint:
The kernel is responsible for accessing the hardware of the computer and ensuring that applications don't step on one another. What would happen if any application could access a hardware device (say, the hard drive) without the cooperation of the kernel?
