I'm new to operating systems and was wondering if the kernel ever needs to access user data and if so why?
The primary reason is that the calling conventions used for the kernel's system calls cannot pass an arbitrary amount of data directly; to work around that restriction, some system calls expect a pointer to the actual data.
For a common example, the system call to open a file might take the address of a file name string (e.g. a "char *file_name"), and then the kernel (after carefully validating the address provided by user space, for security reasons) has to read the file name from user space.
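As a rough illustration, here is what such a handler might look like in a Linux-style kernel. The function my_sys_open is made up for this example; strncpy_from_user is a real Linux helper that copies a string from a user-space address after validating it:

    /* Illustrative only: a simplified "open"-style system call handler
     * that copies a file name out of user space before using it. */
    #include <linux/uaccess.h>   /* strncpy_from_user() */
    #include <linux/limits.h>    /* PATH_MAX */
    #include <linux/errno.h>

    long my_sys_open(const char __user *file_name)
    {
        /* A real kernel would allocate this buffer rather than put
         * PATH_MAX bytes on its small kernel stack. */
        char kname[PATH_MAX];
        long len;

        /* Copy at most PATH_MAX bytes; returns -EFAULT if the
         * user-space pointer is invalid. */
        len = strncpy_from_user(kname, file_name, PATH_MAX);
        if (len < 0)
            return len;            /* bad user pointer */
        if (len == PATH_MAX)
            return -ENAMETOOLONG;  /* name too long / not terminated in range */

        /* ... look up kname in the filesystem, allocate a file
         * descriptor, and so on ... */
        return 0;
    }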
With which binding schemes can the concept of paging in memory management be used?
By binding, I mean "mapping logical addresses to physical addresses". To my knowledge there are three binding schemes: compile-time, load-time and execution-time binding.
Paging is not involved in compiling, so we can rule that out.
Load time can have two meanings: combining the object modules of a program and libraries to produce an executable image with no unresolved symbols (the Unix definition), or transferring a program into memory so it may execute (the non-Unix definition).
What Unix calls loading, some other systems call link editing.
Unix loading/link editing is really part of compiling, so it doesn't involve paging at all. This operation does need to know the valid program addresses it can assign, which will permit the program to load. Conventionally these range from 0 to a very large number like 2^31 or 2^47.
Transferring an image to memory and executing it can be considered either as phases of the same thing or, in demand-loading environments, as exactly the same thing. Either way, the bit of the system that prepares the program address space has to fill out a set of tables which relate a program address to a physical address.
The program address of main() might be 0x12345; which might be viewed as offset 0x345 from page 0x12. The operating system might attach that to physical page 0x100, meaning that main() might temporarily be at 0x100345. Temporarily, because the operating system is free to change this relation (conventionally called a mapping) at any time.
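The arithmetic behind that example, assuming 4 KiB (0x1000-byte) pages, looks like this (the page_to_frame function is a toy stand-in for whatever mapping table the OS really keeps):

    /* Sketch of the page-number/offset arithmetic described above. */
    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                   /* log2(4096) */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    /* Toy "page table": program page -> physical page frame. */
    static uint32_t page_to_frame(uint32_t page)
    {
        return (page == 0x12) ? 0x100 : 0;  /* only one mapping in this toy */
    }

    int main(void)
    {
        uint32_t vaddr  = 0x12345;
        uint32_t page   = vaddr >> PAGE_SHIFT;        /* 0x12  */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);    /* 0x345 */
        uint32_t paddr  = (page_to_frame(page) << PAGE_SHIFT) | offset;

        /* Prints: page 0x12, offset 0x345 -> physical 0x100345 */
        printf("page 0x%x, offset 0x%x -> physical 0x%x\n", page, offset, paddr);
        return 0;
    }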
The dynamic nature of these mappings is a positive attribute of paging, as it permits the system to reformulate its use of physical memory to meet changing demands.
Most POSIX named objects (or all?) have unlink functions, e.g.:
shm_unlink
mq_unlink
They all have in common that they remove the name of the object from the system, causing subsequent opens to fail or create a new object.
Why is this designed like this? I know this is connected to the "everything is a file" policy, but why not delete the file on close? Would you design it the same way if you were creating a new interface?
I think this has a big drawback. Say we have a server process and several client processes. If any process unlinks the object (by mistake), all new clients would not find the server. (This can be prevented by user permissions on the corresponding file, but still...)
Would it not be better if it had reference counting and the name were removed automatically when the last object is closed? Why would you want to keep it open?
Because they are low-level tools that may be used when performance matters. Deleting the object when it is no longer in use, only to create it again on the next use, carries a (slight) performance penalty compared with keeping it alive.
I once used a named semaphore to synchronize access to a spool shared by various producers and consumers. An init module, run as part of the boot process, created the named semaphore, and all other processes knew that the well-known semaphore should already exist.
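A minimal sketch of that pattern (the semaphore name "/spool_sem" and the function names are made up for illustration):

    #include <fcntl.h>       /* O_CREAT, O_EXCL */
    #include <semaphore.h>
    #include <stdio.h>

    /* Run once at boot by the "init" module: create the well-known semaphore. */
    int create_spool_sem(void)
    {
        sem_t *sem = sem_open("/spool_sem", O_CREAT | O_EXCL, 0660, 1);
        if (sem == SEM_FAILED) {
            perror("sem_open(create)");
            return -1;
        }
        sem_close(sem);      /* the name persists; no sem_unlink() here */
        return 0;
    }

    /* Run by every producer/consumer: the name is expected to exist already. */
    sem_t *attach_spool_sem(void)
    {
        sem_t *sem = sem_open("/spool_sem", 0);   /* no O_CREAT */
        if (sem == SEM_FAILED)
            perror("sem_open(attach)");
        return sem;
    }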
If you want a more programmer-friendly interface that creates the object on demand and destroys it when it is no longer used, you can build a higher-level library and encapsulate the creation/unlink operations in it. But if the system call did it for you, it would not be possible to build a user-level library that avoids it.
Would it not be better, if it had reference counting and the name would be removed automatically when the last object is closed?
No.
Because unlink() can fail, and because always unlinking a resource that can be shared between processes whenever all processes merely close that resource simply doesn't fit the paradigm of a shared resource.
You don't demolish a roller coaster just because there's no one waiting in line to ride it again at that moment in time.
This isn't quite a question about a specific OS, but let's take Windows as an example. A userspace program uses the Windows API to communicate with kernelspace. However, I don't understand how that's possible. The API, according to MS websites, lives in userspace. In order to access kernelspace it has to be in kernelspace, if I understand it correctly. So what is the mechanism by which the windows API gets extra privileges to speak to kernelspace? In which space does that mechanism operate? Is this sort of thing universal to all modern PC OS's?
As you're already aware, there are a bunch of facilities exposed to userspace programs by the Windows kernel. (If you're curious, there's a list of system calls.) These system calls are all identified by a unique number, which isn't part of the publicly documented interface given by Microsoft. Instead, when you call a publicly exposed function from your program, there's a DLL (installed when you install or update Windows) whose entry point is just a normal, unprivileged user-mode function call. This DLL knows the mappings between the public interfaces and the system calls available in the currently running kernel. These mappings are not always 1:1, which allows for tweaks and enhancements without breaking existing code that uses the stable interfaces.
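Purely as an illustration of the idea (this is not the actual Windows code; the names MyNtOpenFileStub, enter_kernel and the number 0x33 are invented), such a user-mode stub conceptually does little more than pick the syscall number that matches the running kernel and trap into it:

    /* Purely illustrative pseudo-stub; real stubs live in ntdll.dll,
     * are written in assembly, and the numbers vary between builds. */
    typedef long NTSTATUS;

    /* Filled in for the kernel version that is actually running. */
    static unsigned int syscall_number_for_open_file = 0x33;  /* made-up value */

    /* Hypothetical helper that puts the number in the right register and
     * executes the architecture-specific trap (SYSENTER, SYSCALL, SVC, ...). */
    extern NTSTATUS enter_kernel(unsigned int number, void *args);

    NTSTATUS MyNtOpenFileStub(void *args)
    {
        /* Nothing privileged happens here; this is ordinary user-mode code. */
        return enter_kernel(syscall_number_for_open_file, args);
    }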
When some userland code calls one of these functions, its role is to prepare arguments for the system call and then initiate the jump into kernel mode. How exactly that jump occurs is specific to the architecture that Windows is currently running on. In fact it varies not just between x86 and Arm but even between AMD and Intel x86 systems. I'll talk just about the modern Intel x86 32-bit case (using the SYSENTER instruction) here for simplicity. On x86 most of the other variations are relatively minor; for instance, int 2Eh was used prior to SYSENTER support.
Early in boot up the operating system does a bunch of work to prepare for enabling a userland and system calls from it. Understanding this is critical to understanding how system calls really work.
First let's rewind a little and consider what exactly we mean by userland and kernelmode. On x86 when we talk about privileged vs un-privileged code we talk about "rings". There are actually 4 (ignoring hypervisors) but for various reasons nobody really used anything but ring0 (kernel) and ring3 (userland). When we run code on x86 the address that's being executed (EIP) and data that's being read/written come from segments.
Segments are mostly just a historical accident left over from the days before virtual addressing on x86 was a thing. They are however important for us here because there are special registers that define which segments are currently being used when we execute instructions or otherwise reference memory. Segments on x86 are all defined in a big table, called the Global Descriptor Table or GDT. (There's also a local descriptor table, the LDT, but that's not going to further the current discussion.) The important point here is that the (arcane) layout of the table entries includes 2 bits, called the DPL, which define the privilege level of the currently active segment. You'll notice that 2 bits is exactly enough to define 4 levels of privilege.
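As a rough picture of where those two DPL bits sit, here is one 8-byte GDT entry written out as a C struct (purely illustrative; bit-field layout is compiler-dependent, and real kernels typically build these entries with shifts and masks):

    #include <stdint.h>

    /* One 8-byte x86 segment descriptor, field by field. */
    struct gdt_entry {
        uint64_t limit_low  : 16;  /* segment limit, bits 0-15      */
        uint64_t base_low   : 24;  /* segment base, bits 0-23       */
        uint64_t type       : 4;   /* code/data, expand, read/write */
        uint64_t s          : 1;   /* 1 = code or data segment      */
        uint64_t dpl        : 2;   /* privilege: 0 = kernel, 3 = user */
        uint64_t present    : 1;
        uint64_t limit_high : 4;   /* segment limit, bits 16-19     */
        uint64_t flags      : 4;   /* AVL, L, D/B, G                */
        uint64_t base_high  : 8;   /* segment base, bits 24-31      */
    };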
So in short when we talk about "executing in kernel mode" we really just mean that our active code segment (CS) and data segment selectors point to entries in the GDT which have DPL set to 0. Likewise for userland we have a CS and data segment selectors pointing to GDT entries with DPL set to 3 and no access to kernel addresses. (There are other selectors too, but to keep it simple we'll just consider "code" and "data" for now).
Back to early kernel boot-up: during start-up the kernel creates the GDT entries we need. (These have to be laid out in a specific order for SYSENTER to work, but that's mostly just an implementation detail.) There are also some "model-specific registers" (MSRs) that control how our processor behaves. These can only be set by privileged code. Three of them that are important here are listed below, followed by a sketch of how a kernel might program them:
IA32_SYSENTER_ESP
IA32_SYSENTER_EIP
IA32_SYSENTER_CS
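A minimal sketch of what that boot-time setup might look like (the MSR indices 0x174-0x176 are the architectural ones for these three registers; the names setup_sysenter, syscall_entry, kernel_syscall_stack_top and KERNEL_CS are assumptions for illustration):

    /* Illustrative 32-bit kernel code: program the SYSENTER MSRs once,
     * early in boot, on each CPU. */
    #include <stdint.h>

    #define MSR_IA32_SYSENTER_CS   0x174
    #define MSR_IA32_SYSENTER_ESP  0x175
    #define MSR_IA32_SYSENTER_EIP  0x176

    /* wrmsr: write a 64-bit value (edx:eax) to a model-specific register. */
    static inline void wrmsr(uint32_t msr, uint64_t value)
    {
        __asm__ volatile ("wrmsr"
                          : /* no outputs */
                          : "c"(msr), "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
    }

    /* Assumed symbols: the ring-0 code selector, a per-CPU kernel stack,
     * and the routine all SYSENTERs should land on. */
    extern void    syscall_entry(void);       /* e.g. KiFastCallEntry on Windows */
    extern uint8_t kernel_syscall_stack_top[];
    #define KERNEL_CS 0x08                    /* ring-0 code selector in the GDT */

    void setup_sysenter(void)
    {
        wrmsr(MSR_IA32_SYSENTER_CS,  KERNEL_CS);
        wrmsr(MSR_IA32_SYSENTER_ESP, (uintptr_t)kernel_syscall_stack_top);
        wrmsr(MSR_IA32_SYSENTER_EIP, (uintptr_t)syscall_entry);
    }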
Recall that we've got some code running in userland (ring3) that wants to transition to ring0. Let's assume that it has saved any registers it needs to per the calling convention and put arguments into the registers the call expects. We then hit the SYSENTER instruction. (Actually it uses KiFastSystemCall, I think.) The SYSENTER instruction is special. It modifies the current code and data segment selectors based on the value that the kernel set up in the model-specific register IA32_SYSENTER_CS. (The stack/data segment values are computed as an offset from IA32_SYSENTER_CS.) Subsequently the stack pointer (ESP) is set to the kernel stack that was set up for handling system calls earlier on and saved into the MSR IA32_SYSENTER_ESP, and likewise the instruction pointer (EIP) is loaded from IA32_SYSENTER_EIP.
Since the CS selector now points to a GDT entry with DPL set to 0 and EIP points to kernel mode code on a kernel stack we're running in the kernel at this point.
From here onwards the kernel mode code can read and write memory from both kernel and userspace (with some appropriate caution) to undertake the actual work needed to perform the system call. The arguments to the system call can be read from registers etc. according to the calling convention, but any arguments that are actually pointers back to userland or handles to kernel objects can be accessed to read larger blocks of data too.
When the system call is over the process is basically reversed and we end up back in userland with DPL 3 for the selectors.
It's the CPU that acts as the intermediary for the transfer of information between user memory space (accessible in user mode) and protected memory space (accessible in kernel mode), via CPU registers.
Here's an example:
Suppose a user writes a program in a higher-level language. When the program executes, the CPU generates virtual addresses.
Before any read/write operation occurs, each virtual address is converted to a physical address. The translation is done using page tables that live in protected memory and can only be set up by the kernel; once the memory management unit has produced the physical address, the read/write operation proceeds.
When a page table entry (PTE) is not marked as valid, it means the data needed is not in memory but on disk. So a page fault happens, and the OS is responsible for loading this page of data from disk into memory.
My question is, how does the OS know the exact disk address?
You are asking in a system-dependent manner. A PTE not marked as valid may mean the address does not exist at all in the process address space. A system may have another bit to indicate that the address is valid but the logical-to-physical mapping does not exist.
The operating system needs to maintain a table of where it put the data.
The data can exist in a number of places (a rough sketch of how a fault handler might choose between them follows this list):
1. It might be uninitialized data that has no mapping anywhere. Respond to the page fault by clearing a physical page and mapping it into the process address space.
2. It might be in the page file.
3. Some systems have a separate swap file; it might be there instead.
4. It might be in the executable or shared library file.
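A minimal, purely illustrative sketch of how a fault handler might use such a table (the types and helper names are made up; real kernels are far more involved):

    /* Illustrative only: per-page bookkeeping kept by the OS, outside the
     * hardware page table, saying where the data really lives. */
    #include <stdint.h>
    #include <stddef.h>

    enum backing_kind {
        BACKING_NONE,        /* uninitialized/anonymous: hand out a zeroed page */
        BACKING_PAGEFILE,    /* saved to the page/swap file at some offset      */
        BACKING_IMAGE_FILE   /* part of the executable or a shared library      */
    };

    struct backing_info {
        enum backing_kind kind;
        uint64_t          file_offset;  /* meaningful for the file-backed kinds */
        int               file_id;      /* which file (page file, image, ...)   */
    };

    /* Hypothetical helpers supplied by the rest of the kernel. */
    extern struct backing_info *lookup_backing(uintptr_t fault_addr);
    extern void *alloc_zeroed_page(void);
    extern void *read_page_from(int file_id, uint64_t offset);
    extern void  map_page(uintptr_t fault_addr, void *phys_page);

    void handle_page_fault(uintptr_t fault_addr)
    {
        struct backing_info *b = lookup_backing(fault_addr);
        void *page = NULL;

        if (b == NULL) {
            /* Address is not part of the process at all: report an error
             * (e.g. deliver SIGSEGV) instead of fixing up the mapping. */
            return;
        }

        switch (b->kind) {
        case BACKING_NONE:
            page = alloc_zeroed_page();                           /* case 1 */
            break;
        case BACKING_PAGEFILE:
        case BACKING_IMAGE_FILE:
            page = read_page_from(b->file_id, b->file_offset);    /* cases 2-4 */
            break;
        }

        map_page(fault_addr, page);   /* update the PTE, then retry the access */
    }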
The answer given in 2014 is correct. All the processor knows is that the page is missing - or sometimes that it had incorrect permission (e.g., write to a read-only page). At that point the processor generates a "page fault" exception which the kernel gets and now has to handle.
In some cases, this page fault will need to be passed all the way to the application, in Linux as a SIGSEGV ("segmentation violation") signal, e.g., when the application uses a null pointer. But, as you said, more usually the kernel should and can handle the page fault. The kernel keeps, in its own tables (not inside the page table, which is a structure with a specific format dictated by the processor), information about what each virtual-memory page is supposed to contain. The following are some of the things the kernel may learn about the faulting page by consulting its own tables. This is not an exhaustive list.
This might be a page mmap()ed from disk. This case includes an application's explicit use of mmap(), but also happens when you run an executable or use shared libraries - those are also mapped from disk - so the page fault can also happen when the processor executes instructions, not just when it reads and writes data. The kernel keeps a list of these mappings, so when it gets the page fault it can figure out where on disk it needs to read to get the missing page. It reads the data from disk, puts it in a new page in memory, sets the page table entry (PTE) to point to this new page, and resumes the application thread - the faulting instruction is retried and now succeeds.
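For a concrete, user-level illustration of that first case, the following program (which assumes some existing file named "data.bin") maps a file and touches one byte per page; each first touch of a page triggers exactly this kind of fault, which the kernel services transparently:

    /* Demand paging from user space: the mmap() call itself reads nothing;
     * pages are faulted in from the file the first time they are touched. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);     /* "data.bin" is an assumed file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Each access to a page not yet in memory causes a page fault;
         * the kernel reads that page from the file and retries the access. */
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)
            sum += p[i];

        printf("touched %ld pages, sum of first bytes %lu\n",
               (long)((st.st_size + 4095) / 4096), sum);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }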
This may have been a page swapped out to disk. Again, the kernel keeps a table of which pages were swapped out, and where in the swap partition (or swap file, or whatever) this page now lives.
This might have been a write attempt to a "copy on write" page. The kernel needs to make a copy of the source page, change the address in the page table to point to the new copy, and then allow the write. For example, when you allocate a large area of memory, it can initially point to an existing zero-filled page, and a private page is only allocated when you first write to it. Another example: after fork() the new process's pages are all copy-on-write pages pointing to the original process's pages, and will only actually be copied when first written (by either process).
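A small user-level program shows the visible effect of copy-on-write after fork(): both processes start out sharing the same pages, and each process only gets its own private copy of a page once it writes to it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* One heap value, shared copy-on-write after the fork. */
        int *value = malloc(sizeof *value);
        *value = 1;

        pid_t pid = fork();
        if (pid == 0) {
            *value = 2;                              /* COW fault: child gets its own page */
            printf("child  sees %d\n", *value);      /* 2 */
            exit(0);
        }
        waitpid(pid, NULL, 0);
        printf("parent sees %d\n", *value);          /* still 1 */
        free(value);
        return 0;
    }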
However, since you are looking for credible sources, maybe you want to read an explanation of how the Linux kernel, specifically, does this, for example in:
https://vistech.net/~champ/online-docs/books/linuxkernel2/060.htm.
It is the same as virtual memory addressing.
The addresses that appear in programs are the virtual addresses or program addresses. For every memory access, either to fetch an instruction or data, the CPU must translate the virtual address to a real physical address. A virtual memory address can be considered to be composed of two parts: a page number and an offset into the page. The page number determines which page contains the information and the offset specifies which byte within the page. The size of the offset field is the log base 2 of the size of a page.
If the virtual address is valid, the system checks to see if a page frame is free. If no frames are free, the page replacement algorithm is run to remove a page.
I have read a lot about virtual addresses and paging.
Let me first tell you what I understood. When a process wants to execute something, it tries to load the data from the hard disk into memory. To do this it uses a virtual address. The MMU validates the virtual address and looks in the TLB to find the corresponding physical page; if it doesn't find it there, it looks in the inverted page table, and finally it looks in the page table; if it doesn't find an entry there either, it generates a page fault, the page is swapped in, and all the tables are updated.
And as I read, all processes have different page tables and the same virtual address space. So if I try to access an array element a[1000] when the array was defined as int a[100], I expect to get a segmentation fault, because that instruction might be trying to access memory that doesn't belong to the process. But how does the OS come to know that a[1000] doesn't belong to the running process just by using the concept of virtual addresses and physical pages? Am I missing something here, or is my entire understanding wrong?
I know we can say a memory access is illegal if a process is trying to access a read-only or supervisor-only memory segment.
In the end, the burning question is how the OS decides which memory is allocated to which process, and how it decides that a given memory access is illegal.
What is a segmentation fault on Linux?
This link didn't help much.
thanks a lot in advance for all you lovely people's inputs :)
Actually, each and every process in the Linux kernel has a well-defined structure associated with it called task_struct, which stores all the information about the corresponding process: its parent, its PID, its children, its address space, pending signals, the threads associated with it, etc. The address-space entries tell the kernel, while it executes the process, what the legal address space for that process is. Every process has its own address space allotted to it by the kernel right from the very beginning, when it is created. So when an executing process tries to access space outside its legal memory, a fault is generated (called a segmentation fault in Unix/Linux) and the process is terminated by a signal sent to it by the kernel. This is important for achieving memory protection in an OS.
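A rough sketch of the check the fault handler performs (the structures and helpers below are simplified stand-ins for Linux's real mm_struct / vm_area_struct machinery):

    /* Simplified model: a process's legal address space is a list of
     * ranges (in Linux, vm_area_structs hanging off the task's mm). */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    struct vm_area {
        uintptr_t       start, end;    /* [start, end) is a legal range */
        struct vm_area *next;
    };

    /* Is this faulting address inside any region the process may use? */
    static bool address_is_legal(const struct vm_area *areas, uintptr_t addr)
    {
        for (const struct vm_area *a = areas; a != NULL; a = a->next)
            if (addr >= a->start && addr < a->end)
                return true;
        return false;
    }

    void on_page_fault(const struct vm_area *areas, uintptr_t fault_addr)
    {
        if (address_is_legal(areas, fault_addr)) {
            /* ... bring the page in, update the page table, resume ... */
        } else {
            /* ... send SIGSEGV: this is the "segmentation fault" ... */
        }
    }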
On x86, Linux uses a combination of segmentation and paging, so the address generated by the program is first translated through the corresponding segment's base and limit values. This gives the linear (virtual) address, which is then translated using the page table. When you try to access memory that has not been allocated to your process, that address has no valid mapping (or exceeds the segment limit), so the check fails and a segmentation fault is generated.