A trivial SYSENTER/SYSCALL question

A trivial SYSENTER/SYSCALL question - operating-system

If a Windows executable makes use of SYSENTER and is executed on a processor implementing AMD64 ISA, what happens? I am both new and newbie to this topic (OSes, hardware/software interaction) but from what I've read I have understood that SYSCALL is the AMD64 equivalent to Intel's SYSENTER. Hopefully this question makes sense.

If you try to use SYSENTER where it is not supported, you'll probably get an "invalid opcode" exception.
Note that this situation is unusual - generally, Windows executables do not directly contain instructions to enter kernel mode.

As far as i know AM64 processors using different type of modes to handle such issues.
SYSENTER works fine but is not that fast.
A very useful site to get started about the different modes:
Wikipedia

They got rid of a bunch of unused functionality when they developed AMD64 extensions. One of the main ones is the elimination of the cs, ds, es, and ss segment registers. Normally loading segment registers is an extremely expensive operation (the CPU has to do permission checks, which could involve multiple memory accesses). Entering kernel mode requires loading new segment register values.
The SYSENTER instruction accelerates this by having a set of "shadow registers" which is can copy directly to the (internal, hidden) segment descriptors without doing any permission checks. The vast majority of the benefit is lost with only a couple of segment registers, so most likely the reasoning for removing the support for the instructions is that using regular instructions for the mode switch is faster.

Related

how does the operating system treat few interrupts and keep processes going?

I'm learning computer organization and structure (I'm using Linux OS with x86-64 architecture). we've studied that when an interrupt occurs in user mode, the OS is notified and it switches between the user stack and the kernel stack by loading the kernels rsp from the TSS, afterwards it saves the necessary registers (such as rip) and in case of software interrupt it also saves the error-code. in the end, just before jumping to the adequate handler routine it zeroes the TF and in case of hardware interrupt it zeroes the IF also. I wanted to ask about few things:
the error code is save in the rip, so why loading both?
if I consider a case where few interrupts happen together which causes the IF and TF to turn on, if I zero the TF and IF, but I treat only one interrupt at a time, aren't I leave all the other interrupts untreated? in general, how does the OS treat few interrupts that occur at the same time when using the method of IDT with specific vector for each interrupt?
does this happen because each program has it's own virtual memory and thus the interruption handling processes of all the programs are unrelated? where can i read more about it?
how does an operating system keep other necessary progresses running while handling the interrupt?
thank you very much for your time and attention!

the error code is save in the rip, so why loading both?
You're misunderstanding some things about the error code. Specifically:
it's not generated by software interrupts (e.g. instructions like int 0x80)
it is generated by some exceptions (page fault, general protection fault, double fault, etc).
the error code (if used) is not saved in the RIP, it's pushed on the stack so that the exception handler can use it to get more information about the cause of the exception
2a. if I consider a case where few interrupts happen together which causes the IF and TF to turn on, if I zero the TF and IF, but I treat only one interrupt at a time, aren't I leave all the other interrupts untreated?
When the IF flag is clear, mask-able IRQs (which doesn't include other types of interrupts - software interrupts, exceptions) are postponed (not disabled) until the IF flag is set again. They're "temporarily untreated" until they're treated later.
The TF flag only matters for debugging (e.g. single-step debugging, where you want the CPU to generate a trap after every instruction executed). It's only cleared in case the process (in user-space) was being debugged, so that you don't accidentally continue debugging the kernel itself; but most processes aren't being debugged like this so most of the time the TF flag is already clear (and clearing it when it's already clear doesn't really do anything).
2b. in general, how does the OS treat few interrupts that occur at the same time when using the method of IDT with specific vector for each interrupt? does this happen because each program has it's own virtual memory and thus the interruption handling processes of all the programs are unrelated? where can i read more about it?
There's complex rules that determine when an interrupt can interrupt (including when it can interrupt another interrupt). These rules mostly only apply to IRQs (not software interrupts that the kernel won't ever use itself, and not exceptions which are taken as soon as they occur). Understanding the rules means understanding the IF flag and the interrupt controller (e.g. how interrupt vectors and the "task priority register" in the local APIC influence the "processor priority register" in the local APIC, which determines which groups of IRQs will be postponed when the IF flag is set). Information about this can be obtained from Intel's manuals, but how Linux uses it can only be obtained from Linux source code and/or Linux specific documentation.
On top of that there's "whatever mechanisms and practices the OS felt like adding on top" (e.g. deferred procedure calls, tasklets, softIRQs, additional stack management) that add more complications (which can also only be obtained from Linux source code and/or Linux specific documentation).
Note: I'm not a Linux kernel developer so can't/won't provide links to places to look for Linux specific documentation.
how does an operating system keep other necessary progresses running while handling the interrupt?
A single CPU can't run 2 different pieces of code (e.g. an interrupt handler and user-space code) at the same time. Instead it runs them one at a time (e.g. runs user-space code, then switches to an IRQ handler for very short amount of time, then returns to the user-space code). Because the IRQ handler only runs for a very short amount of time it creates the illusion that everything is happening at the same time (even though it's not).
Of course when you have multiple CPUs, different CPUs can/do run different pieces of code at the same time.

What exactly happens when an OS goes into kernel mode?

I find that neither my textbooks or my googling skills give me a proper answer to this question. I know it depends on the operating system, but on a general note: what happens and why?
My textbook says that a system call causes the OS to go into kernel mode, given that it's not already there. This is needed because the kernel mode is what has control over I/O-devices and other things outside of a specific process' adress space. But if I understand it correctly, a switch to kernel mode does not necessarily mean a process context switch (where you save the current state of the process elsewhere than the CPU so that some other process can run).
Why is this? I was kinda thinking that some "admin"-process was switched in and took care of the system call from the process and sent the result to the process' address space, but I guess I'm wrong. I can't seem to grasp what ACTUALLY is happening in a switch to and from kernel mode and how this affects a process' ability to operate on I/O-devices.
Thanks alot :)
EDIT: bonus question: does a library call necessarily end up in a system call? If no, do you have any examples of library calls that do not end up in system calls? If yes, why do we have library calls?

Historically system calls have been issued with interrupts. Linux used the 0x80 vector and Windows used the 0x2F vector to access system calls and stored the function's index in the eax register. More recently, we started using the SYSENTER and SYSEXIT instructions. User applications run in Ring3 or userspace/usermode. The CPU is very tricky here and switching from kernel mode to user mode requires special care. It actually involves fooling the CPU to think it was from usermode when issuing a special instruction called iret. The only way to get back from usermode to kernelmode is via an interrupt or the already mentioned SYSENTER/EXIT instruction pairs. They both use a special structure called the TaskStateSegment or TSS for short. These allows to the CPU to find where the kernel's stack is, so yes, it essentially requires a task switch.
But what really happens?
When you issue an system call, the CPU looks for the TSS, gets its esp0 value, which is the kernel's stack pointer and places it into esp. The CPU then looks up the interrupt vector's index in another special structure the InterruptDescriptorTable or IDT for short, and finds an address. This address is where the function that handles the system call is. The CPU pushes the flags register, the code segment, the user's stack and the instruction pointer for the next instruction that is after the int instruction. After the systemcall has been serviced, the kernel issues an iret. Then the CPU returns back to usermode and your application continues as normal.
Do all library calls end in system calls?
Well most of them do, but there are some which don't. For example take a look at memcpy and the rest.

Kernel Code vs User Code

Here's a passage from the book
When executing kernel code, the system is in kernel-space execut-
ing in kernel mode.When running a regular process, the system is in user-space executing
in user mode.
Now what really is a kernel code and user code. Can someone explain with example?
Say i have an application that does printf("HelloWorld") now , while executing this application, will it be a user code, or kernel code.
I guess that at some point of time, user-code will switch into the kernel mode and kernel code will take over, but I guess that's not always the case since I came across this
For example, the open() library function does little except call the open() system call.
Still other C library functions, such as strcpy(), should (one hopes) make no direct use
of the kernel at all.
If it does not make use of the kernel, then how does it make everything work?
Can someone please explain the whole thing in a lucid way.

There isn't much difference between kernel and user code as such, code is code. It's just that the code that executes in kernel mode (kernel code) can (and does) contain instructions only executable in kernel mode. In user mode such instructions can't be executed (not allowed there for reliability and security reasons), they typically cause exceptions and lead to process termination as a result of that.
I/O, especially with external devices other than the RAM, is usually performed by the OS somehow and system calls are the entry points to get to the code that does the I/O. So, open() and printf() use system calls to exercise that code in the I/O device drivers somewhere in the kernel. The whole point of a general-purpose OS is to hide from you, the user or the programmer, the differences in the hardware, so you don't need to know or think about accessing this kind of network card or that kind of display or disk.
Memory accesses, OTOH, most of the time can just happen without the OS' intervention. And strcpy() works as is: read a byte of memory, write a byte of memory, oh, was it a zero byte, btw? repeat if it wasn't, stop if it was.
I said "most of the time" because there's often page translation and virtual memory involved and memory accesses may result in switched into the kernel, so the kernel can load something from the disk into the memory and let the accessing instruction that's caused the switch continue.

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
wherease for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously the ia32/amd64 code in question is using
prefetchnta
Not
prefetchnt1
as it would if that code were to be consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now dead windows and linux ia64 ports).
Since we are building with the intel compiler I should be able to many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?

Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instructions for writes
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is exclusively used by the calling thread, it shouldn't matter how you bring the line, both reads and writes would be able to use it. The benefit for prefetchw mentioned above is that it will bring the line and give you ownership on it, which may take a while if the line was also used by another core. The hint level on the other hand is orthogonal with the MESI states, and only affects how long would the prefetched line survive. This matters if you prefetch long ahead of the actual access and don't want to prefetch to get lost in that duration, or alternatively - prefetch right before the access, and don't want the prefetches to thrash your cache too much.
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating - perhaps the larger caches and aggressive memory BW are more vulnerable to bad prefetching and you'd want to reduce the impact through the non-temporal hint. Consider that your prefetcher is suddenly set loose to fetch anything it can, you'd end up swamped in junk prefetches that would through away lots of useful cachelines. The NTA hint makes them overrun each other, leaving the rest undamaged.
Of course this may also be just a bug, I can't tell for sure, only whoever developed the compiler, but it might make sense for the reason above.

The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be those that are non-temporal aligned, where a write can bypass the cache but as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long term control flow patterns, etc... and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

Basic question regarding ROM based executable

I have basic doubt regarding executable stored in ROM.
As I know the executable with text and RO attributes is stored in ROM. Question is as ROM is for Read Only Memory, what happens if there is situation where the code needs to write into memory?
I am not able to conjure up any example to cite here (probably I am ignorant of such situation or I am missing out basic stuff ;) but any light on this topic can greatly help me to understand! :)
Last off -
1. Is there any such situation?
2. In such a case is copying the code from ROM to RAM is the answer?
Answer with some example can greatly help..
Many thanks in advance!
/MS

Read-only memory is read only because of hardware restrictions. The program might be in an EEPROM, flash memory protected from writes, a CD-ROM, or anything where the hardware physically disallows writing. If software writes to ROM, the hardware is incapable of changing the stored data, so nothing happens.
So if a software program in ROM wants to write to memory, it writes to RAM. That's the only option. If a program is running from ROM and wants to change itself, it can't because it can't write to ROM. But yes, the program can run from RAM.
In fact, running from ROM is rare except in the smallest embedded systems. Operating systems copy executable code from ROM to RAM before running it. Sometimes code is compressed in ROM and must be decompressed into RAM before running. If RAM is full, the operating system uses paging to manage it. The reason running from ROM is so rare is because ROM is slower than RAM and sometimes code needs to be changed by the loader before running.
Note that if you have code that modifies itself, you really have to know your system. Many systems use data-execution prevention (DEP). Executable code goes in read+execute areas of RAM. Data goes in read+write areas. So on these systems, code can never change itself in RAM.

Normally only program code, constants and initialisation data are stored in ROM. A separate memory area in RAM is used for stack, heap, etc.

There are few legitimate reasons why you would want to modify the code section at runtime. The compiler itself will not generate code that requires that.
Your linker will have an option to generate a MAP file. This will tell you where all memory objects are located.
The linker chooses where to locate based on a linker script (which you can customise to organise memory as you require). Typically on a FLASH based microcontroller code and constant data will be placed in ROM. Also placed in ROM are the initialisation data for non-zero initialised static data, this is copied to RAM before main() is called. Zero initialised static data is simply cleared to zero before main().
It is possible to arrange for the linker to locate some or all of the code in ROM and have the run-time start-up code copy it to RAM in the same way as the non-zero static data, but the code must either be relocatable or be located to RAM in the first instance, you cannot usually just copy code intended to run from ROM to RAM and expect it to run since it may have absolute address references in it (unless perhaps your target has an MMU and can remap the address space). Locating in RAM on micro-controllers is normally done to increase execution speed since RAM is typically faster than FLASH when high clock speeds are used, producing fewer or zero wait states. It may also be used when code is loaded at runtime from a filesystem rather than stored in ROM. Even when loaded into RAM, if the processor has an MMU it is likely that the code section in RAM section will be marked read-only.

Harvard architecture microcontrollers
Many small microcontrollers (Microchip PIC, Atmel AVR, Intel 8051, Cypress PSoC, etc.) have a Harvard architecture.
They can only execute code from the program memory (flash or ROM).
It's possible to copy any byte from program memory to RAM.
However, (2) copying executable instructions from ROM to to RAM is not the answer -- with these small microcontrollers, the program counter always refers to some address in the program memory. It's not possible to execute code in RAM.
Copying data from ROM to RAM is pretty common.
When power is first applied, a typical firmware application zeros all the RAM and then copies the initial values of non-const global and static variables out of ROM into RAM just before main() starts.
Whenever the application needs to push a fixed string out the serial port, it reads that string out of ROM.
With early versions of these microcontrollers, an external "device programmer" connected to the microcontroller is the only way change the program.
In normal operation, the device was nowhere near a "device programmer".
If the software running on the microcontroller needed to write to program memory ROM -- sorry, too bad --
it was impossible.
Many embedded systems had non-volatile EEPROM that the code could write to -- but this was only for storing data values. The microcontroller could only execute code in the program ROM, not the EEPROM or RAM.
People did may wonderful things with these microcontrollers, including BASIC interpreters and bytecode Forth interpreters.
So apparently (1) code never needs to write to program memory.
With a few recent "self-programming" microcontrollers (from Atmel, Microchip, Cypress, etc.),
there's special hardware on the chip that allows software running on the microcontroller to erase and re-program blocks of its own program memory flash.
Some few applications use this "self-programming" feature to read and write data to "extra" flash blocks -- data that is never executed, so it doesn't count as self-modifying code -- but this isn't doing anything you couldn't do with a bigger EEPROM.
So far I have only seen two kinds of software running on Harvard-architecture microcontrollers that write new executable software to its own program Flash: bootloaders and Forth compilers.
When the Arduino bootloader (bootstrap loader) runs and detects that a new application firmware image is available, it downloads the new application firmware (into RAM), and writes it to Flash.
The next time you turn on the system it's now running shiny new version 16.98 application firmware rather than clunky old version 16.97 application firmware.
(The Flash blocks containing the bootloader itself, of course, are left unchanged).
This would be impossible without the "self-programming" feature of writing to program memory.
Some Forth implementations run on a small microcontroller, compiling new executable code and using the "self-programming" feature to store it in program Flash -- a process somewhat analogous to the JVM's "just-in-time" compiling.
(All other languages seem to require a compiler far too large and complicated to run on a small microcontroller, and therefore have a edit-compile-download-run cycle that takes much more wall clock time).

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse