Why is RAX not used to pass a parameter in System V AMD64 ABI? - x86-64

I don't understand what the benefit of not passing a parameter in RAX,
Since the return value is in RAX it is going to be clobbered by the callee anyway.
Can someone explain?

x86-64 System V does use AL for variadic functions: the caller passes the number of FP args in XMM registers.
(This is only an optimization to allow the callee to not dump all the vector regs into an array; the number in AL is allowed to be higher than the number of FP args. In practice, gcc's code-gen for variadic functions just checks if it's non-zero and dumps either none or all 8 of xmm0..7. I think the ABI guarantees that it's safe to always pass al=8 even if there aren't actually any FP args, and that you can't pass pass FP args on the stack instead by setting al=0)
But why not use r9b for that, and use RAX for the 6th arg? Or RAX for some earlier arg?
Because RAX has so many implicit uses in x86, and experiments when designing the calling convention (http://web.archive.org/web/20140414124645/http://www.x86-64.org/pipermail/discuss/2000-November/001257.html) found that using RAX tended to require extra instructions in the caller or callee. e.g. because RAX was often needed as part of computing other args in the caller, or was needed while doing something with one of the other args before the code gets around to using the arg that was passed in RAX.
RAX is used for rep stos (which gcc used to use more aggressively to inline memset), and it's used for div and widening (one-operand) mul/imul, which gcc uses for division by a compile-time constant. (Why does GCC use multiplication by a strange number in implementing integer division?).
Most of the other RAX special uses are just shorter encodings of things you can also do with other registers, like cdqe vs. movsxd rax, eax (or between any other registers). Or add eax,imm32 (no ModRM) vs. add r/m32, imm32 (or most other ALU instructions). See one of my answers on
Tips for golfing in x86/x64 machine code. Original 8086 lacked many of the longer non-AX alternatives, but between 8086 and 386, stuff like imul r32,r32 and movsx/movzx were added. Other RAX-only instructions aren't worth using when optimizing for speed (like xlatb, lodsd), or are obsolete by P6 / AMD64 extensions (lahf as part of FP compares obsoleted by fucomi and using SSE/SSE2 ucomisd for FP math), or are specialized instructions like cmpxchg or cpuid that are too rare to have an impact on calling convention design. Compilers didn't use the BCD instructions like aaa anyway, and AMD64 removed them.
The designers of the x86-64 System V calling convention (primarily Jan Hubička for the integer arg-passing register design) generally aimed to avoid registers with many / common implicit uses. rdx comes before rcx in the arg-passing order, because cl is needed for variable shift counts (without BMI2). These are maybe more common than mul and div, because 2-operand imul reg,reg allows normal non-widening multiplies without clobbering RDX:RAX.
The choice of rdi and rsi as the first 2 args was apparently motivated by inlining memset or memcpy as rep movs (which gcc did back in 2000, even though it wasn't actually a good choice in many of the cases where gcc did that). Even though rep-string instructions use RCX as the counter, they still found it on average saved instructions to pass the 3rd arg in RDX instead of RCX, so the calling convention doesn't quite work out for memcpy to be rep stosb/ret.
Jan Hubička evaluated multiple variations on arg-passing registers by compiling SpecInt with a then-current version of x86-64 gcc. See my answer on Why does Windows64 use a different calling convention from all other OSes on x86-64? for some more details and links.
One of the arg-register orders he evaluated was RAX, RDX, RCX, RBX, RSI, RDI, but he found that less good than other options. (See the mailing list message linked above).
It's fairly common for RISC calling conventions to pass the first arg in the first return-value register. ARM does this (r0), and I think so does PowerPC. Others (like MIPS) don't. But all of those architectures have no implicit uses of most integer registers, often just a link register and maybe the stack pointer.
x86-64 SysV and Windows do this for FP args: xmm0 for passing and returning.

Related

Compiling an OS and defining the system calls

I'm trying to better understand operating systems, not the theory behind them but how real people write real OS code.
I know most OS's are written in C. I know the source code for these OS's include calls to functions like malloc, calloc, etc, to allocate memory for a process, etc.
Under normal conditions, i.e, when compiling code destined to run on an OS, I know that the C compiler will use the underlying OS's system calls to execute these functions. But when compiling the source code for these OS's, how does the compiler know what to do. The system calls don't exist cause they're defined by the OS. Does the compiler just call some assembly routine, which will eventually become a system call?
It is complex because you need to understand several things about OS development to get the big picture. Overall, the OS isn't like a process which executes in-order like average user mode code. When the computer boots, the OS will execute in-order to set up its environment. After that, the OS is basically system calls and interrupts.
Each CPU works differently but most CPUs will have an interrupt table mechanism and a syscall mechanism. The interrupt table specifies where to jump for a certain interrupt number and the syscall mechanism is probably one register containing the address of the entry point for a syscall. It works like this on x86-64 (most desktop/laptop computers). x86-64 has the IDT (Interrupt Descriptor Table) and the syscall register is IA32_LSTAR.
The OS isn't written in C because you can call malloc() or anything. The OS is written in C because you can make C static and freestanding (all code needed is in the executable and it doesn't rely on any external software to the executable). Actually, when writing an OS, you cannot call malloc(). You need to avoid any standard library function implementations and use static and freestanding code (the base of C like structs, pointers, arithmetic, variables, etc). C is also used because you can modify arbitrary memory locations with pointers. For example,
unsigned int* ptr = (unsigned int*)0x1234;
*ptr = 0x87654321;
makes sure that address 0x1234 contains 0x87654321. You can also use binary operators (and, or, xor, shift, etc) to modify memory at the bit level.
To answer your question, if you want to define the system calls in an OS that you write yourself, you simply consider your syscalls to be that way. When you write your syscall handler, you consider that someone using your OS knows that a certain syscall number is requesting a certain operation so you oblige by doing that. For example, Linux uses SysV as a convention for syscall numbers and registers used to pass arguments to them (including the syscall number). On x86-64 Linux, from user mode, you put your syscall number in RAX and use the instruction syscall. The processor then looks in IA32_LSTAR for the address of the syscall handler and jumps to it. The processor (the core) is now in kernel mode (in the kernel). The kernel now looks at RAX for the syscall number and answers the request by the doing the associated operation (after several checkups and a bunch of other things).

Where does the value returned by the syscalls stored in linux? [duplicate]

Following links explain x86-32 system call conventions for both UNIX (BSD flavor) & Linux:
http://www.int80h.org/bsdasm/#system-calls
http://www.freebsd.org/doc/en/books/developers-handbook/x86-system-calls.html
But what are the x86-64 system call conventions on both UNIX & Linux?
Further reading for any of the topics here: The Definitive Guide to Linux System Calls
I verified these using GNU Assembler (gas) on Linux.
Kernel Interface
x86-32 aka i386 Linux System Call convention:
In x86-32 parameters for Linux system call are passed using registers. %eax for syscall_number. %ebx, %ecx, %edx, %esi, %edi, %ebp are used for passing 6 parameters to system calls.
The return value is in %eax. All other registers (including EFLAGS) are preserved across the int $0x80.
I took following snippet from the Linux Assembly Tutorial but I'm doubtful about this. If any one can show an example, it would be great.
If there are more than six arguments,
%ebx must contain the memory
location where the list of arguments
is stored - but don't worry about this
because it's unlikely that you'll use
a syscall with more than six
arguments.
For an example and a little more reading, refer to http://www.int80h.org/bsdasm/#alternate-calling-convention. Another example of a Hello World for i386 Linux using int 0x80: Hello, world in assembly language with Linux system calls?
There is a faster way to make 32-bit system calls: using sysenter. The kernel maps a page of memory into every process (the vDSO), with the user-space side of the sysenter dance, which has to cooperate with the kernel for it to be able to find the return address. Arg to register mapping is the same as for int $0x80. You should normally call into the vDSO instead of using sysenter directly. (See The Definitive Guide to Linux System Calls for info on linking and calling into the vDSO, and for more info on sysenter, and everything else to do with system calls.)
x86-32 [Free|Open|Net|DragonFly]BSD UNIX System Call convention:
Parameters are passed on the stack. Push the parameters (last parameter pushed first) on to the stack. Then push an additional 32-bit of dummy data (Its not actually dummy data. refer to following link for more info) and then give a system call instruction int $0x80
http://www.int80h.org/bsdasm/#default-calling-convention
x86-64 Linux System Call convention:
(Note: x86-64 Mac OS X is similar but different from Linux. TODO: check what *BSD does)
Refer to section: "A.2 AMD64 Linux Kernel Conventions" of System V Application Binary Interface AMD64 Architecture Processor Supplement. The latest versions of the i386 and x86-64 System V psABIs can be found linked from this page in the ABI maintainer's repo. (See also the x86 tag wiki for up-to-date ABI links and lots of other good stuff about x86 asm.)
Here is the snippet from this section:
User-level applications use as integer registers for passing the
sequence %rdi, %rsi, %rdx, %rcx,
%r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
A system-call is done via the syscall instruction. This clobbers %rcx and %r11 as well as the %rax return value, but other registers are preserved.
The number of the syscall has to be passed in register %rax.
System-calls are limited to six arguments, no argument is passed
directly on the stack.
Returning from the syscall, register %rax contains the result of
the system-call. A value in the range between -4095 and -1 indicates
an error, it is -errno.
Only values of class INTEGER or class MEMORY are passed to the kernel.
Remember this is from the Linux-specific appendix to the ABI, and even for Linux it's informative not normative. (But it is in fact accurate.)
This 32-bit int $0x80 ABI is usable in 64-bit code (but highly not recommended). What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? It still truncates its inputs to 32-bit, so it's unsuitable for pointers, and it zeros r8-r11.
User Interface: function calling
x86-32 Function Calling convention:
In x86-32 parameters were passed on stack. Last parameter was pushed first on to the stack until all parameters are done and then call instruction was executed. This is used for calling C library (libc) functions on Linux from assembly.
Modern versions of the i386 System V ABI (used on Linux) require 16-byte alignment of %esp before a call, like the x86-64 System V ABI has always required. Callees are allowed to assume that and use SSE 16-byte loads/stores that fault on unaligned. But historically, Linux only required 4-byte stack alignment, so it took extra work to reserve naturally-aligned space even for an 8-byte double or something.
Some other modern 32-bit systems still don't require more than 4 byte stack alignment.
x86-64 System V user-space Function Calling convention:
x86-64 System V passes args in registers, which is more efficient than i386 System V's stack args convention. It avoids the latency and extra instructions of storing args to memory (cache) and then loading them back again in the callee. This works well because there are more registers available, and is better for modern high-performance CPUs where latency and out-of-order execution matter. (The i386 ABI is very old).
In this new mechanism: First the parameters are divided into classes. The class of each parameter determines the manner in which it is passed to the called function.
For complete information refer to : "3.2 Function Calling Sequence" of System V Application Binary Interface AMD64 Architecture Processor Supplement which reads, in part:
Once arguments are classified, the registers get assigned (in
left-to-right order) for passing as follows:
If the class is MEMORY, pass the argument on the stack.
If the class is INTEGER, the next available register of the
sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used
So %rdi, %rsi, %rdx, %rcx, %r8 and %r9 are the registers in order used to pass integer/pointer (i.e. INTEGER class) parameters to any libc function from assembly. %rdi is used for the first INTEGER parameter. %rsi for 2nd, %rdx for 3rd and so on. Then call instruction should be given. The stack (%rsp) must be 16B-aligned when call executes.
If there are more than 6 INTEGER parameters, the 7th INTEGER parameter and later are passed on the stack. (Caller pops, same as x86-32.)
The first 8 floating point args are passed in %xmm0-7, later on the stack. There are no call-preserved vector registers. (A function with a mix of FP and integer arguments can have more than 8 total register arguments.)
Variadic functions (like printf) always need %al = the number of FP register args.
There are rules for when to pack structs into registers (rdx:rax on return) vs. in memory. See the ABI for details, and check compiler output to make sure your code agrees with compilers about how something should be passed/returned.
Note that the Windows x64 function calling convention has multiple significant differences from x86-64 System V, like shadow space that must be reserved by the caller (instead of a red-zone), and call-preserved xmm6-xmm15. And very different rules for which arg goes in which register.
Perhaps you're looking for the x86_64 ABI?
www.x86-64.org/documentation/abi.pdf (404 at 2018-11-24)
www.x86-64.org/documentation/abi.pdf (via Wayback Machine at 2018-11-24)
Where is the x86-64 System V ABI documented? - https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI is kept up to date (by HJ Lu, one of the ABI maintainers) with links to PDFs of the official current version.
If that's not precisely what you're after, use 'x86_64 abi' in your preferred search engine to find alternative references.
Linux kernel 5.0 source comments
I knew that x86 specifics are under arch/x86, and that syscall stuff goes under arch/x86/entry. So a quick git grep rdi in that directory leads me to arch/x86/entry/entry_64.S:
/*
* 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
*
* This is the only entry point used for 64-bit system calls. The
* hardware interface is reasonably well designed and the register to
* argument mapping Linux uses fits well with the registers that are
* available when SYSCALL is used.
*
* SYSCALL instructions can be found inlined in libc implementations as
* well as some other programs and libraries. There are also a handful
* of SYSCALL instructions in the vDSO used, for example, as a
* clock_gettimeofday fallback.
*
* 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
* then loads new ss, cs, and rip from previously programmed MSRs.
* rflags gets masked by a value from another MSR (so CLD and CLAC
* are not needed). SYSCALL does not save anything on the stack
* and does not change rsp.
*
* Registers on entry:
* rax system call number
* rcx return address
* r11 saved rflags (note: r11 is callee-clobbered register in C ABI)
* rdi arg0
* rsi arg1
* rdx arg2
* r10 arg3 (needs to be moved to rcx to conform to C ABI)
* r8 arg4
* r9 arg5
* (note: r12-r15, rbp, rbx are callee-preserved in C ABI)
*
* Only called from user space.
*
* When user can change pt_regs->foo always force IRET. That is because
* it deals with uncanonical addresses better. SYSRET has trouble
* with them due to bugs in both AMD and Intel CPUs.
*/
and for 32-bit at arch/x86/entry/entry_32.S:
/*
* 32-bit SYSENTER entry.
*
* 32-bit system calls through the vDSO's __kernel_vsyscall enter here
* if X86_FEATURE_SEP is available. This is the preferred system call
* entry on 32-bit systems.
*
* The SYSENTER instruction, in principle, should *only* occur in the
* vDSO. In practice, a small number of Android devices were shipped
* with a copy of Bionic that inlined a SYSENTER instruction. This
* never happened in any of Google's Bionic versions -- it only happened
* in a narrow range of Intel-provided versions.
*
* SYSENTER loads SS, ESP, CS, and EIP from previously programmed MSRs.
* IF and VM in RFLAGS are cleared (IOW: interrupts are off).
* SYSENTER does not save anything on the stack,
* and does not save old EIP (!!!), ESP, or EFLAGS.
*
* To avoid losing track of EFLAGS.VM (and thus potentially corrupting
* user and/or vm86 state), we explicitly disable the SYSENTER
* instruction in vm86 mode by reprogramming the MSRs.
*
* Arguments:
* eax system call number
* ebx arg1
* ecx arg2
* edx arg3
* esi arg4
* edi arg5
* ebp user stack
* 0(%ebp) arg6
*/
glibc 2.29 Linux x86_64 system call implementation
Now let's cheat by looking at a major libc implementations and see what they are doing.
What could be better than looking into glibc that I'm using right now as I write this answer? :-)
glibc 2.29 defines x86_64 syscalls at sysdeps/unix/sysv/linux/x86_64/sysdep.h and that contains some interesting code, e.g.:
/* The Linux/x86-64 kernel expects the system call parameters in
registers according to the following table:
syscall number rax
arg 1 rdi
arg 2 rsi
arg 3 rdx
arg 4 r10
arg 5 r8
arg 6 r9
The Linux kernel uses and destroys internally these registers:
return address from
syscall rcx
eflags from syscall r11
Normal function call, including calls to the system call stub
functions in the libc, get the first six parameters passed in
registers and the seventh parameter and later on the stack. The
register use is as follows:
system call number in the DO_CALL macro
arg 1 rdi
arg 2 rsi
arg 3 rdx
arg 4 rcx
arg 5 r8
arg 6 r9
We have to take care that the stack is aligned to 16 bytes. When
called the stack is not aligned since the return address has just
been pushed.
Syscalls of more than 6 arguments are not supported. */
and:
/* Registers clobbered by syscall. */
# define REGISTERS_CLOBBERED_BY_SYSCALL "cc", "r11", "cx"
#undef internal_syscall6
#define internal_syscall6(number, err, arg1, arg2, arg3, arg4, arg5, arg6) \
({ \
unsigned long int resultvar; \
TYPEFY (arg6, __arg6) = ARGIFY (arg6); \
TYPEFY (arg5, __arg5) = ARGIFY (arg5); \
TYPEFY (arg4, __arg4) = ARGIFY (arg4); \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg6, _a6) asm ("r9") = __arg6; \
register TYPEFY (arg5, _a5) asm ("r8") = __arg5; \
register TYPEFY (arg4, _a4) asm ("r10") = __arg4; \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4), \
"r" (_a5), "r" (_a6) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
which I feel are pretty self explanatory. Note how this seems to have been designed to exactly match the calling convention of regular System V AMD64 ABI functions: https://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions
Quick reminder of the clobbers:
cc means flag registers. But Peter Cordes comments that this is unnecessary here.
memory means that a pointer may be passed in assembly and used to access memory
For an explicit minimal runnable example from scratch see this answer: How to invoke a system call via syscall or sysenter in inline assembly?
Make some syscalls in assembly manually
Not very scientific, but fun:
x86_64.S
.text
.global _start
_start:
asm_main_after_prologue:
/* write */
mov $1, %rax /* syscall number */
mov $1, %rdi /* stdout */
mov $msg, %rsi /* buffer */
mov $len, %rdx /* len */
syscall
/* exit */
mov $60, %rax /* syscall number */
mov $0, %rdi /* exit status */
syscall
msg:
.ascii "hello\n"
len = . - msg
GitHub upstream.
Make system calls from C
Here's an example with register constraints: How to invoke a system call via syscall or sysenter in inline assembly?
aarch64
I've shown a minimal runnable userland example at: https://reverseengineering.stackexchange.com/questions/16917/arm64-syscalls-table/18834#18834 TODO grep kernel code here, should be easy.
Calling conventions defines how parameters are passed in the registers when calling or being called by other program. And the best source of these convention is in the form of ABI standards defined for each these hardware. For ease of compilation, the same ABI is also used by userspace and kernel program. Linux/Freebsd follow the same ABI for x86-64 and another set for 32-bit. But x86-64 ABI for Windows is different from Linux/FreeBSD. And generally ABI does not differentiate system call vs normal "functions calls".
Ie, here is a particular example of x86_64 calling conventions and it is the same for both Linux userspace and kernel: http://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/ (note the sequence a,b,c,d,e,f of parameters):
Performance is one of the reasons for these ABI (eg, passing parameters via registers instead of saving into memory stacks)
For ARM there is various ABI:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.abi/index.html
https://developer.apple.com/library/ios/documentation/Xcode/Conceptual/iPhoneOSABIReference/iPhoneOSABIReference.pdf
ARM64 convention:
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf
For Linux on PowerPC:
http://refspecs.freestandards.org/elf/elfspec_ppc.pdf
http://www.0x04.net/doc/elf/psABI-ppc64.pdf
And for embedded there is the PPC EABI:
http://www.freescale.com/files/32bit/doc/app_note/PPCEABI.pdf
This document is good overview of all the different conventions:
http://www.agner.org/optimize/calling_conventions.pdf

What makes read() a syscall?

The following link says that read is a syscall:
What is the difference between read() and fread()?
Now, I am trying to understand what makes read a system call.
For example:
I use Nuttx OS and registered a device structure flash_dev (path '/dev/flash0') with open, close and ioctl methods. This is added as a inode in pesudo file system with semaphore support for mutual exclusion.
Now, from application I open ('/dev/flash0') and do read & ioctls.
Now, which part in the above process makes read a syscall?
The read() function is a thin wrapper around whatever instructions are necessary to call into the system, IOW, to make a system call. When you call read() (and fread() call it as well), the relevant kernel/driver code gets invoked and does whatever is necessary to read from a file.
A system call is a call whose functionality lives almost entirely in the kernel rather than in user space. Traditionally, open(), read(), write(), etc, are in the kernel whereas fread(), fwrite(), etc, have code that runs in user space that calls into the kernel as needed.
For example, in Linux when you call read() the standard library your application linked against might do the following:
mov eax, 3 ;3 -> read
mov ebx, 2 ;file id
mov ecx, buffer
mov edx, 5 ;5 bytes
int 80h
That's it - it simply takes the parameters you passed in and invokes the kernel via the int 80h (interrupt) instruction. As an application programmer, it's not usually important whether the call runs in user space, in the kernel, or both. It can be important for debugging or performance reasons, but for simple applications it really doesn't matter much.

Is it possible to overwrite %eax using buffer overflow?

I know that a program stack looks somewhat like this (from high to low):
EIP | EBP | local variables
But where could I find %eax, and the other general registers? Is it possible to overwrite them using a buffer overflow?
Update: In the end, I did not even have to overwrite %eax, because it turned out that the program pointed %eax to the user input at some point.
A register, by definition, is not in RAM. Registers are in the CPU and do not have addresses, so you cannot overwrite them with a buffer overflow. However, there are very few registers, so the compiler really uses them as a kind of cache for the most used stack elements. This means that while you cannot overflow into registers stricto sensu, values overwritten in RAM will sooner or later be loaded into registers.
(In Sparc CPU, the registers-as-stack-cache policy is even hardwired.)
In your schema, EIP and EBP are not on the stack; the corresponding slots on the stack are areas from which these two registers will be reloaded (upon function exit). EAX, on the other hand, is a general purpose register that the code will use here and there, without a strict convention.
EAX will probably never appear on the stack. For most x86 compilers EAX is the 32-bit return-value register, and is never saved on the stack nor restored from the stack (RAX in 64-bit systems).
This is not to say that a buffer overflow cannot be used to alter the contents of EAX by putting executable code on the stack; if code execution on the stack has not been disabled by the OS, this can be done, or you can force a bogus return address on the stack which transfers control to a code sequence that loads a value into EAX, but these are quite difficult to pull off. Similarly, if the value returned by a function is known to be stored in a local variable, a stack smash that altered that variable would change the value copied to EAX, but optimizing compilers can change stack layout on a whim, so an exploit that works on one version may fail completely on a new release or hotfix.
See Thomas Pornin (+1) and flouder's (+1) answers, they are good. I want to add a supplementary answer that may have been alluded to but not specifically stated, and that is register spilling.
Though the "where" of the original question (at least as worded) appears to be based on the false premise that %eax is on the stack, and being a register, it isn't part of the stack on x86 (though you can emulate any hardware register set on a stack, and some architectures actually do this, but that isn't relevant), incidentally, registers are spilled/filled from the stack frequently. So it is possible to smash the value of a register with a stack overflow if the register has been spilled to the stack. This would require you to know the spilling mechanism of the particular compiler, and for that function call, you would need to know that %eax had been spilled, where it had been spilled to, and stomp that stack location, and when it is next filled from its memory copy, it gets the new value. As unlikely as this seems, these attacks are usually inspired by reading source code, and knowing something about the compiler in question, so it isn't really that far fetched.
See this for more reading on register spilling
gcc argument register spilling on x86-64
https://software.intel.com/en-us/articles/dont-spill-that-register-ensuring-optimal-performance-from-intrinsics

why is there severals encodings for one instruction in ARMv7

I am currently trying to implement a disassembler for the ARM cortex A9, which implement the ARMv7 instruction set.
For that I am using the manual "DDI0406C_b_arm_architecture_reference_manual.pdf" that can be download here (after having registered on arm website) :
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.architecture/index.html
In this manual, in the part A8.8 with the instructions details, I can't understand why there is several encoding for one instruction (like A1, A2, ...), that all seem to be implemented with ARMv7.
Also, as the ARM cortex A9 used thumb-2, does it also implement the A1/A2/... encodings, or only the T1/T2... ?
I really read all parts of this manual are related to encodings, but I still don't understand how we can know which encoding is used for a program.
Different encoding of an instruction do functionally different things.
One example for usage of different encodings is A8.9.12 ADR
This instruction adds an immediate value to the PC value to form a PC-relative address, and writes the result to the
destination register.
If instruction is encoded as A1 then offset must be interpreted as zero or positive, if it is encoded as A2 then offset is negative.
Another example is A8.8.132 POP
If the list contains more than one register, the instruction is assembled to encoding A1. If the list contains exactly one register, the instruction is assembled to encoding A2.
I can imagine different POP encodings are created probably to create different microcodes for performance reasons.
For the second part of your question, Cortex-A9 is an ARMv7-A architecture CPU and it supports all the instructions as specified in the manual you pointed. May be you should also read Cortex™-A9 Technical Reference Manual.
There is no way to really distinguish between ARM and Thumb inside the instruction-stream. You can only decide based on the way a function gets called (if the lowest bit is set to 1 then it's thumb, otherwise arm).
The ARM-Encoding are quite "stable", you'll only find a few A1 encodings, BLX is an example where a A2 encoding is given, but this is mainly because the new ARM-ARM is structured differently from the older ones. BL and BLX were two different instructions, BLX was added in additional instruction space (the upper 4 bits which are normaly used for conditions are set to 1111, which in ARM prior to v5 meant "never execute".
For the Thumb-Encodings it's different, there are a lot of them, because they had to be placed in a more compressed instruction-space, page A6-220 has information about how to decide whatever an thumb instruction consists of two or just one halfword.
The Ax encodings are arm when the processor is in arm mode it will decode bits it finds using those encodings. if there is more than one A1, A2, it should be obvious that there is a different feature or reason for that. those two instructions can be considered separate (look at the overuse of the mov instruction in x86 for example, it has many encodings). Treat each encoding as a separate "instruction".
Then there are the Tx variants, those are thumb and thumb2 extensions. The thumb are all 16 bit (the bl can be decoded as two separate 16 bit instructions) and the descriptions below them indicate "all thumb variants" or "armv4t to the present" or some such language. The thumb2 extensions are all 32 bit, the first 16 bits being an undefined instruction in the thumb world. These have more limitations on what architectures support them.
You are not going to be able to completely create a disassembler for one of these processors, for the same reason you cant do one for x86 or many other processors (all?). If you assume that all the instructions are one mode (arm or thumb or thumb+thumb2) but no mode mixing (arm+thumb) then you can because everything is fixed instruction length and you can simply disassemble everything data and code and you wont run into any problems. In order to disassemble mixed mode you have to basically emulate/execute the instructions and follow instruction flow (just like a variable word length instruction set disassembler) to try to find the transitions, problem here of course the transitions are multi instruction at a minimum load a register then bx that register, sometimes there is math involved on the instruction computation, and there is no guarantee that the address computation or load happens the instruction before the bx. So you could do some of that and get a long way through disassembling the program.
If thumb2 is supported/allowed on the processor you are using then you have the variable instruction length problem for times that you detect entry points to thumb code. And unless you already are doing this you have to follow execution of the code to determine where instructions start (elementary variable instruction length disassembly stuff).
The combination of technical reference manual and architectural reference manual will tell you if the architecture and implementation of that architecture (trm) allow arm and thumb modes. I would assume an A9 supports arm thumb and thumb2, all three.
The cortex-m family is the only one so far that is limited to not supporting arm, and their thumb2 varies widely as the cortex-m0 (and m1) are armv6m and the m3 and m4 are armv7m (A few dozen (armv6m) instructions to many dozen thumb2 extensions in the armv7m). There are separate architectural reference manuals specifically for the -m variants, armv7-m vs the armv7-ar manuals for example.