RISC-V: Immediate Encoding Variants - cpu-architecture

In the RISC-V Instruction Set Manual, User-Level ISA, I couldn't understand section 2.3 Immediate Encoding Variants page 11.
There is four types of instruction formats R, I, S, and U, then there is a variants of S and U types which are SB and UJ which I suppose mean Branch and Jump as shown in figure 2.3. Then there is the types of Immediate produced by RISC-V instructions shown in figure 2.4.
So my questions are, why the SB and UJ are needed? and why shuffle the Immediate bits in that way? what does it mean to say "the Immediate produced by RISC-V instructions"? and how are they produced in this manner?

To speed up decoding, the base RISC-V ISA puts the most important fields in the same place in every instruction. As you can see in the instruction formats table,
The major opcode is always in bits 0-6.
The destination register, when present, is always in bits 7-11.
The first source register, when present, is always in bits 15-19.
The second source register, when present, is always in bits 20-24.
The other bits are used for the minor opcode or other data for the instruction (funct3 in bits 12-14 and funct7 in bits 25-31), and for the immediate. How many bits can be used for the immediate depends on how many register numbers are present in the instruction:
Instructions with one destination and two source registers (R-type) have no immediate, for instance adding two registers (ADD);
Instructions with one destination and one source register (I-type) have 12 bits for the immediate, for instance adding one register with an immediate (ADDI);
Instructions with two source registers and no destination register (S-type), for instance the store instructions, have also 12 bits for the immediate, but they have to be in a different place since the register numbers are also in a different place;
Finally, instructions with only a destination register and no minor opcode (U-type), for instance LUI, can use 20 bits for the immediate (the major opcode and the destination register number together need 12 bits).
Now think from the other point of view, of the instructions which will use these immediate values. The simplest users, I-immediate and S-immediate, need only a sign-extended 12-bit value. The U-immediate instructions need the immediate in the upper 20 bits of a 32-bit value. Finally, the branch/jump instructions need the sign-extended immediate in the lower bits of the value, except for the lowest bit which will always be zero, since RISC-V instructions are always aligned to even addresses.
But why are the immediate bits shuffled? Think this time about the physical circuit which decodes the immediate field. Since it's a hardware implementation, the bits will be decoded in parallel; each bit in the output immediate will have a multiplexer to select which input bit it comes from. The bigger the multiplexer, the costlier and slower it is.
The "shuffling" of the immediate bits in the instruction encoding, therefore, is to make each output immediate bit have as little input instruction bit options as possible. For instance, immediate bit 1 can only come from instruction bits 8 (S-immediate or B-immediate), 21 (I-immediate or J-immediate), or constant zero (U-immediate or R-type instruction which has no immediate). Immediate bit 0 can come from instruction bits 7 (S-immediate), 20 (I-immediate), or constant zero. Immediate bit 5 can only come from instruction bit 25 or constant zero. And so on.
Instruction bit 31 is a special case: for RV-64, bits 32-63 of the immediate are always copies of instruction bit 31. This high fan-out adds a delay, which would be even bigger if it also needed a multiplexer, so it only has one option (other than constant zero, which can be treated later in the pipeline by ignoring the whole immediate).
It's also interesting to note that only the major opcode (bits 0-6) is needed to know how to decode the immediate, so immediate decoding can be done in parallel with decoding the rest of the instruction.
So, answering the questions:
SB-type doubles the range of branches, since instructions are always aligned to even addresses;
UJ-type has the same overall instruction format as U-type, but the immediate value is in the lower bits instead of the upper bits;
The immediate bits are shuffled to reduce the cost of decoding the immediate value, by reducing the number of choices for each output immediate bit;
The "immediate produced by RISC-V instructions" table shows the different kinds of immediate values which can be decoded from a RISC-V instruction, and from where in the instruction each bit comes from;
They are produced by, for each output immediate bit, using the major opcode (bits 0-6) to chose an input instruction bit.

The encoding is done to try and make the actual hardware implementation as simple as possible, rather than make it easy for the reader to understand at a glance.
In practice the compiler will generate the output and so it does not matter if it is not easy for the user to understand.
When possible the SB type tries to use the same bits for the same immediate bit positions as type S, that minimizes the hardware design complexity. So imm[4:1] and imm[10:5] are in the same place for both. The top most bit of the immediate values is always at position 31 so that you can use that bit to decide if a sign extension is needed. Again, this makes the hardware easier because for multiple types of instruction the top bit is used to decide on sign extension.

The RISC-V instruction encoding is chosen to simplify the decoder
2.2 Base Instruction Formats
The RISC-V ISA keeps the source (rs1 and rs2) and destination (rd) registers at the same position in all formats to simplify decoding. Except for the 5-bit immediates used in CSR instructions(Chapter 9), immediates are always sign-extended, and are generally packed towards the left most available bits in the instruction and have been allocated to reduce hardware complexity. In particular, the sign bit for all immediates is always in bit 31 of the instruction to speed sign-extension circuitry.
2.3 Immediate Encoding Variants
The only difference between the S and B formats is that the 12-bit immediate field is used to encode branch offsets in multiples of 2 in the B format. Instead of shifting all bits in the instruction-encoded immediate left by one in hardware as is conventionally done, the middle bits (imm[10:1]) and sign bit stay in fixed positions, while the lowest bit in S format (inst[7]) encodes a high-order bit in B format.
Similarly, the only difference between the U and J formats is that the 20-bit immediate is shiftedleft by 12 bits to form U immediates and by 1 bit to form J immediates. The location of instructionbits in the U and J format immediates is chosen to maximize overlap with the other formats andwith each other.
https://riscv.org/technical/specifications/
The reason for the shuffling of the immediate in SB/UL formats has also been explained in the RISC-V spec
Although more complex implementations might have separate adders for branch and jump calculations and so would not benefit from keeping the location of immediate bits constant across types of instruction, we wanted to reduce the hardware cost of the simplest implementations. By rotating bits in the instruction encoding of B and J immediates instead of using dynamic hard-ware muxes to multiply the immediate by 2, we reduce instruction signal fanout and immediate mux costs by around a factor of 2. The scrambled immediate encoding will add negligible timeto static or ahead-of-time compilation. For dynamic generation of instructions, there is some small additional overhead, but the most common short forward branches have straight forward immediate encodings.

Related

Orthogonality of Instruction Set Architecture

I am studying the difference between CISC and RISC recently, and I've encountered into the term "Orthogonality". After doing some research, my understanding so far is that there are two "axes", addressing modes & operations, which are independent of each other, so it produces a maximum number of (#addressing modes * #operations) instructions in the ISA.
For CISC machine, which is a register-memory architecture, operands may come from register or memory and RISC a register-register(or load-store) one on the contrary.
So, what's the role of orthogonality playing in these two ISA? Is CISC more orthogonal than RISC or vice versa?
As the wiki describes, "Modern CPUs often simulate orthogonality in a preprocessing step before performing the actual tasks in a RISC-like core. This "simulated orthogonality" in general is a broader concept, encompassing the notions of decoupling and completeness in function libraries, like in the mathematical concept: an orthogonal function set is easy to use as a basis into expanded functions, ensuring that parts don’t affect another if we change one part." What does this paragraph mean? What is the preprocessing step, does it have anything to do with the microcode?
Any explanation are appreciated! Thanks a lot!
Maximizing total choices of possible instructions like a CISC is generally not what's meant. Instead it's more about being a simpler compiler target, without complex interactions in what makes an instruction legal or not. RISC machines are often highly orthogonal, and designed with being a compiler target in mind, not human programmers.
My understanding of the term is that orthogonality is more about any register being usable in any case where any other register is usable. Unlike x86 shl reg, cl where variable-count shifts require a specific register. (I know this is a RISC-V question, but the examples of non-orthogonality I know of come from other ISAs, primarily x86.)
And definitely not like 8086 (before 386), where if you needed to multiply, one of the operands had to be in the accumulator, AL or AX. And sign-extension was also only available there. 386 introduced movsx reg, r/m8 and r/m16. (And movzx, allowing easy and more efficient zero-extending of a byte from memory into SI or DI, without having to load 2 bytes and and si, 0x00ff.)
Even worse, 16-bit addressing modes only allow a few registers in very limited ways: [bp|bx] + [si|di] + disp0/8/16, vs. 32-bit addressing modes allowing stuff like lea eax, [ecx + ecx + 3] to use the same register twice, or address memory relative to the stack pointer without having to copy it to the base pointer (BP) register.
Or if some memory operands can use a certain addressing mode, can all memory operands use it? AArch64 ldp/stp (load-pair/store-pair of registers) I think has fewer available addressing modes than single-register loads, because it needs 5 extra bits for a second register number. Unlike ARM32 ldrd where the pair of registers is two contiguous registers, starting with an even number.
In general, the less interaction there is between a choice of one thing (like instruction) and the possible choices for another (a register), the more orthogonal.
One of the major benefits with this is being a simple compiler target. The most optimal code can more often be found with a greedy algorithm that only takes into a account one thing at a time, not interlocking tradeoffs. Not like x86-64 "if I use ECX instead of R9d for this variable, that'll save bytes in multiple instructions not needing REX prefix, but later mean I need an extra mov to copy a register for a shift count". (x86 BMI2 introduced variable-count shifts that can use a count from any register, like shlx ebx, eax, r15d)
Or far worse targeting 8086 or 286, where 16-bit addressing modes impose a lot more constraints on register allocation. And you'd more often you'd want to use instructions that needed their operands in specific registers, especially the accumulator.
But if you're not worried about every byte of code size, x86-64 is a fairly orthogonal ISA, usually you don't need to care about which register you use for what. One change in that direction beyond 386's important changes was making the low byte of every register addressable, like bpl, spl, sil, dil as the low bytes of RBP, RSP, RSI, RDI. (But those require REX prefixes, overlapping encodings with AH/CH/DH/BH which are only usable in instructions without REX prefixes.)
Another example of non-orthogonality is x86's notorious integer SIMD extensions, MMX and SSE2. Want to do minimum of unsigned integers 16 bytes at a time? In SSE2 we have pminub for unsigned byte elements. And pminsw, signed 16-bit elements. But no other combination of size and signedness until SSE4.1, several years later, which filled in the gaps allowing signed bytes and u16, as well as i32 and u32. And then AVX-512 added i64 and u64. Every min available always had a corresponding max, but other than that, SSE2 was highly non-orthogonal in that and many other ways, including signed/unsigned saturating add/sub, and pack of wider to narrower elements with signed or unsigned saturation. And FP vs. integer shuffles, e.g. there's no integer equivalent to shufps that takes two elements from one vector, two from another, using an immediate control operand. Fortunately for shuffles you can use FP shuffles on integer data.
x86 SIMD is still not very orthogonal in many ways, for example in integer multiply where not all combinations of element size are available for everything; 16-bit has 16x16 => 16-bit low half, signed high half, or unsigned high half. (And a widening multiply and horizontal-add, pmaddwd). 32-bit has signed and unsigned widening 32x32 => 64-bit, and with SSE4.1 also non-widening. 8-bit only has a multiply and horizontal-add where one operand is treated as signed, the other as unsigned.
Again, if I'm picking on x86 a lot, it's because it's what I know. And Intel painted a huge "kick me" sign on their back when they designed MMX and SSE2, only taking some steps to fix things later with SSE4.1. (I'm sure there are reasons for some of those choices, including transistor budget and opcode coding-space in x86's notoriously cramped machine-code.) But a lot of programs don't want to assume SSE4.1 as a requirement to run at all, even now, over a decade since the first SSE4.1 CPUs. Most other SIMD ISAs are more orthogonal than x86, like ARM NEON or PowerPC AltiVec.
Anyway, in general, it's more orthogonal if all operations are available in all combinations of size and signedness that exist for any operation. This isn't always a big deal for compilers per-se, more for humans not realizing that a compiler could make their code faster if this variable was unsigned or something.
Modern CPUs often simulate orthogonality in a preprocessing step before performing the actual tasks in a RISC-like core
That sounds like they're talking about decoding to uops, but I don't see how that would gain orthogonality.
Unless they're counting the concept of any instruction allowing a memory source operand as being more orthogonal. Normally you wouldn't, being a load/store architecture is basically a fixed constraint that doesn't make other things harder.
But if you do consider that more orthogonal, then yes, decoding add eax, [rdi] to 2 uops lets it run on a back-end that separates the load work from the store work, like a RISC.
I hadn't heard this term orthogonal instruction set before, however:
The VAX is perhaps the epitome of CISC.  The VAX supports many addressing modes, ranging from register itself, to memory specified by various indexing computations (some including pointer advancement, so as to do *p++ or *--p).
The VAX allows all addressing modes for all operands of any instruction.  Further, the VAX allows both 2 operand and 3 operand instructions, so addl2 is operand2 += operand1, and addl3 is operand3 = operand1 + operand2.
Basically it can encode a lot of stuff in a single instruction, so we can do for example, a[i] = *p++ + b[j]; in one instruction, assuming a, b, i, j, and p are in registers.
Other CISC-style processors limit the encoding, for example, so that we can only do two-operand instructions (no 3 operand), and some even limit the 2nd operand to a register, so only one memory operand.  I believe this is what they're getting at with the term orthogonal or not.
Meanwhile, a RISC processor instead follows a load/store architecture.  Access to memory is not allowed for any operand, but rather only via load and store instructions, and only with those instructions are there addressing modes.  Most all arithmetic operations (except the add for addressing) happen between registers alone.  (In some sense the RISC philosophy has an orthogonality since all arithmetic operations work on registers alone.)
I don't think the term orthogonality is of high value.  I wouldn't dwell on the term itself, but rather take away from that article the comparison between CISC ala VAX, vs. others CISC, vs. RISC.
#Peter also makes a good points, such as that certain registers being hard code (i.e. an implicit source/target) in some architectures for some instructions, which reduces orthogonality.
By that point I might stress that RISC architectures generally don't hard code registers, though MIPS hard codes the return address register ($31) for the jal instruction whereas RISC V does not ($sp and $ra are hard coded but only in the compressed instruction extension).  Whereas some CISC architectures (except VAX) hard code more registers. 
The MC68000 divides the registers into two sets of 8: addressing registers and data registers, which helps encoding by providing 16 registers with only 3-bit register fields, but also limits what you can do with them (and there aren't enough address registers, since one is the stack pointer and another the global pointer, leaving only 6).
CISC architecture often support byte vs. word sized arithmetic, whereas RISC architectures usually support only word sized arithmetic, so if you want byte, you have to simulate it (i.e. with range check or other).

Why are registers restrained about the number of registers?

I wonder why register must be only 32.
I know vaguely about the reason but i want to know more exactly.
Let's look at what we have with 32 general purpose integer-oriented registers, and what would happen if we went to 256 registers:
Diminishing returns
Normal compiled code demonstrates that with 32 registers, most function leave some of the registers unused.  So, adding more registers than 32 doesn't help most code.
Encoding size
On a register machine, binary operators like addition, subtraction, comparison, others require three operands: left source, right source, and target.  On a RISC machine, each of these uses a register operand, so that means 3 register operands in one instruction.  This means that 3 x 5 bits = 15 bits are used in such an instruction on a machine with 32 registers.
If we were to increase the number of registers to, say 256, then we would need 8 bits for each register operand.  That would mean 3 x 8 bits = 24 bits.  Instructions become larger, and this decreases the efficiency of the instruction cache — a critical component to performance.
Many instruction sets do have more than 32 registers
They add specialized registers, such as a whole second set for floating point, and also another set of extra wide registers for SIMD and vector operations.
In context, these additional register sets don't necessarily suffer the same code expansion as described above because these additional register sets don't intermix with each other: in other words we can have 32 integer registers and also 32 floating point registers, and still maintain 5 bit register fields in the instructions, because the instructions involved know which register set they are using and don't support mixing of the register sets in the same instruction.
Also, to be clear, many instruction sets have used different numbers of registers, many less than 32 yet some more than 32.

32-1024 bit fixed point vector arithmetic with AVX-2

For a mandelbrot generator I want to used fixed point arithmetic going from 32 up to maybe 1024 bit as you zoom in.
Now normaly SSE or AVX is no help there due to the lack of add with carry and doing normal integer arithmetic is faster. But in my case I have literally millions of pixels that all need to be computed. So I have a huge vector of values that all need to go through the same iterative formula over and over a million times too.
So I'm not looking at doing a fixed point add/sub/mul on single values but doing it on huge vectors. My hope is that for such vector operations AVX/AVX2 can still be utilized to improve the performance despite the lack of native add with carry.
Anyone know of a library for fixed point arithmetic on vectors or some example code how to do emulate add with carry on AVX/AVX2.
FP extended precision gives more bits per clock cycle (because double FMA throughput is 2/clock vs. 32x32=>64-bit at 1 or 2/clock on Intel CPUs); consider using the same tricks that Prime95 uses with FMA for integer math. With care it's possible to use FPU hardware for bit-exact integer work.
For your actual question: since you want to do the same thing to multiple pixels in parallel, probably you want to do carries between corresponding elements in separate vectors, so one __m256i holds 64-bit chunks of 4 separate bigintegers, not 4 chunks of the same integer.
Register pressure is a problem for very wide integers with this strategy. Perhaps you can usefully branch on there being no carry propagation past the 4th or 6th vector of chunks, or something, by using vpmovmskb on the compare result to generate the carry-out after each add. An unsigned add has carry out of a+b < a (unsigned compare)
But AVX2 only has signed integer compares (for greater-than), not unsigned. And with carry-in, (a+b+c_in) == a is possible with b=carry_in=0 or with b=0xFFF... and carry_in=1 so generating carry-out is not simple.
To solve both those problems, consider using chunks with manual wrapping to 60-bit or 62-bit or something, so they're guaranteed to be signed-positive and so carry-out from addition appears in the high bits of the full 64-bit element. (Where you can vpsrlq ymm, 62 to extract it for addition into the vector of next higher chunks.)
Maybe even 63-bit chunks would work here so carry appears in the very top bit, and vmovmskpd can check if any element produced a carry. Otherwise vptest can do that with the right mask.
This is a handy-wavy kind of brainstorm answer; I don't have any plans to expand it into a detailed answer. If anyone wants to write actual code based on this, please post your own answer so we can upvote that (if it turns out to be a useful idea at all).
Just for kicks, without claiming that this will be actually useful, you can extract the carry bit of an addition by just looking at the upper bits of the input and output values.
unsigned result = a + b + last_carry; // add a, b and (optionally last carry)
unsigned carry = (a & b) // carry if both a AND b have the upper bit set
| // OR
((a ^ b) // upper bits of a and b are different AND
& ~r); // AND upper bit of the result is not set
carry >>= sizeof(unsigned)*8 - 1; // shift the upper bit to the lower bit
With SSE2/AVX2 this could be implemented with two additions, 4 logic operations and one shift, but works for arbitrary (supported) integer sizes (uint8, uint16, uint32, uint64). With AVX2 you'd need 7uops to get 4 64bit additions with carry-in and carry-out.
Especially since multiplying 64x64-->128 is not possible either (but would require 4 32x32-->64 products -- and some additions or 3 32x32-->64 products and even more additions, as well as special case handling), you will likely not be more efficient than with mul and adc (maybe unless register pressure is your bottleneck).As
As Peter and Mystical suggested, working with smaller limbs (still stored in 64 bits) can be beneficial. On the one hand, with some trickery, you can use FMA for 52x52-->104 products. And also, you can actually add up to 2^k-1 numbers of 64-k bits before you need to carry the upper bits of the previous limbs.

Indexed addressing mode and implied addressing mode

Indexed addressing mode is usually used for accessing arrays as arrays are stored contiguosly. We have a index register which gets incremented in every iteration which when added to base address gives the array element address.
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing ? We have the base address and we can just add 1 to it every time when accessing. Why do we need indexed addressing mode which has a overhead of index register ?
I am not sure about the instruction format for implied addressing mode. Suppose we have a instruction INC AC. Is the address of AC specified in the instruction or is there a special opcode which means 'INC AC' and we don't include the address of AC (accumulator)?
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing?
You can; MIPS only has one addressing mode and compilers can still generate code for it just fine. But sometimes it has to use an extra shift + add instruction to calculate an address (if it's not just looping through an array).
The point of addressing modes is to save instructions and save registers, especially in 2-operand instruction sets like x86, where add eax, ecx overwrites eax with the result (eax += ecx), unlike MIPS or other 3-instruction ISAs where addu $t2, $t1, $t0 does t2 = t1 + t0. On x86, that would require a copy (mov) and an add. (Or in that special case, lea edx, [eax+ecx]: x86 can copy-and-add (and shift) using the same instruction-encoding it uses for memory operands.)
Consider a histogram problem: you generate array indices in unpredictable order, and have to index an array. On x86-64, add dword [rbx + rdi*4], 1 will increment a 32-bit counter in memory using a single 4-byte instruction, which decodes to only 2 uops for the front-end to issue into the out-of-order core on modern Intel CPUs. (http://agner.org/optimize/). (rbx is the base register, rdi is a scaled index). Having a scaled index is very powerful; x86 16-bit addressing modes support 2 registers, but not a scaled index.
Classic MIPS only has separate shift and add instructions, although MIPS32 did add a scaled-add instruction for address calculation. That would save an instruction here. Being a load-store machine, the loads and stores always have to be separate instructions (unlike on x86 where that add decodes as a micro-fused load+add and a store. See INC instruction vs ADD 1: Does it matter?).
Probably ARM would be a better comparison for MIPS: It's also a load-store RISC machine. But it does have a selection of addressing modes, including scaled index using the barrel shifter. So instead of needing a separate shift / add for each array index, you'd use LDR R0, [R1, R2, LSL #2], add r0, r0, #1 / str with the same addressing mode.
Often when looping through an array, it is best to just increment pointers on x86. But it's also an option to use an index, especially for loops with multiple arrays using the same index, like C[i] = A[i] + B[i]. Indexed addressing mode can sometimes be slightly less efficient in hardware, though, so when a compiler is unrolling a loop it usually should use pointers, even though it has to increment all 3 pointers separately instead of one index.
The point of instruction-set design is not merely to be Turing complete, it's to enable efficient code that gets more work done with fewer clock cycles and/or smaller code-size, or give programmers the option of aiming for either of those goals.
The minimum threshold for a computer to be programmable is extremely low, see for example various One instruction set computer architectures. (None implemented for real, just designed on paper to show that it's possible to write programs with nothing but a subtract-and-branch-if-less-than-zero instruction, with memory operands encoded in the instruction.
There's a tradeoff between easy to decode (especially to decode in parallel) vs. compact. x86 is horrible because it evolved as a series of extensions, often without a lot of planning to leave room for future extensions. If you're interested in ISA design decisions, have a look at Agner Fog's blog for interesting discussion about designing an ISA for high-performance CPUs that combines the best of x86 (lots of work with one instruction, e.g. memory operand as part of an ALU instruction) with the best features of RISC (easy to decode, lots of registers): Proposal for an ideal extensible instruction set.
There's also a tradeoff in how you spend the bits in an instruction word, especially in a fixed instruction width ISA like most RISCs. Different ISAs made different choices.
PowerPC uses lots of the coding space for powerful bitfield instructions like rlwinm (rotate left and mask off a window of bits), and lots of opcodes. IDK if the generally unpronounceable and hard-to-remember mnemonics are related to that...
ARM uses the high 4 bits for predicated execution of any instruction based on condition codes. It uses more bits for the barrel shifter (the 2nd source operand is optionally shifted or rotated by an immediate or a count from another register).
MIPS has relatively large immediate operands, and is basically simple.
x86 32/64-bit addressing modes use a variable-length encoding, with an extra byte SIB (scale/index/base) byte when there's an index, and an optional disp8 or disp32 immediate displacement. (e.g. add esi, [rax + rdx + 12340] takes 2 + 1 + 4 bytes to encode, vs. 2 bytes for add esi, [rax].
x86 16-bit addressing modes are much more limited, and pack everything except the optional disp8/disp16 displacement into the ModR/M byte.
Suppose we have a instruction INC AC. Is the address of AC specified in the instruction or is there a special opcode which means 'INC AC' and we don't include the address of AC (accumulator)?
Yes, the machine-code format for some instructions in some ISAs includes implicit operands. Many machines have push / pop instructions that implicitly use a specific register as the stack pointer. For example, in x86-64's push rax, RAX is an explicit register operand (encoded in the low 3 bits of the one-byte opcode using the push r64 short form), while RSP is an implicit operand.
Older 8-bit CPUs often had instructions like DECA (to decrement the accumulator, A). i.e. there was a specific opcode for that register. This could be the same thing as having a DEC instruction with some bits in the opcode byte specifying which register (like x86 does before x86-64 repurposed the short INC/DEC encodings as REX prefixes: note the "N.E" (Not Encodeable) in the 64-bit mode column for dec r32). But if there's no regular pattern then it can definitely be considered an implicit operand.
Sometimes putting things into neat categories breaks down, so don't worry too much about whether using bits with the opcode byte counts as implicit or explicit for x86. It's a way of spending more opcode space to save code-size for commonly used instructions while still allowing use with different registers.
Some ISAs only use a certain register as the stack pointer by convention, with no implicit uses. MIPS is like this.
ARM32 (in ARM, not Thumb mode) also uses explicit operands in push/pop. Its push/pop mnemonics are just aliases for store-multiple decrement-before / load-multiple increment-after (LDMIA / STMDB) to implement a full-descending stack. See ARM's docs for LDM/STM which explains this, and what you can do with the general case of these instructions, e.g. LDMDB to decrement a pointer and then load (in the opposite direction of POP).

What is the advantage of having instructions in a uniform format?

Many processors have instructions which are of uniform format and width such as the ARM where all instructions are 32-bit long. other processors have instructions in multiple widths of say 2, 3, or 4 bytes long, such as 8086.
What is the advantage of having all instructions the same width and in a uniform format?
What is the advantage of having instructions in multiple widths?
Fixed Length Instruction Trade-offs
The advantages of fixed length instructions with a relatively uniform formatting is that fetching and parsing the instructions is substantially simpler.
For an implementation that fetches a single instruction per cycle, a single aligned memory (cache) access of the fixed size is guaranteed to provide one (and only one) instruction, so no buffering or shifting is required. There is also no concern about crossing a cache line or page boundary within a single instruction.
The instruction pointer is incremented by a fixed amount (except when executing control flow instructions--jumps and branches) independent of the instruction type, so the location of the next sequential instruction can be available early with minimal extra work (compared to having to at least partially decode the instruction). This also makes fetching and parsing more than one instruction per cycle relatively simple.
Having a uniform format for each instruction allows trivial parsing of the instruction into its components (immediate value, opcode, source register names, destination register name). Parsing out the source register names is the most timing critical; with these in fixed positions it is possible to begin reading the register values before the type of instruction has been determined. (This register reading is speculative since the operation might not actually use the values, but this speculation does not require any special recovery in the case of mistaken speculation but does take extra energy.) In the MIPS R2000's classic 5-stage pipeline, this allowed reading of the register values to be started immediately after instruction fetch providing half of a cycle to compare register values and resolve a branch's direction; with a (filled) branch delay slot this avoided stalls without branch prediction.
(Parsing out the opcode is generally a little less timing critical than source register names, but the sooner the opcode is extracted the sooner execution can begin. Simple parsing out of the destination register name makes detecting dependencies across instructions simpler; this is perhaps mainly helpful when attempting to execute more than one instruction per cycle.)
In addition to providing the parsing sooner, simpler encoding makes parsing less work (energy use and transistor logic).
A minor advantage of fixed length instructions compared to typical variable length encodings is that instruction addresses (and branch offsets) use fewer bits. This has been exploited in some ISAs to provide a small amount of extra storage for mode information. (Ironically, in cases like MIPS/MIPS16, to indicate a mode with smaller or variable length instructions.)
Fixed length instruction encoding and uniform formatting do have disadvantages. The most obvious disadvantage is relatively low code density. Instruction length cannot be set according to frequency of use or how much distinct information is required. Strict uniform formatting would also tend to exclude implicit operands (though even MIPS uses an implicit destination register name for the link register) and variable-sized operands (most RISC variable length encodings have short instructions that can only access a subset of the total number of registers).
(In a RISC-oriented ISA, this has the additional minor issue of not allowing more work to be bundled into an instruction to equalize the amount of information required by the instruction.)
Fixed length instructions also make using large immediates (constant operands included in the instruction) more difficult. Classic RISCs limited immediate lengths to 16-bits. If the constant is larger, it must either be loaded as data (which means an extra load instruction with its overhead of address calculation, register use, address translation, tag check, etc.) or a second instruction must provide the rest of the constant. (MIPS provides a load high immediate instruction, partially under the assumption that large constants are mainly used to load addresses which will later be used for accessing data in memory. PowerPC provides several operations using high immediates, allowing, e.g., the addition of a 32-bit immediate in two instructions.) Using two instructions is obviously more overhead than using a single instruction (though a clever implementation could fuse the two instructions in the front-end [What Intel calls macro-op fusion]).
Fixed length instructions also makes it more difficult to extend an instruction set while retaining binary compatibility (and not requiring addition modes of operation). Even strictly uniform formatting can hinder extension of an instruction set, particularly for increasing the number of registers available.
Fujitsu's SPARC64 VIIIfx is an interesting example. It uses a two-bit opcode (in its 32-bit instructions) to indicate a loading of a special register with two 15-bit instruction extensions for the next two instructions. These extensions provide extra register bits and indication of SIMD operation (i.e., extending the opcode space of the instruction to which the extension is applied). This means that the full register name of an instruction not only is not entirely in a fixed position, but not even in the same "instruction". (Similarities to x86's REX prefix--which provides bits to extend register names encoded in the main part of the instruction--might be noted.)
(One aspect of fixed length encodings is the tyranny of powers of two. Although it is possible to used non-power-of-two instruction lengths [Tensilica's XTensa now has fixed 24-bit instructions as its base ISA--with 16-bit short instruction support being an extension, previously they were part of the base ISA; IBM had an experimental ISA with 40-bit instructions.], such adds a little complexity. If one size, e.g., 32bits, is a little too short, the next available size, e.g., 64 bits, is likely to be too long, sacrificing too much code density.)
For implementations with deep pipelines the extra time required for parsing instructions is less significant. The extra dynamic work done by hardware and the extra design complexity are reduced in significance for high performance implementations which add sophisticated branch prediction, out-of-order execution, and other features.
Variable Length Instruction Trade-offs
For variable length instructions, the trade-offs are essentially reversed.
Greater code density is the most obvious advantage. Greater code density can improve static code size (the amount of storage needed for a given program). This is particularly important for some embedded systems, especially microcontrollers, since it can be a large fraction of the system cost and influence the system's physical size (which has impact on fitness for purpose and manufacturing cost).
Improving dynamic code size reduces the amount of bandwidth used to fetch instructions (both from memory and from cache). This can reduce cost and energy use and can improve performance. Smaller dynamic code size also reduces the size of caches needed for a given hit rate; smaller caches can use less energy and less chip area and can have lower access latency.
(In a non- or minimally pipelined implementation with a narrow memory interface, fetching only a portion of an instruction in a cycle in some cases does not hurt performance as much as it would in a more pipelined design less limited by fetch bandwidth.)
With variable length instructions, large constants can be used in instructions without requiring all instructions to be large. Using an immediate rather than loading a constant from data memory exploits spatial locality, provides the value earlier in the pipeline, avoids an extra instruction, and removed a data cache access. (A wider access is simpler than multiple accesses of the same total size.)
Extending the instruction set is also generally easier given support for variable length instructions. Addition information can be included by using extra long instructions. (In the case of some encoding techniques--particularly using prefixes--, it is also possible to add hint information to existing instructions allowing backward compatibility with additional new information. x86 has exploited this not only to provide branch hints [which are mostly unused] but also the Hardware Lock Elision extension. For a fixed length encoding, it would be difficult to choose in advance which operations should have additional opcodes reserved for possible future addition of hint information.)
Variable length encoding clearly makes finding the start of the next sequential instruction more difficult. This is somewhat less of a problem for implementations
that only decode one instruction per cycle, but even in that case it adds extra
work for the hardware (which can increase cycle time or pipeline length as well as use more energy). For wider decode several tricks are available to reduce the cost of parsing out individual instructions from a block of instruction memory.
One technique that has mainly been used microarchitecturally (i.e., not included in the interface exposed to software but only an implementation technique) is to use marker bits to indicate the start or end of an instruction. Such marker bits would be set for each parcel of instruction encoding and stored in the instruction cache. Such delays the availability of such information on a instruction cache miss, but this delay is typically small compared to the ordinary delay in filling a cache miss. The extra (pre)decoding work is only needed on a cache miss, so time and energy is saved in the common case of a cache hit (at the cost of some extra storage and bandwidth which has some energy cost).
(Several AMD x86 implementations have used marker bit techniques.)
Alternatively, marker bits could be included in the instruction encoding. This places some constrains on opcode assignment and placement since the marker bits effectively become part of the opcode.
Another technique, used by the IBM zSeries (S/360 and descendants), is to encode the instruction length in a simple way in the opcode in the first parcel. The zSeries uses two bits to encode three different instruction lengths (16, 32, and 48 bits) with two encodings used for 16 bit length. By placing this in a fixed position, it is relatively easy to quickly determine where the next sequential instruction begins.
(More aggressive predecoding is also possible. The Pentium 4 used a trace cache containing fixed-length micro-ops and recent Intel processors use a micro-op cache with [presumably] fixed-length micro-ops.)
Obviously, variable length encodings require addressing at the granularity of a parcel which is typically smaller than an instruction for a fixed-length ISA. This means that branch offsets either lose some range or must use more bits. This can be compensated by support for more different immediate sizes.
Likewise, fetching a single instruction can be more complex since the start of the instruction is likely to not be aligned to a larger power of two. Buffering instruction fetch reduces the impact of this, but adds (trivial) delay and complexity.
With variable length instructions it is also more difficult to have uniform encoding. This means that part of the opcode must often be decoded before the basic parsing of the instruction can be started. This tends to delay the availability of register names and other, less critical information. Significant uniformity can still be obtained, but it requires more careful design and weighing of trade-offs (which are likely to change over the lifetime of the ISA).
As noted earlier, with more complex implementations (deeper pipelines, out-of-order execution, etc.), the extra relative complexity of handling variable length instructions is reduced. After instruction decode, a sophisticated implementation of an ISA with variable length instructions tends to look very similar to one of an ISA with fixed length instructions.
It might also be noted that much of the design complexity for variable length instructions is a one-time cost; once an organization has learned techniques (including the development of validation software) to handle the quirks, the cost of this complexity is lower for later implementations.
Because of the code density concerns for many embedded systems, several RISC ISAs provide variable length encodings (e.g., microMIPS, Thumb2). These generally only have two instruction lengths, so the additional complexity is constrained.
Bundling as a Compromise Design
One (sort of intermediate) alternative chosen for some ISAs is to use a fixed length bundle of instructions with different length instructions. By containing instructions in a bundle, each bundle has the advantages of a fixed length instruction and the first instruction in each bundle has a fixed, aligned starting position. The CDC 6600 used 60-bit bundles with 15-bit and 30-bit operations. The M32R uses 32-bit bundles with 16-bit and 32-bit instructions.
(Itanium uses fixed length power-of-two bundles to support non-power of two [41-bit] instructions and has a few cases where two "instructions" are joined to allow 64-bit immediates. Heidi Pan's [academic] Heads and Tails encoding used fixed length bundles to encode fixed length base instruction parts from left to right and variable length chunks from right to left.)
Some VLIW instruction sets use a fixed size instruction word but individual operation slots within the word can be a different (but fixed for the particular slot) length. Because different operation types (corresponding to slots) have different information requirements, using different sizes for different slots is sensible. This provides the advantages of fixed size instructions with some code density benefit. (In addition, a slot might be allocated to optionally provide an immediate to one of the operations in the instruction word.)