While building my assembler for the x86 platform I encountered some problems with encoding the JMP instruction:
OPCODE INSTRUCTION SIZE
EB cb JMP rel8 2
E9 cw JMP rel16 4 (because of 0x66 16-bit prefix)
E9 cd JMP rel32 5
...
(from my favourite x86 instruction website, http://siyobik.info/index.php?module=x86&id=147)
All are relative jumps, where the size of each encoding (operation + operand) is in the third column.
Now my original (and therefore faulty) design reserved the maximum space (5 bytes) for every jump instruction. The operand is not yet known, because it's a jump to a yet-unknown location. So I implemented a "rewrite" mechanism that patches the operand at the correct location in memory once the jump target is known, and fills the remaining bytes with NOPs. That NOP padding is a somewhat serious concern in tight loops.
Now my problem is with the following situation:
b: XXX
c: JMP a
e: XXX
...
XXX
d: JMP b
a: XXX (where XXX is any instruction, depending
on the to-be assembled program)
The problem is that I want the smallest possible encoding for a JMP instruction (and no NOP filling).
I have to know the size of the instruction at c before I can calculate the relative distance between a and b for the operand at d. The same applies for the JMP at c: it needs to know the size of d before it can calculate the relative distance between e and a.
How do existing assemblers solve this problem, or how would you do this?
Here is what I'm thinking, which would solve the problem:
First encode all the instructions between the JMP and its target; if this region contains a variable-sized opcode, use the maximum size, e.g. 5 for a JMP. Then encode the relative JMP to its target by choosing the smallest possible encoding size (2, 4 or 5, per the table above) and calculating the distance. Whenever a variable-sized opcode shrinks, adjust all absolute operands before it, and re-encode all relative instructions that skip over it, again choosing the smallest possible size when their operand changes. This method is guaranteed to terminate, as variable-sized opcodes may only shrink (because we start from their maximum size).
I wonder whether this is an over-engineered solution; that's why I'm asking this question.
In the first pass you will have a very good approximation of which JMP encoding to use, by doing pessimistic byte counting for all jump instructions.
On the second pass you can fill in the jumps with the pessimistic opcode chosen. Very few jumps could then be rewritten to use a byte or two less; only those that were originally very close to the 8/16-bit or 16/32-bit jump thresholds. As the candidates are all jumps of many bytes, they are less likely to be in critical loop situations, so you are likely to find that further passes offer little or no benefit over a two-pass solution.
Here's one approach I've used that may seem inefficient but turns out not to be for most real-life code (pseudo-code):
IP := 0;
do
{
    done := true;
    while (IP < length)
    {
        if Instr[IP] is a jump
            if backwards
            {   Target known:
                Encode short/long as needed   }
            else
            {   Target unknown:
                if (!marked as needing long encoding)   // see below
                    Encode short
                Record location for fixup   }
        IP++;
    }
    foreach Fixup do
        if jump distance > short range
        {
            Mark jump location as requiring long encoding
            IP := FixupLocation;   // restart at the instruction that needs a size change
            done := false;
            break;   // out of foreach Fixup
        }
        else
            Encode jump
} while (!done);
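If it helps, here is a compilable C sketch of the same fixed-point idea (the Instr record and the two jump sizes are assumptions for illustration, not a real assembler's data structures): start every jump short, grow any whose displacement no longer fits in rel8, and repeat until nothing changes, since growing one jump can push another out of range.

#include <stdbool.h>
#include <stdlib.h>

enum { SHORT_JMP = 2, LONG_JMP = 5 };      /* EB cb vs. E9 cd */

typedef struct {
    bool   is_jump;   /* true if this is a JMP */
    size_t target;    /* index of the target instruction */
    int    size;      /* current encoded size in bytes */
} Instr;

/* Sizes only ever grow, so the loop terminates. */
void relax_jumps(Instr *code, size_t n)
{
    long *offset = malloc((n + 1) * sizeof *offset);
    if (!offset)
        return;
    bool changed;
    do {
        changed = false;
        offset[0] = 0;                     /* byte offset of each instruction */
        for (size_t i = 0; i < n; i++)
            offset[i + 1] = offset[i] + code[i].size;
        for (size_t i = 0; i < n; i++) {
            if (!code[i].is_jump || code[i].size == LONG_JMP)
                continue;
            /* rel8 is relative to the end of the jump instruction */
            long disp = offset[code[i].target] - offset[i + 1];
            if (disp < -128 || disp > 127) {
                code[i].size = LONG_JMP;   /* doesn't fit: use E9 rel32 */
                changed = true;
            }
        }
    } while (changed);
    free(offset);
}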
In a typical processor's PSW, the zero flag is not implemented; however, it has Carry, Sign, Parity, and Overflow flags. In this architecture, how would a programmer implement JZ (jump on zero)?
You can't implement JZ after any arbitrary instruction like add reg, reg if there is no zero flag: none of the other flags carries the same information. E.g. in 8-bit arithmetic, -128 + -128 overflows and carries to 0, but you can't distinguish that from -128 + -1, which overflows / carries to 127. There are various other combinations you can't distinguish even with the help of SF and PF.
That's why we have a Zero Flag in normal ISAs, including 8080 or x86 whose flags and mnemonics you're using.
Did you actually just want to emulate x86 test eax,eax / jz or ARM cbz reg, target (conditional-branch on a register being zero) to test a register and jump if it was zero?
Note that 0 is the only number unsigned-below 1, so you can cmp / jnc. This looks like homework so I'm not going to spell it out more than that.
Or do what MIPS does, and provide an instruction like beq $reg, $reg, target that you can use to compare-and-branch on any pair of regs. (MIPS doesn't have a PSW / FLAGS at all). MIPS has an architectural zero register that always reads as zero, so you can always branch on any other register being zero with one machine instruction.
ARM Thumb, and AArch64, provide a limited version of that: cbz and cbnz that compare/branch on a single register being zero or non-zero, separate from the ARM flags register.
But really, if you're going to have a FLAGS / PSW register at all, implement a zero flag. That's one of the most basic useful things. Although to be fair, a carry flag is even more critical: if you could only have one flag, it would probably be carry, because you can still test for zero efficiently. Signed compares for greater or less (which normally use SF and OF) are harder to emulate, though.
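As a concrete sketch of that carry-only zero test, here is a hedged C model (the flag behaviour below is an assumption: cmp a, b is modeled as setting carry iff a is unsigned-below b):

#include <stdbool.h>
#include <stdint.h>

/* Assumed flag model: "cmp a, b" computes a - b and sets carry on borrow,
 * i.e. carry = (a < b) as an unsigned compare. */
static bool carry_after_cmp(uint8_t a, uint8_t b)
{
    return a < b;
}

/* JZ on a machine with only a carry flag: 0 is the only value
 * unsigned-below 1, so "cmp reg, 1 / jc target" is taken exactly
 * when reg was zero. */
bool jz_taken(uint8_t reg)
{
    return carry_after_cmp(reg, 1);
}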
For a Mandelbrot generator I want to use fixed-point arithmetic going from 32 up to maybe 1024 bits as you zoom in.
Now, normally SSE or AVX is no help there due to the lack of add-with-carry, and doing normal integer arithmetic is faster. But in my case I have literally millions of pixels that all need to be computed. So I have a huge vector of values that all need to go through the same iterative formula over and over, a million times too.
So I'm not looking at doing a fixed-point add/sub/mul on single values, but at doing it on huge vectors. My hope is that for such vector operations AVX/AVX2 can still be utilized to improve the performance despite the lack of native add-with-carry.
Does anyone know of a library for fixed-point arithmetic on vectors, or some example code showing how to emulate add-with-carry on AVX/AVX2?
FP extended precision gives more bits per clock cycle (because double FMA throughput is 2/clock vs. 32x32=>64-bit at 1 or 2/clock on Intel CPUs); consider using the same tricks that Prime95 uses with FMA for integer math. With care it's possible to use FPU hardware for bit-exact integer work.
For your actual question: since you want to do the same thing to multiple pixels in parallel, probably you want to do carries between corresponding elements in separate vectors, so one __m256i holds 64-bit chunks of 4 separate bigintegers, not 4 chunks of the same integer.
Register pressure is a problem for very wide integers with this strategy. Perhaps you can usefully branch on there being no carry propagation past the 4th or 6th vector of chunks, or something, by using vpmovmskb on the compare result to generate the carry-out after each add. An unsigned add has carry-out if a+b < a (unsigned compare).
But AVX2 only has signed integer compares (for greater-than), not unsigned. And with carry-in, (a+b+c_in) == a is possible with b=carry_in=0 or with b=0xFFF... and carry_in=1 so generating carry-out is not simple.
To solve both those problems, consider using chunks with manual wrapping to 60-bit or 62-bit or something, so they're guaranteed to be signed-positive and so carry-out from addition appears in the high bits of the full 64-bit element. (Where you can vpsrlq ymm, 62 to extract it for addition into the vector of next higher chunks.)
Maybe even 63-bit chunks would work here so carry appears in the very top bit, and vmovmskpd can check if any element produced a carry. Otherwise vptest can do that with the right mask.
This is a hand-wavy kind of brainstorm answer; I don't have any plans to expand it into a detailed answer. If anyone wants to write actual code based on this, please post your own answer so we can upvote that (if it turns out to be a useful idea at all).
Just for kicks, without claiming that this will actually be useful, you can extract the carry bit of an addition by just looking at the upper bits of the input and output values.
unsigned result = a + b + last_carry; // add a, b and the last carry (0 or 1)
unsigned carry = (a & b)              // carry if both a AND b have the upper bit set
               |                      // OR
               ((a ^ b)               // upper bits of a and b differ AND
               & ~result);            // the upper bit of the result is not set
carry >>= sizeof(unsigned)*8 - 1;     // shift the upper bit down to bit 0
With SSE2/AVX2 this can be implemented with two additions, 4 logic operations and one shift, and it works for arbitrary (supported) integer sizes (uint8, uint16, uint32, uint64). With AVX2 you'd need 7 uops to get 4 64-bit additions with carry-in and carry-out.
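An untested AVX2 sketch of that recipe, assuming the lane layout suggested above (each 64-bit lane holds a chunk of a different biginteger, with carries as per-lane 0/1 vectors):

#include <immintrin.h>

/* Per-lane 64-bit add with carry-in and carry-out.
 * carry = MSB of (a & b) | ((a ^ b) & ~sum), as in the scalar version;
 * the identity also holds with a carry-in folded into the sum. */
static inline __m256i add64_with_carry(__m256i a, __m256i b,
                                       __m256i carry_in, __m256i *carry_out)
{
    __m256i sum = _mm256_add_epi64(_mm256_add_epi64(a, b), carry_in);
    __m256i c   = _mm256_or_si256(
                      _mm256_and_si256(a, b),
                      _mm256_andnot_si256(sum, _mm256_xor_si256(a, b)));
    *carry_out = _mm256_srli_epi64(c, 63);  /* move the MSB down to bit 0 */
    return sum;                             /* 2 adds, 4 logic ops, 1 shift */
}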
Especially since multiplying 64x64-->128 is not possible either (it would require 4 32x32-->64 products and some additions, or 3 32x32-->64 products and even more additions, plus special-case handling), you will likely not be more efficient than with mul and adc (maybe unless register pressure is your bottleneck).
As Peter and Mystical suggested, working with smaller limbs (still stored in 64 bits) can be beneficial. On the one hand, with some trickery, you can use FMA for 52x52-->104 products. And also, you can actually add up to 2^k-1 numbers of 64-k bits before you need to carry the upper bits of the previous limbs.
Indexed addressing mode is usually used for accessing arrays, as arrays are stored contiguously. We have an index register which gets incremented in every iteration; when added to the base address, it gives the address of the array element.
I don't understand the actual need for this addressing mode. Why can't we do this with direct addressing? We have the base address and we can just add 1 to it every time we access the array. Why do we need an indexed addressing mode, which has the overhead of an index register?
I am not sure about the instruction format for implied addressing mode. Suppose we have an instruction INC AC. Is the address of AC specified in the instruction, or is there a special opcode which means 'INC AC' so we don't include the address of AC (the accumulator)?
I don't understand the actual need for this addressing mode. Why can't we do this with direct addressing?
You can; MIPS only has one addressing mode and compilers can still generate code for it just fine. But sometimes it has to use an extra shift + add instruction to calculate an address (if it's not just looping through an array).
The point of addressing modes is to save instructions and save registers, especially in 2-operand instruction sets like x86, where add eax, ecx overwrites eax with the result (eax += ecx), unlike MIPS or other 3-operand ISAs where addu $t2, $t1, $t0 does t2 = t1 + t0. On x86, that would require a copy (mov) and an add. (Or in that special case, lea edx, [eax+ecx]: x86 can copy-and-add (and shift) using the same instruction-encoding it uses for memory operands.)
Consider a histogram problem: you generate array indices in unpredictable order, and have to index an array. On x86-64, add dword [rbx + rdi*4], 1 will increment a 32-bit counter in memory using a single 4-byte instruction, which decodes to only 2 uops for the front-end to issue into the out-of-order core on modern Intel CPUs. (http://agner.org/optimize/). (rbx is the base register, rdi is a scaled index). Having a scaled index is very powerful; x86 16-bit addressing modes support 2 registers, but not a scaled index.
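For instance, here is a hedged C version of that histogram inner loop (a hypothetical function, just to show the access pattern a compiler maps onto that addressing mode):

#include <stddef.h>
#include <stdint.h>

/* counts[keys[i]]++ is an indexed access in unpredictable order; on
 * x86-64 a compiler can emit the increment as a single
 * add dword [rbx + rdi*4], 1 with a scaled-index addressing mode. */
void histogram(uint32_t *counts, const uint16_t *keys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        counts[keys[i]]++;
}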
Classic MIPS only has separate shift and add instructions, although MIPS32 did add a scaled-add instruction for address calculation. That would save an instruction here. Being a load-store machine, the loads and stores always have to be separate instructions (unlike on x86 where that add decodes as a micro-fused load+add and a store. See INC instruction vs ADD 1: Does it matter?).
Probably ARM would be a better comparison for MIPS: it's also a load-store RISC machine. But it does have a selection of addressing modes, including scaled index using the barrel shifter. So instead of needing a separate shift / add for each array index, you'd use LDR R0, [R1, R2, LSL #2] / ADD R0, R0, #1 / STR with the same addressing mode.
Often when looping through an array, it is best to just increment pointers on x86. But it's also an option to use an index, especially for loops with multiple arrays using the same index, like C[i] = A[i] + B[i]. Indexed addressing mode can sometimes be slightly less efficient in hardware, though, so when a compiler is unrolling a loop it usually should use pointers, even though it has to increment all 3 pointers separately instead of one index.
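The loop from that paragraph, as a C sketch (illustrative only): with indexed addressing one induction variable i serves all three arrays, while an unrolling compiler targeting x86 may prefer to advance three separate pointers because pointer-based addressing modes can be cheaper in hardware.

#include <stddef.h>

void vadd(int *restrict C, const int *restrict A,
          const int *restrict B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] + B[i];   /* one index, three bases, e.g.
                                 mov eax,[rsi+rcx*4] / add eax,[rdx+rcx*4]
                                 / mov [rdi+rcx*4],eax */
}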
The point of instruction-set design is not merely to be Turing complete, it's to enable efficient code that gets more work done with fewer clock cycles and/or smaller code-size, or give programmers the option of aiming for either of those goals.
The minimum threshold for a computer to be programmable is extremely low; see for example the various One Instruction Set Computer architectures. (None implemented for real, just designed on paper to show that it's possible to write programs with nothing but a subtract-and-branch-if-less-than-zero instruction, with memory operands encoded in the instruction.)
There's a tradeoff between easy to decode (especially to decode in parallel) vs. compact. x86 is horrible because it evolved as a series of extensions, often without a lot of planning to leave room for future extensions. If you're interested in ISA design decisions, have a look at Agner Fog's blog for interesting discussion about designing an ISA for high-performance CPUs that combines the best of x86 (lots of work with one instruction, e.g. memory operand as part of an ALU instruction) with the best features of RISC (easy to decode, lots of registers): Proposal for an ideal extensible instruction set.
There's also a tradeoff in how you spend the bits in an instruction word, especially in a fixed instruction width ISA like most RISCs. Different ISAs made different choices.
PowerPC uses lots of the coding space for powerful bitfield instructions like rlwinm (rotate left and mask off a window of bits), and lots of opcodes. IDK if the generally unpronounceable and hard-to-remember mnemonics are related to that...
ARM uses the high 4 bits for predicated execution of any instruction based on condition codes. It uses more bits for the barrel shifter (the 2nd source operand is optionally shifted or rotated by an immediate or a count from another register).
MIPS has relatively large immediate operands, and is basically simple.
x86 32/64-bit addressing modes use a variable-length encoding, with an extra SIB (scale/index/base) byte when there's an index, and an optional disp8 or disp32 immediate displacement (e.g. add esi, [rax + rdx + 12340] takes 2 + 1 + 4 bytes to encode, vs. 2 bytes for add esi, [rax]).
x86 16-bit addressing modes are much more limited, and pack everything except the optional disp8/disp16 displacement into the ModR/M byte.
Suppose we have an instruction INC AC. Is the address of AC specified in the instruction, or is there a special opcode which means 'INC AC' so we don't include the address of AC (the accumulator)?
Yes, the machine-code format for some instructions in some ISAs includes implicit operands. Many machines have push / pop instructions that implicitly use a specific register as the stack pointer. For example, in x86-64's push rax, RAX is an explicit register operand (encoded in the low 3 bits of the one-byte opcode using the push r64 short form), while RSP is an implicit operand.
Older 8-bit CPUs often had instructions like DECA (to decrement the accumulator, A). i.e. there was a specific opcode for that register. This could be the same thing as having a DEC instruction with some bits in the opcode byte specifying which register (like x86 does before x86-64 repurposed the short INC/DEC encodings as REX prefixes: note the "N.E" (Not Encodeable) in the 64-bit mode column for dec r32). But if there's no regular pattern then it can definitely be considered an implicit operand.
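To make the "register bits inside the opcode byte" idea concrete, here is a toy C sketch (a simplification: it ignores REX prefixes, which supply a 4th register-number bit in 64-bit mode):

#include <stdbool.h>
#include <stdint.h>

/* x86 "push r64" short form: one-byte opcode 0x50 + reg, with the
 * register number in the low 3 bits; RSP is an implicit operand. */
bool decode_push_r64(uint8_t opcode, unsigned *reg)
{
    if ((opcode & 0xF8) != 0x50)
        return false;         /* not the push r64 short form */
    *reg = opcode & 0x07;     /* 0 = rax, 1 = rcx, ..., 4 = rsp, ... */
    return true;
}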
Sometimes putting things into neat categories breaks down, so don't worry too much about whether using bits with the opcode byte counts as implicit or explicit for x86. It's a way of spending more opcode space to save code-size for commonly used instructions while still allowing use with different registers.
Some ISAs only use a certain register as the stack pointer by convention, with no implicit uses. MIPS is like this.
ARM32 (in ARM, not Thumb mode) also uses explicit operands in push/pop. Its push/pop mnemonics are just aliases for store-multiple decrement-before / load-multiple increment-after (STMDB / LDMIA) to implement a full-descending stack. See ARM's docs for LDM/STM, which explain this and what you can do with the general case of these instructions, e.g. LDMDB to decrement a pointer and then load (in the opposite direction of POP).
In the RISC-V Instruction Set Manual, User-Level ISA, I couldn't understand section 2.3 Immediate Encoding Variants page 11.
There are four types of instruction formats R, I, S, and U; then there are variants of the S and U types, namely SB and UJ, which I suppose mean Branch and Jump, as shown in figure 2.3. Then there are the types of immediate produced by RISC-V instructions, shown in figure 2.4.
So my questions are: why are the SB and UJ variants needed? Why shuffle the immediate bits in that way? What does it mean to say "the immediate produced by RISC-V instructions"? And how are they produced in this manner?
To speed up decoding, the base RISC-V ISA puts the most important fields in the same place in every instruction. As you can see in the instruction formats table,
The major opcode is always in bits 0-6.
The destination register, when present, is always in bits 7-11.
The first source register, when present, is always in bits 15-19.
The second source register, when present, is always in bits 20-24.
The other bits are used for the minor opcode or other data for the instruction (funct3 in bits 12-14 and funct7 in bits 25-31), and for the immediate. How many bits can be used for the immediate depends on how many register numbers are present in the instruction:
Instructions with one destination and two source registers (R-type) have no immediate, for instance adding two registers (ADD);
Instructions with one destination and one source register (I-type) have 12 bits for the immediate, for instance adding one register with an immediate (ADDI);
Instructions with two source registers and no destination register (S-type), for instance the store instructions, also have 12 bits for the immediate, but they have to be in a different place since the register numbers are also in a different place;
Finally, instructions with only a destination register and no minor opcode (U-type), for instance LUI, can use 20 bits for the immediate (the major opcode and the destination register number together need 12 bits).
Now think from the other point of view, of the instructions which will use these immediate values. The simplest users, I-immediate and S-immediate, need only a sign-extended 12-bit value. The U-immediate instructions need the immediate in the upper 20 bits of a 32-bit value. Finally, the branch/jump instructions need the sign-extended immediate in the lower bits of the value, except for the lowest bit which will always be zero, since RISC-V instructions are always aligned to even addresses.
But why are the immediate bits shuffled? Think this time about the physical circuit which decodes the immediate field. Since it's a hardware implementation, the bits will be decoded in parallel; each bit in the output immediate will have a multiplexer to select which input bit it comes from. The bigger the multiplexer, the costlier and slower it is.
The "shuffling" of the immediate bits in the instruction encoding, therefore, is to make each output immediate bit have as little input instruction bit options as possible. For instance, immediate bit 1 can only come from instruction bits 8 (S-immediate or B-immediate), 21 (I-immediate or J-immediate), or constant zero (U-immediate or R-type instruction which has no immediate). Immediate bit 0 can come from instruction bits 7 (S-immediate), 20 (I-immediate), or constant zero. Immediate bit 5 can only come from instruction bit 25 or constant zero. And so on.
Instruction bit 31 is a special case: for RV-64, bits 32-63 of the immediate are always copies of instruction bit 31. This high fan-out adds a delay, which would be even bigger if it also needed a multiplexer, so it only has one option (other than constant zero, which can be treated later in the pipeline by ignoring the whole immediate).
It's also interesting to note that only the major opcode (bits 0-6) is needed to know how to decode the immediate, so immediate decoding can be done in parallel with decoding the rest of the instruction.
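A small C sketch of that decoding, mirroring figure 2.4 for RV32 (a software rendering of the hardware muxes described above, not code from the spec):

#include <stdint.h>

/* Sign-extend the low 'bits' bits of x.
 * (Assumes arithmetic right shift, as on mainstream compilers.) */
static int32_t sext(uint32_t x, int bits)
{
    return (int32_t)(x << (32 - bits)) >> (32 - bits);
}

int32_t imm_i(uint32_t in) { return sext(in >> 20, 12); }   /* bits 31:20 */

int32_t imm_s(uint32_t in)
{
    return sext(((in >> 25) << 5)                 /* imm[11:5] <- bits 31:25 */
              | ((in >> 7) & 0x1f), 12);          /* imm[4:0]  <- bits 11:7  */
}

int32_t imm_b(uint32_t in)
{
    return sext(((in >> 31)          << 12)       /* imm[12]   <- bit  31    */
              | (((in >> 7)  & 1)    << 11)       /* imm[11]   <- bit  7     */
              | (((in >> 25) & 0x3f) << 5)        /* imm[10:5] <- bits 30:25 */
              | (((in >> 8)  & 0xf)  << 1), 13);  /* imm[4:1]  <- bits 11:8; imm[0] = 0 */
}

Note how every output bit draws from a fixed instruction bit, and how the sign always comes from instruction bit 31.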
So, answering the questions:
SB-type doubles the range of branches, since instructions are always aligned to even addresses;
UJ-type has the same overall instruction format as U-type, but the immediate value is in the lower bits instead of the upper bits;
The immediate bits are shuffled to reduce the cost of decoding the immediate value, by reducing the number of choices for each output immediate bit;
The "immediate produced by RISC-V instructions" table shows the different kinds of immediate values which can be decoded from a RISC-V instruction, and from where in the instruction each bit comes from;
They are produced by, for each output immediate bit, using the major opcode (bits 0-6) to chose an input instruction bit.
The encoding is done to try and make the actual hardware implementation as simple as possible, rather than make it easy for the reader to understand at a glance.
In practice the compiler will generate the output and so it does not matter if it is not easy for the user to understand.
When possible, the SB type tries to use the same bits for the same immediate bit positions as the S type; that minimizes the hardware design complexity. So imm[4:1] and imm[10:5] are in the same place for both. The topmost bit of the immediate value is always at position 31, so that bit can be used to decide whether sign extension is needed. Again, this makes the hardware easier, because for multiple types of instruction the top bit is used to decide on sign extension.
The RISC-V instruction encoding is chosen to simplify the decoder
2.2 Base Instruction Formats
The RISC-V ISA keeps the source (rs1 and rs2) and destination (rd) registers at the same position in all formats to simplify decoding. Except for the 5-bit immediates used in CSR instructions (Chapter 9), immediates are always sign-extended, and are generally packed towards the leftmost available bits in the instruction and have been allocated to reduce hardware complexity. In particular, the sign bit for all immediates is always in bit 31 of the instruction to speed sign-extension circuitry.
2.3 Immediate Encoding Variants
The only difference between the S and B formats is that the 12-bit immediate field is used to encode branch offsets in multiples of 2 in the B format. Instead of shifting all bits in the instruction-encoded immediate left by one in hardware as is conventionally done, the middle bits (imm[10:1]) and sign bit stay in fixed positions, while the lowest bit in S format (inst[7]) encodes a high-order bit in B format.
Similarly, the only difference between the U and J formats is that the 20-bit immediate is shifted left by 12 bits to form U immediates and by 1 bit to form J immediates. The location of instruction bits in the U and J format immediates is chosen to maximize overlap with the other formats and with each other.
https://riscv.org/technical/specifications/
The reason for the shuffling of the immediate in the SB/UJ formats has also been explained in the RISC-V spec:
Although more complex implementations might have separate adders for branch and jump calculations and so would not benefit from keeping the location of immediate bits constant across types of instruction, we wanted to reduce the hardware cost of the simplest implementations. By rotating bits in the instruction encoding of B and J immediates instead of using dynamic hardware muxes to multiply the immediate by 2, we reduce instruction signal fanout and immediate mux costs by around a factor of 2. The scrambled immediate encoding will add negligible time to static or ahead-of-time compilation. For dynamic generation of instructions, there is some small additional overhead, but the most common short forward branches have straightforward immediate encodings.
I have trouble understanding the following assembly code, which is used to add two integers using registers. It's not a very cumbersome question; I just lack any good reference for learning the syntax. If you can provide line-by-line insight, I would be extremely grateful.
MOV R1, #100
MOV R2, #100
MOV (R1), #50
ADD R2,(R1)
I get the first two lines, which store the number 100 in the given registers; I just don't get the purpose of the brackets in the next two lines.
And this is not homework, just a question to clarify the theory behind it.
The question is: what are the values of R1 and R2 after the instructions have been executed?
I found the following explanation on another website, which helped me a lot to understand the use of brackets. I believe it would be clarifying for other people too, so I will post it below:
Let's analyze this program:
MOV AX, 47104
MOV DS, AX
MOV [3998], 36
INT 32
... The first instruction, MOV AX, 47104, tells the computer to copy the number 47104 into the location AX. The next instruction, MOV DS, AX, tells the computer to copy the number in AX into the location DS. The next instruction, MOV [3998], 36 tells the computer to put the number 36 into memory location 3998. Finally, INT 32 exits the program by returning to the operating system.
Before we go on, I would like to explain just how this program works. Inside the CPU are a number of locations, called registers, which can store a number. Some registers, such as AX, are general purpose, and don't do anything special. Other registers, such as DS, control the way the CPU works.
DS just happens to be a segment register, and is used to pick which area of memory the CPU can write to. In our program, we put the number 47104 into DS, which tells the CPU to access the memory on the video card.
The next thing our program does is to put the number 36 into location 3998 of the video card's memory. Since 36 is the code for the dollar sign, and 3998 is the memory location of the bottom right hand corner of the screen, a dollar sign shows up on the screen a few microseconds later.
Finally, our program tells the CPU to perform what is called an interrupt. An interrupt is used to stop one program and execute another in its place. In our case, we want interrupt 32, which ends our program and goes back to MS-DOS, or whatever other program was used to start our program.
We can see from this example that the use of brackets resulted in writing a value to a memory location, and not into a register. Later, this value was read by the video card to display a symbol on the screen.
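Mapping that back to the original question, here is a C analogy (purely illustrative, treating memory as an array) which also answers what R1 and R2 hold afterwards:

#include <stdio.h>

int mem[256];                 /* pretend this is main memory */

int main(void)
{
    int R1 = 100;             /* MOV R1, #100  -> R1 = 100                 */
    int R2 = 100;             /* MOV R2, #100  -> R2 = 100                 */
    mem[R1] = 50;             /* MOV (R1), #50 -> store 50 AT ADDRESS R1   */
    R2 += mem[R1];            /* ADD R2, (R1)  -> R2 = 100 + 50 = 150      */
    printf("R1=%d R2=%d\n", R1, R2);   /* prints R1=100 R2=150 */
}

R1 is only ever used as an address, so it still holds 100; R2 ends up as 150.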
Credits to the writer on: http://www.swansontec.com/sprogram.html