Where in the fetch-execute cycle is a value specified via an addressing mode decoded? - cpu-architecture

I'm currently building a small CPU interpreter that supports several addressing modes, including register-deferred and displacement. It uses the classic IF-ID-EX-MEM-WB RISC pipeline. In what stage of the pipeline is the value for an addressing-mode operand decoded? For example:
addw r9, (r2), 8(r3)
In what stage would (r2) and 8(r3) be decoded into their actual values?

It's a funny question.
One property of RISC architectures is register-register operation. That is, all operands for computation instructions such as ADD must already be in registers. This enables RISC implementations to enjoy a regular pipeline such as the IF-ID-EX-MEM-WB pipeline you mention in your question. This constraint also simplifies memory accesses and exceptions. For example, if the only instructions to read data from memory are load instructions, and if these instructions have only a simple addressing mode like register+displacement, then a given instruction can incur at most one memory protection exception.
In contrast, CISC architectures typically permit rich operand addressing modes, such as register-deferred and displacement, as in your question. Implementations of these architectures often have an irregular pipeline, which may stall as one or more memory accesses are incurred before the operands are available for the computation (ADD etc.).
Nonetheless, microarchitects have successfully pipelined CISC architectures. For example, the Intel 486 had a pipeline that enabled operands and results to be read/written to memory. So when implementing ADD [eax],42, there was a pipeline stage to read [eax] from the 8 KB d-cache, a pipeline stage to perform the add, and another pipeline stage to write-back the sum to [eax].
Since CISC instruction and operand usage is dynamically quite mixed and irregular, your pipeline design would either have to be rather long to account for the worst case, e.g. multiple memory reads to access operands and a memory write to write-back a result, or it would have to stall the pipeline to insert the additional memory accesses when necessary.
So to accommodate these CISCy addressing modes, you might need an IF-ID-EA-RD1-RD2-EX-WR pipeline (EA = effective address, RD1 = read operand 1, RD2 = read operand 2, WR = write result to RAM or the register file).
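Since you're building an interpreter, here is a minimal software sketch of how those stages might fall out for your addw example. Everything here (the Cpu struct, the load32 helper, the 16-register file) is an illustrative assumption, not a real design:

#include <stdint.h>
#include <string.h>

typedef struct { uint32_t r[16]; uint8_t mem[1 << 16]; } Cpu;

static uint32_t load32(Cpu *c, uint32_t ea) {
    uint32_t v;
    memcpy(&v, &c->mem[ea], sizeof v);   /* bounds checks omitted in this sketch */
    return v;
}

void addw_example(Cpu *c) {
    uint32_t ea1 = c->r[2];          /* EA:  register-deferred (r2)       */
    uint32_t ea2 = c->r[3] + 8;      /* EA:  displacement 8(r3)           */
    uint32_t op1 = load32(c, ea1);   /* RD1: first memory read            */
    uint32_t op2 = load32(c, ea2);   /* RD2: second memory read           */
    c->r[9] = op1 + op2;             /* EX:  add; WR: write result to r9  */
}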
Happy hacking.

As Jan Gray pointed out, the CISC instruction you mention, addw r9, (r2), 8(r3), does not map directly onto an IF-ID-EX-MEM-WB RISC pipeline.
But rather than creating an IF-ID-EA-RD1-RD2-EX-WR pipeline (which I don't think works for this case anyway, at least not in my notation), you might also consider breaking the CISC instruction up into RISC-like microinstructions:
tmp1 := load Memory[ r2 ]
tmp2 := load Memory[ 8+r3 ]
r9 := addw tmp1 + tmp2
With this uop (micro-operation) decomposition, the address computations (r2) and 8(r3) would be done in their respective EX pipestages, and the actual memory accesses in and around the MEM pipestage.
As Jan mentions, the i486 had a different pipeline, a so-called load-op pipeline: IF-ID-AGU-MEM-EX-WB, where AGU is the address generation unit / pipestage.
This permits a different uop decomposition
tmp1 := load Memory[ r2 ]
r9 := addw tmp1 + load Memory[ 8+r3 ]
with the address computations (r2) and 8(r3) done in the AGU pipestage.
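To make the decomposition concrete in an interpreter, the decoder could emit the uops as data; each uop then flows through the regular pipeline on its own. This is a sketch with invented types, and the temp registers tmp1/tmp2 arbitrarily mapped to r10/r11:

#include <stdint.h>

typedef enum { UOP_LOAD, UOP_ADD } UopKind;

typedef struct {
    UopKind kind;
    int dst, src1, src2;  /* register numbers, temps included */
    int32_t disp;         /* displacement for loads */
} Uop;

/* addw r9, (r2), 8(r3) as three RISC-like uops: */
static const Uop addw_uops[] = {
    { UOP_LOAD, 10, 2,  0,  0 },   /* tmp1 := load Memory[ r2 ]   (EX computes r2+0) */
    { UOP_LOAD, 11, 3,  0,  8 },   /* tmp2 := load Memory[ 8+r3 ] (EX computes r3+8) */
    { UOP_ADD,   9, 10, 11, 0 },   /* r9   := addw tmp1 + tmp2    (plain EX add)     */
};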

As Jan Gray mentioned above, the instruction you are trying to execute doesn't really work for this pipeline. You would need to load the data in the MEM stage, yet do the arithmetic on it in the EX stage, which comes before MEM.
To answer another related question though, if you wanted to do:
Load R9, 8(R3)
The value of the address-mode operand, i.e. the effective address 8+R3, is calculated in the EX stage.
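A minimal sketch of that split, with invented function names standing in for the EX and MEM pipestages:

#include <stdint.h>
#include <string.h>

uint32_t ex_stage(const uint32_t regs[], int base, int32_t disp) {
    return regs[base] + disp;          /* EX: effective address = R3 + 8 */
}

uint32_t mem_stage(const uint8_t *mem, uint32_t ea) {
    uint32_t v;
    memcpy(&v, &mem[ea], sizeof v);    /* MEM: data read */
    return v;                          /* WB then writes v into R9 */
}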

Related

Why does the immediate offset in RISC-V's JAL instruction have its bit order changed?

The bit field (per the RISC-V spec) packs the JAL immediate as imm[20|10:1|11|19:12], followed by rd and the opcode.
I don't see the point of doing this re-ordering of the bit field.
Is there a special kind of manipulation when a RISC-V processor is executing this instruction?
The purpose of the shuffling is to reduce the number of muxes involved in constructing the full-sized operand from the immediates across the different instruction types.
For example, the sign-extend bit (which drives a lot of wires) always comes from the same place (inst[31]). You can also see that imm[10] is almost always in the same place too, across I-type, S-type, B-type, and J-type instructions.
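To see the layout in action, here is a sketch of decoding the J-type immediate in C; the bit positions follow the imm[20|10:1|11|19:12] layout in the spec, while the function itself is just an illustration (it assumes an arithmetic right shift for the sign extension, as on typical targets):

#include <stdint.h>

int32_t decode_j_imm(uint32_t inst) {
    uint32_t imm = ((inst >> 31) & 0x1)   << 20    /* imm[20]    = inst[31]    */
                 | ((inst >> 21) & 0x3FF) << 1     /* imm[10:1]  = inst[30:21] */
                 | ((inst >> 20) & 0x1)   << 11    /* imm[11]    = inst[20]    */
                 | ((inst >> 12) & 0xFF)  << 12;   /* imm[19:12] = inst[19:12] */
    return ((int32_t)(imm << 11)) >> 11;           /* sign-extend from bit 20,
                                                      which is always inst[31] */
}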

Query on various addressing modes?

Can an array be implemented using only the indirect addressing mode? I think we can only access the first element, but what about the other elements? For those, I think, we'll have to use the immediate addressing mode.
An add instruction can generate an address in a register.
A CPU with only [register] addressing modes would work, but would need more instructions than one with an immediate displacement as part of its load/store instructions.
Instruction set design isn't about what's necessary for computation to be possible, but rather about how to make it efficient.
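For instance, when compiled for a machine with only register-indirect loads, an array walk simply regenerates each address with an ordinary add; a rough C equivalent of the idea:

/* Summing an array using only register-indirect loads: the pointer
   lives in a register, and each element's address is produced by an
   ordinary add (the pointer bump), not by a fancier addressing mode. */
int sum(const int *p, int n) {
    int total = 0;
    while (n-- > 0) {
        total += *p;   /* load via [p]: plain register-indirect */
        p = p + 1;     /* an add generates the next element's address */
    }
    return total;
}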
related:
Why does the lw instruction's second argument take in both an offset and regSource?
What is the minimum instruction set required for any Assembly language to be considered useful? (note the difference between useful and Turing-complete.)

Difference between branching and select instructions

The 'select' instruction is used to choose one value based on a condition, without branching.
I want to know the differences between branching and select instructions (preferably for both x86 architectures and PTX). As far as I know, select performs better than branching instructions, but I don't have a clear picture.
Branching is a general-purpose mechanism used to redirect control flow. It is used to implement most forms of the if statement (when specific optimizations don't apply).
Selection is a specialized instruction available on some instruction sets which can implement some forms of the conditional expression
z = (cond) ? x : y;
or
if(cond) z = x;
provided that x and y are plain values (if they were expressions, they would both have to be computed before the select, which might incur performance penalties or incorrect side-effect evaluation). Such an instruction is necessarily more limited than branching, but has the distinct advantage that the instruction pointer doesn't change. As a result, the processor never needs to flush its pipeline for a misprediction (since there is no branch to mispredict). Because of this, a select instruction (where available) is often faster, particularly when the branch would be poorly predicted.
On SIMT architectures such as NVIDIA GPUs programmed with CUDA, branches can be very expensive performance-wise because the parallel units must remain synchronized. On CUDA, for example, every thread in a warp must take the same execution path; if threads diverge, the warp steps through both paths (with the threads on the not-taken path executing no-operations). A select instruction, however, doesn't incur this kind of penalty.
Note that most compilers will, with suitable options, generate 'select'-style instructions like cmov if given a simple-enough if statement. Also, in some cases, it is possible to use bitwise manipulation or logical operations to combine a boolean conditional with expression values without performing a branch.
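To illustrate the last point, here is a sketch of a branch-free select built from bitwise operations; the function name is mine, and a compiler may well emit cmov for the plain ?: form anyway:

#include <stdint.h>

/* Branch-free select: returns x when cond is nonzero, otherwise y. */
int32_t select_branchless(int cond, int32_t x, int32_t y) {
    int32_t mask = -(int32_t)(cond != 0);   /* all ones if cond, else all zeros */
    return (x & mask) | (y & ~mask);        /* choose without changing the IP */
}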

Can 'mov' instructions, which do not require any offset/displacement added to them, be executed without any assistance from the ALU?

I've recently started exploring the field of computer architecture. While studying the instruction set architecture, I came across the 'mov' instruction, which copies data from one location to another. I understand that some types of 'mov' instruction are conditional, while some need an offset or displacement added to compute a particular address, and hence need ALU assistance, e.g. base-plus-index, register-relative, base-relative-plus-index, scaled-index, etc.
I was wondering if it is possible to bypass the ALU for those 'mov' instructions (e.g. register-to-register data transfers) which do not require any ALU assistance.
Yes: an instruction that doesn't require any arithmetic to be performed doesn't require the assistance of the ALU.
It does, though, still require the "intervention of the microprocessor"; the registers, program counter, and instruction fetch/decode/execute pipeline are all part of the CPU.
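A toy software model of the distinction, with all names invented for illustration:

#include <stdint.h>
#include <string.h>

typedef struct { uint32_t regs[16]; uint8_t mem[1 << 16]; } Cpu;

void mov_reg_reg(Cpu *c, int dst, int src) {
    c->regs[dst] = c->regs[src];            /* pure register-file transfer: no ALU */
}

void mov_base_disp(Cpu *c, int dst, int base, int32_t disp) {
    uint32_t ea = c->regs[base] + disp;     /* the ALU/AGU must form the address */
    memcpy(&c->regs[dst], &c->mem[ea], 4);  /* before memory can be read */
}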

Is single float assignment an atomic operation on the iPhone?

I assume that on a 32-bit device like the iPhone, assigning to a float is an atomic, thread-safe operation. I want to make sure it is. I have a C function that I want to call from an Objective-C thread, and I don't want to acquire a lock before calling it:
float globalFloat;   /* shared global */

void setFloatValue(float value) {
    globalFloat = value;
}
In 32-bit ARM, the function above will be compiled to
ldr r2, [pc, #0x??] ; to retrieve the address of globalFloat
str r0, [r2] ; store value into globalFloat
There are 2 instructions, and the CPU is free to perform anything between them, but only the second instruction, str r0, [r2], affects memory. Unless globalFloat is unaligned, the CPU performs a single-copy atomic write to it.
The assignment can be non-atomic when the global variable is unaligned. It is also non-atomic if you are writing to a larger structure, e.g. a CGRect.
Being atomic is not enough for thread safety. Due to caching and instruction reordering, your change may not be visible to other CPU cores. You may need to insert an OSMemoryBarrier() to "publish" the change.
Atomic operations are usually interesting when compound operations are involved (e.g. globalFloat += value). You may want to check out the built-in OSAtomic library for that.
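If your toolchain supports C11 atomics, one alternative is to declare the global _Atomic, which makes both the atomicity and the publication explicit. This is a sketch, not a drop-in for the original code:

#include <stdatomic.h>

_Atomic float globalFloat;

void setFloatValue(float value) {
    /* release store: the write is atomic and becomes visible to an
       acquire load on another thread, subsuming the manual barrier */
    atomic_store_explicit(&globalFloat, value, memory_order_release);
}

float getFloatValue(void) {
    return atomic_load_explicit(&globalFloat, memory_order_acquire);
}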
Yes, it will be atomic. On a 32-bit architecture, any aligned store of a primitive datatype that is 32 bits or smaller (char, short, int, long, float, etc.) will be atomic.
There is more to your question than just atomicity. Even if a write is atomic, there is no guarantee that another thread will see the change without some kind of memory barrier. This probably isn't a problem with current iPhones because they only have one CPU, but it can be a problem on desktop machines.
See:
C++ and the Perils of Double-Checked Locking in DDJ (PDF)
Memory Barriers
StackOverflow discussion on thread data sharing