How do the two program counter registers work in the 6502?

I am currently developing a subset of the 6502 in LogiSim and at the current stage I am determining which parts to implement and what can be cut out. One of my main resources is Hanson's Block Diagram.
I am currently trying to determine how exactly the increment logic works. In a previous project I worked on in school, the program counter was incremented by a single instruction coming from the decoded instruction memory. In this diagram, the program counter logic looks like it works differently from what I have previously encountered.
How exactly does this logic work, and does it use an instruction from the instruction memory to increment? As a follow-up, is it possible to simplify the program counter logic to use one or two instructions from the instruction memory to increment?

The 6502 has only one program counter. It is 16 bits wide. Because a lot of other things in the CPU are exactly 8 bits wide, it makes hardware sense to cut the 16-bit program counter into two halves, so that each half fits in 8 bits. For example, each half is loaded separately, one by one, by an instruction like JMP. And the relative branch instructions bring PCL to an ALU input.
Those halves are just called PCH and PCL internally. You can see that PCL has increment logic attached to it, and one of the outputs is a carry out signal called PCLC. That's the input to another circuit that increments PCH.
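If it helps to see the mechanism in software form, here is a minimal C sketch (my own illustration, not the actual 6502 circuit) of a 16-bit PC stored as two 8-bit halves, where the low half's carry-out plays the role of the PCLC signal driving the high half's increment:

    /* A minimal sketch (not the 6502 netlist) of the split increment:
       PCL is incremented first, and its carry-out (PCLC in Hanson's
       diagram) drives the PCH increment. */
    #include <stdint.h>
    #include <stdio.h>

    struct pc6502 {
        uint8_t pch;  /* high byte of the program counter */
        uint8_t pcl;  /* low byte of the program counter  */
    };

    static void pc_increment(struct pc6502 *pc)
    {
        pc->pcl++;                /* 8-bit increment wraps 0xFF -> 0x00 */
        if (pc->pcl == 0x00)      /* the wrap means PCLC carried out   */
            pc->pch++;            /* ripple the carry into PCH         */
    }

    int main(void)
    {
        struct pc6502 pc = { 0x12, 0xFF };
        pc_increment(&pc);
        printf("%02X%02X\n", pc.pch, pc.pcl);  /* prints 1300 */
        return 0;
    }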
None of this matters to a program. The program just cares that PC points to the instruction that's going to be executed next, and uses that fact to influence its own flow. But if you're interested in knowing more of these details, I'll point you to the Visual6502 simulation. It's much more accurate and detailed than Hanson's Block Diagram.

Related

What is the purpose of the CIR if I have the MDR in Von Neumann Architecture?

From the fetch decode execute cycle of the Von Neumann Architecture, at a basic level, here is what I understand:
Memory address in PC is copied to MAR.
PC +=1
The instruction / data in the address of the MAR is stored in the MDR after being fetched from main memory.
Instruction from MDR is copied to CIR
Instruction / data in memory is decoded & executed by the CU.
Result from the calculation stored in ACC.
Repeat
Now if the MDR value is copied to the CIR, why are they both necessary? I am quite new to systems architecture so I may have gotten the wrong end of the stick, but I've tried my best :)
Think about what happens if the current instruction is a load or store: does anything more need to happen after the value reaches the MDR? If so, how is the CPU going to remember what it's in the middle of doing if it doesn't keep track of that somehow?
Whether that requires the original instruction bits the whole time or not depends on the design.
A CPU that needs to do a lot of decoding (e.g. a CISC with a compact variable-length instruction set, like the original 8086) may not keep the actual instruction itself around, but instead just some decode results. For example, the actual 8086 decoded incrementally, scanning through prefixes one byte at a time until reaching the opcode. And modern x86 decodes to uops which it sends down the pipeline; the original machine-code bytes aren't needed after that.
But CPUs like MIPS were specifically designed so parts of the instruction word could be used directly as internal control signals. Still, it's not always necessary to keep the whole instruction around in one piece.
It might make more sense to look at the CIR as being the input latches of the decoding process that produces the necessary internal control signals, or the sequence of microcode steps, depending on the design. Having a truly physical CIR separate from that is fine if you don't mind redoing decoding at any step where you need to consult it to figure out what to do next.
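As a sketch of why the instruction needs to survive past the MDR, here is a toy multi-cycle fetch-decode-execute loop in C (the opcodes and encoding are invented, not any real ISA). Notice that executing a LOAD reuses the MDR for the fetched data, so the instruction itself has to live on in the CIR:

    #include <stdint.h>

    uint16_t memory[65536];
    uint16_t pc, mar, mdr, cir, acc;

    enum { OP_LOAD = 1, OP_ADD = 2 };   /* hypothetical opcodes */

    void step(void)
    {
        mar = pc;                /* 1. PC -> MAR          */
        pc += 1;                 /* 2. PC += 1            */
        mdr = memory[mar];       /* 3. fetch into MDR     */
        cir = mdr;               /* 4. MDR -> CIR         */

        switch (cir >> 12) {     /* 5. decode from CIR... */
        case OP_LOAD:
            mar = cir & 0x0FFF;  /* ...which still holds the instruction */
            mdr = memory[mar];   /* even though MDR is overwritten here  */
            acc = mdr;
            break;
        case OP_ADD:
            mar = cir & 0x0FFF;
            mdr = memory[mar];
            acc += mdr;          /* 6. result into ACC    */
            break;
        }
    }

    int main(void)
    {
        memory[0] = (OP_LOAD << 12) | 0x0100;  /* LOAD from address 0x100 */
        memory[0x0100] = 42;
        step();
        return acc == 42 ? 0 : 1;
    }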

If the PC register is simultaneously read and written, does its read data contain the previous data or the newly-written data?

If the PC register is simultaneously read and written, does its read data contain the previous data or the newly-written data? Based on my understanding of sequential circuits, the effect of the write does not instantly take effect in the PC register due to propagation delay, so at the rising edge of the clock the read will get the old value. But the corollary to my question is: if this is the case, shouldn't the read also have a delay in some sense, and couldn't it possibly read the newly-written data?
A program counter is normally special enough that it's not part of a register file with other registers. You don't have a "read command"; its output is just always wired up to other parts that read it when appropriate (i.e. when its output is stable and has the value you want). See, for example, various block diagrams of MIPS pipelines, or of non-pipelined single-cycle or multi-cycle designs.
You'd normally build such a physical register out of edge-triggered flip-flops, I think. (https://en.wikipedia.org/wiki/Flip-flop_(electronics)). Note that a D flip-flop does latch the previous input as the current output on a clock edge, and then the input is allowed to change after that.
There's a timing window before the clock edge during which the input has to remain stable; it can start to change a couple of gate delays after the edge. Note the example of a shift register built by chaining D flip-flops, all driven by the same clock signal.
If you have a problem arranging to capture a value before it starts changing, you could design in some intentional clock skew so the flip-flop reliably latches its input before you trigger the thing providing the input to change it. (But normally whatever you're triggering will itself have at least a couple gate delays before its output actually changes, hence the shift-register made of chained D flip-flops.)
That wiki article also mentions the master-slave edge-triggered D Flip-Flop that chains 2 gated (not clocked) D latches with an inverted clock, so capturing the input happens on the opposite clock edge from updating the output with the previously-captured data.
By comparison and for example, in register files for general-purpose registers in classic RISC pipelines like MIPS, IIRC it's common to build them so write happens in the first half-cycle and read happens in the second half-cycle of the ID stage. (So write-back can "forward" to decode/fetch through the register file, keeping the window of bypass-forwarding or hazards shorter than if you did it in the other order.)
This means the write data has a chance to stabilize before you need to read it.
Overall, it depends how you design it!
If you want the same clock edge to update a register with new inputs while also latching the old value to the output, a master-slave flip-flop will do that (capture the old input into internal state, and latch the old internal state onto the outputs).
Or you could design it so the input is captured on the clock edge, and propagates to the output after a few gate delays and stays latched there for the rest of this clock cycle (or half cycle). That would be a single D flip-flop (per bit).
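Here is a behavioral sketch in C (not gate-level, and the names are mine) of what "a reader clocked by the same edge sees the old value" means for a single D flip-flop bit:

    #include <stdint.h>
    #include <stdio.h>

    struct dff { uint8_t q; };

    /* On a rising clock edge: Q takes the value D held just before the
       edge; anything reading on that same edge sees the previous Q. */
    static uint8_t dff_edge(struct dff *ff, uint8_t d)
    {
        uint8_t old_q = ff->q;  /* value latched on the previous edge    */
        ff->q = d;              /* capture the input sampled at this edge */
        return old_q;           /* readers on this same edge see old data */
    }

    int main(void)
    {
        struct dff pc_bit = { 0 };
        /* A simultaneous "read and write" on one edge: the reader gets
           the old value; the new value appears at the output only after
           the edge (plus propagation delay). */
        uint8_t seen = dff_edge(&pc_bit, 1);
        printf("read at edge: %u, after edge: %u\n", seen, pc_bit.q);
        return 0;
    }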

Why do we need to specify the number of flash wait cycles?

Especially when working with "faster" devices like the STM32F4xx/F7xx, we need to specify the number of flash wait cycles, based on the supply voltage and the system clock frequency.
When the CPU fetches instructions or constants, this is done through the FLITF. Am I right in assuming that the FLITF holds a CPU request until it can provide the requested data, making it impossible for other bus masters to access flash in the meantime?
If this is true, why is it important for the interface to know the number of flash wait cycles? The cache preloads instructions anyway, regardless of whether it knows how long the wait is, no?
Because the flash interface isn't magic.
It has to meet the necessary setup and hold times for addressing and reading out the flash cells, which will vary somewhat depending on voltage. Taking the STM32F411 as an example (because I have that TRM handy), doing some maths with the voltage/frequency/wait-state table implies that a read from flash on one of those takes in the order of ~30ns above 2.7V, down to ~60ns below 2.1V.
Since the flash interface doesn't have its own asynchronous nanosecond-precision timekeeping ability (because that would be needlessly complicated, power-hungry, and silly), that translates to asserting its signals for n clock cycles, after which it can assume the data signals from the cells are stable enough to read back*. How does it know what the clock frequency is, and therefore what n should be? Simple: you, as the programmer who set the clock, tell it. Some hardware things are just infinitely easier to let software deal with.
* and then going through the further shenanigans of extracting the relevant 8, 16 or 32 bits out of the 128-bit line it's read, to finally spit that out the other side onto the AHB bus to the waiting CPU, obviously.
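To put the arithmetic in code form, here is a back-of-envelope sketch of what the wait-state setting encodes. The ~33 ns access time is my assumption, in the spirit of the figures above (roughly one extra wait state per 30 MHz on an STM32F4 at 2.7-3.6 V); check your device's reference manual for the real table:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double t_access_ns = 33.0;   /* assumed flash access time      */
        double f_clk_mhz   = 100.0;  /* chosen system clock            */

        double t_clk_ns = 1000.0 / f_clk_mhz;           /* 10 ns period */
        int cycles = (int)ceil(t_access_ns / t_clk_ns); /* cycles needed */
        int wait_states = cycles - 1;    /* first cycle is "free"       */

        printf("%d wait state(s) at %.0f MHz\n", wait_states, f_clk_mhz);
        /* prints: 3 wait state(s) at 100 MHz -- matching ST's table    */
        return 0;
    }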

CPU clock cycle misunderstanding

I can't quite understand CPU clock speeds such as 3.4 GHz. I know this means 3.4 billion clock cycles per second.
So if the machine used single-clock-cycle instructions, it could execute about 3.4 billion instructions per second.
But in a pipeline, an instruction basically needs more cycles, although each cycle is shorter than a single-cycle implementation's clock cycle.
But although a pipeline has more throughput, the CPU can still only do 3.4 billion cycles per second. So it can execute 3.4 billion / 5 instructions (if one instruction needs 5 cycles), which is less than the single-cycle implementation (3.4 > 3.4/5). What am I missing?
Does a CPU clock such as 3.4 GHz refer to the pipelined cycle, not to a single-cycle implementation?
Pipelining
Pipelining doesn't involve cycles shorter than a single clock cycle. Here's how pipelining works:
We have a complicated task to do. We take that task and break it down into a number of stages, each of which is relatively simple to carry out. We study the amount of work in each stage to make sure each stage takes about the same amount of time as any other.
With a processor, we do roughly the same thing--but in this case, it's not "install these fourteen bolts", it's things like fetching and decoding instructions, reading operands, executing (often a couple of stages here), and writing back results.
Like the automotive production line, we provide each stage of the pipeline with a specialized set of tools for doing exactly (and only) what is needed at that stage. When we finish doing one stage of processing on a car/instruction, it moves along to the next stage, and this stage gets the next car/instruction to process.
In an ideal situation, the process works (roughly) like this:
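(The original answer illustrated this with a table; here is a rough equivalent, assuming a generic five-stage pipeline with fetch (IF), decode (ID), execute (EX), memory (MEM), and write-back (WB) stages:)

    cycle:    1    2    3    4    5    6    7
    insn 1:   IF   ID   EX   MEM  WB
    insn 2:        IF   ID   EX   MEM  WB
    insn 3:             IF   ID   EX   MEM  WB

Each instruction still takes five cycles start to finish, but once the pipeline is full, one instruction completes every cycle.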
It took Ford about 12 hours to build one Model N car (the predecessor to the Model T). Thanks primarily to pipelining the production line, it took only about two and a half hours to build a Model T. More importantly, even though a Model T took 2.5 hours start to finish, that time was broken down into no fewer than 84 discrete steps, so when everything ran smoothly the production line as a whole could produce another car (about) every two minutes.
That didn't always happen though. If one stage ran short of parts, the stages after it had to wait. If the pause lasted very long, it would back things up so the preceding stages had to wait too.
The same can happen in a processor pipeline. For example, when a branch happens, the processor may have to wait a while before the next instruction can be fetched. If an instruction needs an operand from memory, that can lead to a pause (a "pipeline bubble") as well.
To prevent pauses in his pipeline, Henry Ford hired people to study the stages, figure out how many of each kind of part would need to be on hand for each stage, and so on. I don't know for sure, but I think it's a fair guess that there were probably a few people designated to watch the supply of parts at different stations, and send somebody running to let a warehouse manager know if (for whatever reason) the supply of parts for a particular stage looked like it was running short so they'd need more soon.
Processors do a little of the same thing--they have things like branch predictors and prefetchers that attempt to figure out ahead of time what will be needed by the stream of instructions being executed, trying to ensure that everything is on hand when it's needed (with caches, for example, to temporarily store things that seem likely to be needed soon).
So, like the Model T, it takes some relatively long amount of time for each instruction to execute start to finish, but we get another product finished at much shorter intervals--ideally once a clock (but see my other answer--modern designs often execute more than one instruction per clock).
A typical modern CPU can execute a number of unrelated instructions (those that don't depend on the same resources) concurrently.
To do that, it typically ends up with a basic structure vaguely like this:
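(The original answer had a drawing here; a rough ASCII rendering of the structure it describes, with the instruction stream on the left, would look something like this:)

                   +-----------+     +----------------+ port 0 +-------------+
    instruction -->| decoder 1 |---->|                |------->| exec units  |--+
       stream   -->| decoder 2 |---->| reorder buffer | port 1 +-------------+  |   +-------------+
                -->| decoder 3 |---->| / "scoreboard" |------->| exec units  |--+-->| retirement  |
                   +-----------+     |                | port 2 +-------------+  |   | units (x3)  |
                                     +----------------+------->| exec units  |--+   +-------------+
                                                                +-------------+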
So, we have an instruction stream coming in on the left. We have three decoders, each of which can decode one instruction per clock cycle (but there may be limitations, e.g. complex instructions may all have to pass through one particular decoder, while the other two can only handle simple instructions).
From there, the instructions pass into a reorder buffer, which keeps a "scoreboard" of which resources are used by each instruction, and which resources are affected by that instruction (where a "resource" would typically be something like a CPU register or a flag in the flags register).
The circuitry then compares those scoreboards to determine dependencies. For example, if one instruction writes to register 0, and a later one reads from register 0, then those instructions must execute serially. At each clock, it tries to find the N oldest instructions that have no outstanding dependencies and sends them for execution.
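As a toy illustration of that comparison (my own encoding, not any real scoreboard format), register sets can be kept as bitmasks and compared in a few gates' worth of logic:

    #include <stdbool.h>
    #include <stdint.h>

    struct insn {
        uint32_t reads;   /* bitmask: one bit per architectural register */
        uint32_t writes;
    };

    /* True if 'younger' must wait for 'older' to execute first. */
    static bool depends_on(struct insn younger, struct insn older)
    {
        return (younger.reads  & older.writes)   /* read-after-write  */
            || (younger.writes & older.writes)   /* write-after-write */
            || (younger.writes & older.reads);   /* write-after-read  */
    }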
There are then a number of independent execution units. Each of these is basically a "pure" function--it takes some inputs, carries out a specified transformation on them, and produces an output. This makes it easy to replicate them as needed, and to have as many running in parallel as we want/can afford. They are typically grouped, with one port going to each group. In each clock, we can send one instruction through each port to one of the execution units in that group. Once an instruction arrives at an execution unit, it may take more than one clock to finish execution.
Once those execute, we have a set of retirement units that take the results and write them back to the registers in program order. Again, we have multiple units so we can retire multiple instructions per clock.
Note: this drawing tries to be semi-realistic about the rough number of decoders, retirement units, and ports that it depicts, but what it shows is a general idea--different CPUs will have more or fewer specific resources. For almost any of them, though, the number of decoded instructions shown in the scoreboard unit is far too low--a realistic number would be more like 50 instructions.
In any case, actual execution of instructions is one of the hardest parts of this to measure or reason about. The number of ports gives us a hard upper limit on the number of instructions that can start executing in any given clock. The number of decoders and retirement units give an upper limit on the number of instructions that can be started/finished per clock. The execution itself...well, there are a lot of execution units, and each one (at least potentially) takes a different number of clocks to execute an instruction.
With the design as shown above, you'd have a hard upper limit of three instructions per clock. That's the most you can decode or retire. With a different design, that could obviously go up or down (e.g., with 4 decoders, 4 ports and 4 retirement units, the upper limit could go up to 4).
Realistically, with that design you wouldn't normally expect to see three instructions execute in most clock cycles. There are enough dependencies between instructions that you'd probably expect closer to 2 as a long term average (and much more likely a little less than 2). Increasing the available resources (more decoders, more retirement units, etc.) will rarely help that a whole lot--you might get to an average of three instructions per clock, but hoping for four is probably unrealistic.
As others have noted, the full details of how a modern CPU operates are complicated. But part of your question has a simple answer:
Does a CPU clock such as 3.4 GHz refer to the pipelined cycle, not to a
single-cycle implementation?
The clock frequency of a CPU refers to how many times per second the clock signal switches. The clock signal is not divided into smaller pipelined segments. The purpose of pipelining is to allow for faster clock speeds. So 3.4 GHz refers to the number of times per second that a single pipeline stage can perform whatever work it needs to do when executing an instruction. The total work of executing an instruction is done over multiple cycles, each of which could be in a different pipeline stage.
Your question also shows some misconceptions about how pipelining works:
But although a pipeline has more throughput, the CPU can still only do
3.4 billion cycles per second. So it can execute 3.4 billion / 5
instructions (if one instruction needs 5 cycles), which is less than
the single-cycle implementation (3.4 > 3.4/5). What am I missing?
In the simple case the throughput of a single-cycle CPU and a pipelined CPU is the same. The latency of the pipelined CPU is higher because it requires more cycles (i.e. 5 in your example) to execute a single instruction. But once the pipeline is full, the throughput is the same as for a single-cycle non-pipelined CPU running at the same clock. So in the simple case, using your example, a single-cycle CPU could execute 3.4 billion instructions in 1 second, while the pipelined CPU with 5 stages could execute 3.4 billion minus 4 instructions in 1 second (the 4 extra cycles are spent filling the pipeline). Subtracting a handful of instructions from 3.4 billion is a negligible difference, whereas dividing by 5 would be a very significant one.
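To put numbers on that (my arithmetic, using the figures from the question):

    cycles to complete N instructions on an S-stage pipeline, no stalls:
        total_cycles = (S - 1) + N      -- (S - 1) cycles to fill, then 1 per cycle

    with S = 5 and 3.4e9 cycles available in one second:
        instructions completed = 3.4e9 - (5 - 1) = 3.4e9 - 4

So the pipeline costs you 4 instructions out of 3.4 billion in that second, not a factor of 5.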
A couple of other things to note: the simple case I described isn't really true, because of dependencies between instructions that require pipeline stalls, and most modern CPUs can execute more than one instruction per cycle.

Gforth parallel processing

I have written a Forth Mandelbrot fractal plotter, and as much as a technical exercise as anything else I would like to try to speed it up with some parallel processing.
For the time being I would be happy if I could just use both of my cores (have one core do one half of the image and the other the other half).
I have noticed that Windows XP will quite happily manage two instances of Gforth and try to use as much processor capacity as possible, so running two processes could be a start. However, I am not sure if they can share memory, or if they can both write to a file at the same time (or how to tell one process to start writing at x bytes from the start of the file).
In summary, how can I do parallel processing using Gforth on Windows XP?
You could have each program do a grid of pixels rather than a single pixel, and then recombine them in the end.
AFAIK, pixels in a Mandelbrot set are independent of each other (someone correct me if I am wrong); however, the amount of computation per pixel varies unpredictably (the iteration count differs from point to point), which makes it hard to parallelize with good load balance without some kind of central dispatch thread (and then you run into potential problems with contention).
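One way to split the work without shared memory is to have each process write its own half of a raw image file at a fixed offset. Here is a sketch of the idea in C (the file name, image size, and the elided Mandelbrot kernel are placeholders); in Gforth the same pattern maps onto the standard open-file, reposition-file, and write-file words:

    #include <stdio.h>
    #include <stdlib.h>

    #define WIDTH  800
    #define HEIGHT 600

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int half = atoi(argv[1]);                 /* 0 = top, 1 = bottom */
        long bytes_per_half = (long)WIDTH * (HEIGHT / 2);
        long offset = half * bytes_per_half;      /* disjoint byte range */

        FILE *f = fopen("mandel.raw", "r+b");     /* pre-created at full size */
        if (!f) return 1;
        fseek(f, offset, SEEK_SET);               /* seek to our half */

        for (long i = 0; i < bytes_per_half; i++) {
            unsigned char pixel = 0;              /* ...Mandelbrot kernel here... */
            fwrite(&pixel, 1, 1, f);
        }
        fclose(f);
        return 0;
    }

Run one instance with argument 0 and another with 1; because the two processes write disjoint byte ranges of a pre-created file, they need no coordination beyond agreeing on the offsets.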
See GForth Pipes.