whats the difference?:Integrated Circuits vs combinational circuit and sequential circuit - cpu-architecture

As per my understanding, (in terms of Computer Architecture by Moris Mano 3rd edition), a combinational circuit is a group of Gates, whereas sequential circuit is a group of Gates plus flip-flops.
On the hand Integrated Circuits also consist of interconnected Gates, so does that mean they are same?

Combinatorial circuit - a circuit whose output is always fully defined by its current outputs (has no memory)
Sequential circuit - a circuit whose output may depend on the history of the previous inputs.
Integrated circuit - a circuit implemented as a single silicon wafer (or other material). Some integrated circuits are combinatorial, some are sequential. For the official definition, see http://en.wikipedia.org/wiki/Integrated_circuit

Related

Relation between CPI and number of execution units when looking at SIMD intrinsics [duplicate]

This question already has answers here:
latency vs throughput in intel intrinsics
(1 answer)
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
(1 answer)
How many CPU cycles are needed for each assembly instruction?
(5 answers)
Closed 10 days ago.
I understand that the term Cycle Per Instruction closely relates to the superscalarity of the processor, a term which I have not fully understood. According to Wikipedia, "...a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor". In the same article, there is a hint that superscalarity is not necessarily related to instruction pipelining, a concept with which I'm fairly familiar.
Now, let's get concrete by taking the example of _mm256_shuffle_ps, which, according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avxnewtechs=AVX,AVX2,FMA, has a CPI of 0.5 for the Alder Lake micro-architecture.
Questions:
Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?
How can a programmer know which separate instructions involve the same executions units?
If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?
Thanks in advance for the transfer of knowledge.
Superscalar is usually a term you'd apply to CPU's of old, e.g. the original pentium. Back in those days, you'd have two seperate pipes, the U (primary) and V (secondary) pipe, which would allow you to potentially dispatch two instructions at the same time (i.e. it had 2 execution units). It was effectively a way of getting slightly better performance from an in-order processor core (although that came with caveats - e.g. pipeline bubbles could be an issue)
These days processors tend to use Out of Order Execution (OOOE) backed by a larger number of execution units. Alder Lake CPU's have 12 execution units, however those execution units tend to be specialised to some extent - e.g. load/store, pointer arithmetic, SIMD FPU units, etc. That's why you won't see 12 execution units capable of performing a shuffle. It can dispatch 12 micro-ops per cycle, but those ops can't all be the same instruction.
Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?
No, you can't assume that. You can assume that there are two execution units which are capable of executing _mm256_shuffle_ps, but that doesn't mean those two units are identical. For example, we can see there are 3 execution units that can work on 256bit YMM registers, and we can see from the instruction timings that all 3 can perform _mm_add_epi32. However, only 2 can perform _mm_shuffle_ps, and only 1 can perform _mm_div_ps, so they are clearly not the same....
How can a programmer know which separate instructions involve the same executions units?
Unless the manufacturer explicitly states the capabilities of each execution port (sometimes you'll find that info in the technical manual for the CPU), you're pretty much limited to making educated guesses (e.g. the Apple M1)
If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?
Modern Intel processors are not superscalar, therefore describing them as such makes no sense at all.
Alder Lake is able to dispatch 12 instructions per clock, using Out-Of-Order-Execution. The types of instruction the execution units can handle, is typically geared up to cover a range of common cases. For example, consider this code:
void func(float* r, float* a, float* b) {
// basic integer ops: increment and less-than
for(int i = 0; i < 128; ++i) {
// 2 address manipulation instructions
float* addr_a = a + i * 4;
float* addr_b = b + i * 4;
// 2 load instructions
__m128 A = _mm_load_ps(addr_a);
__m128 B = _mm_load_ps(addr_b);
// an addition
__m128 R = _mm_add_ps(A, B);
// another address manipulation op
float* addr_r = r + i * 4;
// a store instruction
_mm_store_ps(addr_r, R);
}
}
Providing 12 execution units that are all capable of executing an _mm_add_ps instruction doesn't really make any sense. It makes more sense to balance the number of SIMD execution units with all those other common tasks (e.g. address manipulation, looping, etc).

Out of Order Execution, How to Solve True Dependency?

I was reading about OOOE (Out of Order Execution) and read about how we can solve false dependencies (By using renaming).
But my question is, how can we solve true dependency (RAW - read after write)?
For example:
R1=R2+R3 #1
R1=R4+R5 #2
R9=R1 #3
Renaming won't be helpful here in case CPU chose to run #2 before #1.
There is no way to really avoid them, that's why RAW hazards are called true dependencies. Instructions have to wait for their inputs to be ready before they can execute. (With OoO exec, normally CPUs will dispatch the oldest-ready-first instructions / uops to execution units, for example on Intel CPUs.)
True dependencies aren't something you "solve" in the sense of making them go away, they're the essence of computation, the way multiple computations on the same numbers are glued together to form an algorithm. Other hazards (WAR and WAW) are just implementation details, reusing the same architectural register for something different.
Sometimes you can structure an algorithm to have shorter dependency chains, once things are already nailed down into machine code, the CPU pretty much just has to respect them, with at best out-of-order exec to overlap independent dep chains.
For loads, in theory there's value-prediction, but AFAIK no real CPU is doing that. Mispredictions are expensive, just like for branches. That's why you'd only want to consider that for high-latency stuff like loads that miss in cache, otherwise the gains won't outweigh the costs. (Even then, it's not done because the gains don't outweigh the costs even for loads, including the power / area cost of building a predictor.) As Paul Clayton points out, branch prediction is a form of value prediction (predicting the condition the load was based on). The more instructions you can keep in flight at once with OoO exec, the more you stand to lose from mispredicts, but real CPUs do predict / speculate for memory disambiguation (whether a load reloads an earlier store to an unknown address or not), and (on CPUs like x86 with strongly-ordered memory models) speculating that early loads will turn out to be allowed; as well as the well known case of control dependencies (branch prediction + speculative execution).
The only thing that helps directly with dependency chains is keeping instruction latencies low. e.g. in the case of your example, #3 is just a register copy, which modern x86 CPUs can do with zero latency (mov-elimination), so another instruction dependent on R9 wouldn't have to wait an extra cycle beyond #2 producing a result, because it's handled during register renaming instead of by an execution unit reading an input and producing an output the normal way.
Obviously bypass forwarding from the outputs of execution units to the inputs of the same or others is essential to keep latency low, same as in an in-order classic RISC pipeline.
Or more conventionally, by improving execution units, like AMD Bulldozer family had 2-cycle latency for most SIMD integer instructions, but that improved to 1 cycle for AMD's next design, Zen. (Scalar integer stuff like add was always 1 cycle on any sane high-performance CPU.)
OoO exec with a large enough window size lets you overlap multiple dep chains in parallel (as in this experiment, and of course software should aim to have enough instruction-level parallelism (ILP) for the CPU to find and exploit. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an example of doing that by summing into multiple accumulators for a dot-product.)
This is also useful for in-order CPUs if done statically by a compiler, where techniques like "software pipelining" are a big deal to overlap execution of multiple loop iterations because HW isn't finding that parallelism for you. Or for OoO exec CPUs with a limited window size, for loops with long but not loop-carried dependency chains within each iteration.
As long as you're bottlenecked on something other than latency / dependency chains, true dependencies aren't the problem. e.g. a front-end bottleneck, ideally maxed out at the pipeline width, and/or all relevant back-end execution units busy every cycle, mean that you couldn't get more work through the pipeline even if it was independent.
Of course, in a lot of code there are enough dependencies, including through memory, to not reach that ideal situation.
Simultaneous Multithreading (SMT) can help to keep the back-end fed with work to do, increasing throughput by having the front-end of one physical core read multiple instruction streams, acting as multiple logical cores. This effectively creates ILP out of thread-level parallelism, which is useful if software can scale efficiently to more threads, exposing enough TLP to keep all the logical cores busy.
Related:
Modern Microprocessors A 90-Minute Guide!
How many CPU cycles are needed for each assembly instruction? - that's not how it works on superscalar OoO exec CPUs; latency or throughput or a front-end bottleneck might be the relevant thing.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

How can a connection between one gate input with mutiple outputs of other gates causes circuit memory?

I'm reading the Digital Design and Computer Architecture by David Harris, Sarah Harris. The authors give the following definition of combinational logic:
A combinational circuit’s outputs depend only on the current values of
the inputs; in other words, it combines the current input values to
compute the output... A combinational circuit is memoryless, but a
sequential circuit has memory. The functional specification of a
combinational circuit expresses the output values in terms of the
current input values.
However, they claim this circuit is not combinational:
because "node n6 connects to the output terminals of both I3 and I4". Indeed, it's one of the designated signs when a scheme can not be combinational but, according to the authors:
Certain circuits that disobey these rules are still combinational, so
long as the outputs depend only on the current values of the inputs.
As I'm able to catch on, the aforementioned circuit is the case: its output is 1 if and only if its inputs are both 1, otherwise the output is 0. So the output is defined as a function of the inputs (the AND function).
In fact, there was already a question about this circuit in the computer science network and it has an accepted answer. Here's an excerpt from it:
Circuit (d) cannot be written in this form [of formula], since the
outputs of I3 and I4 are wired together. What is the relation between
the input to the rightmost gate and the outputs of I3 and I4? Not
something that can be described combinatorially.
Unfortunately, I'm still confused due to
The circuit, regarded as a black box, is still in scope of the combinational logic definition: its output values depend only on the current values of the inputs;
The relation between the input to the rightmost gate and the outputs of I3 and I4 can be described through the function NAND of the circuit inputs and this function is quite "memoryless". It's not obvious for me why we can't afford to depict a gate input using multiple outputs of other gates.
I need some elaboration. Maybe things would fall into place if someone provide a circuit example when two gates outputs is connected to one input and it actually causes "memory" (in contrast to the considered sample).
Circuit (d) is not combinational because it is not a logic gate circuit at all.
I think it's a very silly example to explain combinational vs sequential circuits.
In a logic circuit, an output wire cannot go to another output wire. You assumed that the outputs, when connected together, will act as a logical OR or AND of themselves.
This is not true (otherwise why would we use AND/OR gates in the first place?).
What will happen depends on the specific implementation of the gates (i.e. specific IC or manufacturing process you used) and this is not something that a logic circuit is meant to model.
A logic circuit must behave the same, no matter what brand you are using.
In circuit (d), the output of I3 will feed both the input of the rightmost NOT and the output of I4 (the complementary is also true).
Most IC will break if a current will flow in from their outputs, others won't but they will interfere with the capability of the right-most NOT to sense its input.
Logic circuits are still circuits, so you should, in theory, perform a full circuit analysis, which includes solving differential equations, to solve for their output.
Digital electronics is a branch that abstracts from these "low-level" details but at the cost of making some assumptions, one of which is: outputs are never merged without a gate.
The whole point of a combinational circuit is that you can write out = f(in0, in1, ..., ink) but it's not always possible.
Take for example an edge detector, it is just a f(A) = (NOT A) AND A which should, by the law of the excluded middle, always output 0.
But it will not because the NOT A path takes a slightly longer time to reach the AND input.
How can you describe this dynamic behaviour with a f(A) function?
Don't think too much of it, when you'll get to sequential circuits you'll spot the difference immediately (if you need a preview, look up for "latch circuit").

Q# ResourcesEstimator for quantum chemistry of 1000+ qubit systems

This is a q# question about resource estimation on quantum chemistry problems
In the docomentation for ResourcesEstimator, it states that ...by executing the quantum operation without actually simulating the state of a quantum computer; for this reason, it can estimate resources for Q# operations that use thousands of qubits.
I am wondering how we can perform Quantum Chemistry simulation resource estimation on thousands of qubits. Although a quantum circuit of thousands of qubits can be an input to ResourcesEstimator, it is not clear to me how to generate the quantum circuit using the conventional workflow as described in this documentation on end-to-end with NWChem.
As far as I understand, the .nw file suggests generating the molecular electron-integrals which outputs to a BroomBridge .yaml file which loads to the GetGatecount and similar resource estimators. However, in a 1000+ qubit chemistry simulation, just the generation of the yaml file would take days on a beefy computer and the filesize would be giga or terabytes in size.
My question is; can we do this resource estimation without explicitly calculating the Hamiltonian matrix elements? If not, how do you propose doing these large-scale resource estimations 'up to thousands of qubits'?
Thanks for your help! [q#]
It would be more accurate to say "it can estimate resources for Q# operations that use thousands of qubits, if the classical part of the code can be executed in a reasonable time".
QDK resource estimator is basically a special simulator which still "executes" the Q# program it gets. Unlike the full state or Toffoli simulators, though, it does not simulate the effect of the gates and measurements on the state of the quantum systems - instead it increments certain counters that track the metrics produced by resource estimator. For example, if you use a T gate, it will increment the counter of T gates but will not touch the counter of Pauli gates or CNOTs.
This means that the resource estimator can run much larger programs than the other simulators (the main restriction on full state simulator comes from the need to update the full state of the system, which grows larger than the available memory around 30-40 qubits). But it still needs to be able to run the program, going through all the gates and all the classical computations involved, even if going through the gates is much more lightweight than on a full state simulator.

Simulink large scale modeling: best practices for interconnecting blocks

What are the best practices for large scale modeling in Simulink when it comes to connecting blocks? Would you use the same structure for all I/O ports of your blocks to facilitate their interconnection (but obviously there will be a lot of redundant signals) or would you define custom structures for each I/O port type with only the necessary information?
For example:
A reactor is modeled as a single block with 4 inputs and 1 output:
I1. Feed which is a structure containing: Flow and Concentrations (7 species);
I2. Mass flow of enzymes - scalar;
I3. Mass flow of water - scalar;
I4. Outflow - which is adjusted by a controller to keep a constant mass in the tank - scalar;
O1. The outstream, which is a struct: Flow and Concentrations
(let's say 10 species).
Now imagine this reactor block is only a tiny piece of an entire process. There are enzymes and water tanks connected to it and some other downstream processes etc.
Would you use a unique structure for all IO ports (even if it scales up to 50-100 components but you would need less per block or 1 component like I2, I3 and I4 above which are scalars)? Is this regarded as bad programming practice?
Or would you customize the IO port structure for each block? Of course you would group them somehow and make reuse of them but with no redundant information.
Thanks!
You might find the following useful: http://www.mathworks.co.uk/videos/tips-and-tricks-for-large-scale-model-based-design-part-2-81873.html.
I would personally use a single bus input and a single bus output for your reactor block. You can then group buses together to form larger bus signals as you move up the hierarchy of your model. Look at the Bus Creator and Bus Selector blocks.