This is a Q# question about resource estimation on quantum chemistry problems.
The documentation for ResourcesEstimator states that it executes the quantum operation without actually simulating the state of a quantum computer; for this reason, it can estimate resources for Q# operations that use thousands of qubits.
I am wondering how we can perform Quantum Chemistry simulation resource estimation on thousands of qubits. Although a quantum circuit of thousands of qubits can be an input to ResourcesEstimator, it is not clear to me how to generate the quantum circuit using the conventional workflow as described in this documentation on end-to-end with NWChem.
As far as I understand, the .nw file tells NWChem to generate the molecular electron integrals, which are written to a BroomBridge .yaml file that is then loaded by GetGatecount and similar resource estimators. However, in a 1000+ qubit chemistry simulation, just generating the yaml file would take days on a beefy computer, and the file would be gigabytes or terabytes in size.
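To put a rough number on the file-size concern, here is a back-of-the-envelope sketch in Python. It assumes one qubit per spin orbital, real orbitals with the usual 8-fold permutational symmetry of the two-electron integrals, and no screening of small terms. The function names and the `bytes_per_term` figure are illustrative guesses, not anything defined by NWChem or the YAML format.

```python
def two_electron_integral_counts(num_qubits):
    """Return (dense, unique) counts of two-electron integrals.

    Assumption: num_qubits spin orbitals, i.e. num_qubits // 2 spatial orbitals.
    """
    n = num_qubits // 2
    dense = n ** 4                        # every (p, q, r, s) index combination
    pairs = n * (n + 1) // 2              # unordered index pairs (p, q)
    unique = pairs * (pairs + 1) // 2     # unordered pairs of pairs (8-fold symmetry)
    return dense, unique


def estimated_bytes(unique_terms, bytes_per_term=40):
    # bytes_per_term is a guess for one text entry: four indices plus a float.
    return unique_terms * bytes_per_term


dense, unique = two_electron_integral_counts(1000)
size_bytes = estimated_bytes(unique)
```

Under these assumptions, 1000 qubits gives several billion unique terms and a text file in the hundreds of gigabytes, which is consistent with the estimate in the question.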
My question is: can we do this resource estimation without explicitly calculating the Hamiltonian matrix elements? If not, how do you propose doing these large-scale resource estimations 'up to thousands of qubits'?
Thanks for your help!
It would be more accurate to say "it can estimate resources for Q# operations that use thousands of qubits, if the classical part of the code can be executed in a reasonable time".
The QDK resource estimator is basically a special simulator which still "executes" the Q# program it gets. Unlike the full state or Toffoli simulators, though, it does not simulate the effect of the gates and measurements on the state of the quantum system. Instead, it increments counters that track the metrics the resource estimator reports. For example, if you use a T gate, it will increment the counter of T gates but will not touch the counter of Pauli gates or CNOTs.
This means that the resource estimator can run much larger programs than the other simulators (the main restriction on the full state simulator comes from the need to update the full state of the system, which grows larger than the available memory around 30-40 qubits). But it still needs to be able to run the program, going through all the gates and all the classical computations involved, even if going through the gates is much more lightweight than on a full state simulator.
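The counting idea can be sketched in a few lines of Python. This is a toy stand-in, not the QDK API: `CountingEstimator`, `toy_program`, and the gate names are hypothetical and only illustrate that each gate call bumps a counter while no quantum state is stored.

```python
from collections import Counter


class CountingEstimator:
    """Toy resource-counting 'simulator' (hypothetical, not the QDK implementation)."""

    def __init__(self):
        self.counts = Counter()
        self.qubits_in_use = 0
        self.max_qubits = 0

    def allocate(self, n):
        self.qubits_in_use += n
        self.max_qubits = max(self.max_qubits, self.qubits_in_use)

    def release(self, n):
        self.qubits_in_use -= n

    def gate(self, name):
        # No state vector is updated; only the matching counter moves.
        self.counts[name] += 1


def toy_program(est, n_qubits):
    # The classical loops still run once per gate, which is the cost that remains.
    est.allocate(n_qubits)
    for _ in range(n_qubits):
        est.gate("T")
    for _ in range(n_qubits - 1):
        est.gate("CNOT")
    est.release(n_qubits)


est = CountingEstimator()
toy_program(est, 5000)
```

Memory use here is independent of the qubit count, but the run time still grows with the number of gates and with whatever classical work is needed to decide which gates to apply.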
I was reading about out-of-order execution (OoOE) and how false dependencies can be removed by register renaming.
But my question is, how can we solve a true dependency (RAW, read after write)?
For example:
R1=R2+R3 #1
R1=R4+R5 #2
R9=R1 #3
Renaming won't help here if the CPU chooses to run #2 before #1.
There is no way to really avoid them, that's why RAW hazards are called true dependencies. Instructions have to wait for their inputs to be ready before they can execute. (With OoO exec, normally CPUs will dispatch the oldest-ready-first instructions / uops to execution units, for example on Intel CPUs.)
True dependencies aren't something you "solve" in the sense of making them go away, they're the essence of computation, the way multiple computations on the same numbers are glued together to form an algorithm. Other hazards (WAR and WAW) are just implementation details, reusing the same architectural register for something different.
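To see how this plays out on the three-instruction example from the question, here is a minimal Python sketch of a renamer. It assumes renaming is done in program order with an unlimited supply of physical registers; the `rename` function and the `P0`, `P1`, ... names are illustrative and do not model any particular CPU.

```python
import itertools


def rename(instructions):
    """Map architectural registers to fresh physical registers, in program order.

    instructions: list of (dest, [sources]) using architectural names.
    Returns a list of (physical_dest, [physical_sources]).
    """
    mapping = {}                                     # architectural -> physical
    fresh = (f"P{i}" for i in itertools.count())     # unlimited free list (assumption)
    renamed = []
    for dest, sources in instructions:
        phys_sources = []
        for reg in sources:
            if reg not in mapping:                   # first use of a live-in register
                mapping[reg] = next(fresh)
            phys_sources.append(mapping[reg])        # read the mapping before updating dest
        mapping[dest] = next(fresh)                  # every write gets a new physical register
        renamed.append((mapping[dest], phys_sources))
    return renamed


program = [
    ("R1", ["R2", "R3"]),   # 1
    ("R1", ["R4", "R5"]),   # 2
    ("R9", ["R1"]),         # 3
]
renamed = rename(program)
```

In this sketch #1 writes `P2` and #2 writes `P5`, so the WAW hazard between them is gone and they can execute in either order. #3 reads `P5`, so it waits only for #2; that remaining RAW dependency is the one that has to be respected.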
Sometimes you can structure an algorithm to have shorter dependency chains, but once things are already nailed down into machine code, the CPU pretty much just has to respect them, with at best out-of-order exec to overlap independent dep chains.
For loads, in theory there's value-prediction, but AFAIK no real CPU is doing that. Mispredictions are expensive, just like for branches. That's why you'd only want to consider that for high-latency stuff like loads that miss in cache, otherwise the gains won't outweigh the costs. (Even then, it's not done because the gains don't outweigh the costs even for loads, including the power / area cost of building a predictor.) As Paul Clayton points out, branch prediction is a form of value prediction (predicting the condition the load was based on). The more instructions you can keep in flight at once with OoO exec, the more you stand to lose from mispredicts, but real CPUs do predict / speculate for memory disambiguation (whether a load reloads an earlier store to an unknown address or not), and (on CPUs like x86 with strongly-ordered memory models) speculating that early loads will turn out to be allowed; as well as the well known case of control dependencies (branch prediction + speculative execution).
The only thing that helps directly with dependency chains is keeping instruction latencies low. e.g. in the case of your example, #3 is just a register copy, which modern x86 CPUs can do with zero latency (mov-elimination), so another instruction dependent on R9 wouldn't have to wait an extra cycle beyond #2 producing a result, because it's handled during register renaming instead of by an execution unit reading an input and producing an output the normal way.
Obviously bypass forwarding from the outputs of execution units to the inputs of the same or others is essential to keep latency low, same as in an in-order classic RISC pipeline.
Or more conventionally, by improving execution units, like AMD Bulldozer family had 2-cycle latency for most SIMD integer instructions, but that improved to 1 cycle for AMD's next design, Zen. (Scalar integer stuff like add was always 1 cycle on any sane high-performance CPU.)
OoO exec with a large enough window size lets you overlap multiple dep chains in parallel (as in this experiment), and of course software should aim to have enough instruction-level parallelism (ILP) for the CPU to find and exploit. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an example of doing that by summing into multiple accumulators for a dot-product.)
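The effect of multiple accumulators can be illustrated with a toy timing model in Python. It assumes at most one add is issued per cycle in program order, each add has a fixed latency, and the final step of combining the accumulators is ignored. The function name and the numbers are illustrative, not measurements of any real CPU.

```python
def cycles_for_sum(num_adds, latency, accumulators):
    """Cycle at which the last add finishes in a simple in-order-issue model."""
    ready = [0] * accumulators       # cycle at which each accumulator's value is available
    finish = 0
    for i in range(num_adds):
        acc = i % accumulators       # round-robin adds across accumulators
        start = max(i, ready[acc])   # wait for the issue slot and for the RAW dependency
        ready[acc] = start + latency
        finish = max(finish, ready[acc])
    return finish


one_chain = cycles_for_sum(12, latency=3, accumulators=1)     # 36: latency-bound
three_chains = cycles_for_sum(12, latency=3, accumulators=3)  # 14: close to throughput-bound
```

With a single accumulator every add waits for the previous one, so the total is the number of adds times the latency. With as many accumulators as the latency in cycles, the chains interleave and the model approaches one add per cycle.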
This is also useful for in-order CPUs if done statically by a compiler, where techniques like "software pipelining" are a big deal to overlap execution of multiple loop iterations because HW isn't finding that parallelism for you. Or for OoO exec CPUs with a limited window size, for loops with long but not loop-carried dependency chains within each iteration.
As long as you're bottlenecked on something other than latency / dependency chains, true dependencies aren't the problem. e.g. a front-end bottleneck, ideally maxed out at the pipeline width, and/or all relevant back-end execution units busy every cycle, mean that you couldn't get more work through the pipeline even if it was independent.
Of course, in a lot of code there are enough dependencies, including through memory, to not reach that ideal situation.
Simultaneous Multithreading (SMT) can help to keep the back-end fed with work to do, increasing throughput by having the front-end of one physical core read multiple instruction streams, acting as multiple logical cores. This effectively creates ILP out of thread-level parallelism, which is useful if software can scale efficiently to more threads, exposing enough TLP to keep all the logical cores busy.
Modern Microprocessors A 90-Minute Guide!
How many CPU cycles are needed for each assembly instruction? - that's not how it works on superscalar OoO exec CPUs; latency or throughput or a front-end bottleneck might be the relevant thing.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
Is there a way to calculate the electricity consumed to load and render a webpage (frontend)? I was thinking of a 'test' made with phantomjs for example:
load a web page
scroll to the bottom
And measure how much electricity was needed. I can perhaps extrapolate from CPU cycles. But phantomjs is headless, and rendering in a real browser is certainly different. Perhaps it's impossible to do real measurements, but with an index it may be possible to compare websites.
Do you have other suggestions?
It's pretty much impossible to measure this internally in modern processors (anything more recent than 286). By internally, I mean by counting cycles. This is because different parts of the processor consume different levels of energy per cycle depending upon the instruction.
That said, you can make your measurements. Stick a power meter between the wall and the processor. Here's a procedure:
Measure the baseline energy usage, i.e. nothing running except the OS and the browser, and the browser completely static (i.e. not doing anything). You need to make sure that everything is at steady state (SS), meaning you start your measurements only after several minutes of idle.
Measure the usage doing the operation you want. Again, you want to avoid any start up and stopping work, so make sure you start measuring at least 15 seconds after you start the operation. Stopping isn't an issue since the browser will execute any termination code after you finish your measurement.
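The two measurements above combine into a net energy figure roughly as follows. This is a minimal Python sketch that assumes the power meter gives evenly spaced readings in watts; the function names and sample values are hypothetical.

```python
def mean(samples):
    return sum(samples) / len(samples)


def net_energy_joules(baseline_watts, active_watts, duration_s):
    """Energy attributable to the operation: (active power - baseline power) * time."""
    return (mean(active_watts) - mean(baseline_watts)) * duration_s


# Made-up readings: 10 W idle, 13 W average while the page loads, over 5 seconds.
energy = net_energy_joules([10.0, 10.0], [12.0, 14.0], duration_s=5.0)
```

With these made-up numbers the operation accounts for 15 J on top of the idle baseline.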
Sounds simple, right? Unfortunately, because of the nature of your measurements, there are some gotchas.
Do you recall your physics classes (or EE classes) that talked about signal to noise ratios? Well, a scroll down uses very little energy, so the signal (scrolling) is well in the noise (normal background processes). This means you have to take a LOT of samples to get anything useful.
Your browser startup energy usage, or anything else that uses a decent amount of processing, is much easier to measure (better signal to noise ratio).
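How many samples "a LOT" means can be estimated with the standard error of the mean, which shrinks as 1/sqrt(n). The Python sketch below assumes independent samples with a known noise standard deviation; `samples_needed` and its parameters are illustrative names, and the numbers are made up.

```python
import math


def samples_needed(signal_watts, noise_std_watts, target_snr=3.0):
    """Samples required so that signal / (noise_std / sqrt(n)) >= target_snr."""
    return math.ceil((target_snr * noise_std_watts / signal_watts) ** 2)


# A 0.5 W effect buried in 2 W of background noise, versus a 4 W effect.
small_signal = samples_needed(0.5, 2.0)
large_signal = samples_needed(4.0, 2.0)
```

Because the count grows with the square of the noise-to-signal ratio, a signal eight times smaller needs 64 times as many samples in this model.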
Also, make sure you understand the underlying electronics. For example, power is VA (voltage*amperage) where both V and A are in phase. I don't think this will be an issue since I'm pretty sure they are in phase for computers. Also, any decent power meter understands the difference.
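For reference, the relation between apparent power (VA) and real power (W) for sinusoidal voltage and current is P = V_rms * I_rms * cos(phi), where phi is the phase angle between them. A small Python sketch, with illustrative values and a hypothetical function name:

```python
import math


def real_and_apparent_power(v_rms, i_rms, phase_deg):
    """Return (real power in W, apparent power in VA) for sinusoidal V and I."""
    apparent = v_rms * i_rms
    real = apparent * math.cos(math.radians(phase_deg))
    return real, apparent


in_phase = real_and_apparent_power(230.0, 0.5, 0.0)     # real equals apparent
shifted = real_and_apparent_power(230.0, 0.5, 60.0)     # real is about half of apparent
```

When V and A are in phase the two values coincide, which is the case the paragraph above expects for computers.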
I'm guessing you intend to do this for mobile devices. Your measurements will only be roughly the same from processor to processor. This is due to architectural differences from generation to generation, and from manufacturer to manufacturer.
Good luck.
Having looked for a description of the multicore design I keep finding several diagrams, but all of them look somewhat like this:
I know from looking at i7z command output that different cores can run at different frequencies.
This would suggest that the decisions regarding which core will be given a new process and for changing the frequency of the core itself are done either by the operating system or by the control block of the core itself.
My question is: What controls the frequency of each individual core? Is the job of associating a READY process with a specific core placed upon the operating system, or is it done by something within the processor?
Scheduling processes/threads to cores is purely up to the OS. The hardware has no understanding of tasks waiting to run. Maintaining the OS's list of processes that are runnable vs. waiting for I/O is completely a software thing.
Migrating a thread from one core to another is done by kernel code on the original core storing the architectural state to memory, then OS code on the new core restoring that saved state and resuming user-space execution.
Traditionally, frequency and voltage scaling decisions are made by the OS. Take Linux as an example: The decision-making code is called a governor (and also this arch wiki link came up high on google). It looks at things like how often processes have used their entire time slice on the current core. If the governor decides the CPU should run at a different speed, it programs some control registers to implement the change. As I understand it, the hardware takes care of choosing the right voltage to support the requested frequency.
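On Linux, the current governor and frequency for each core are exposed through the cpufreq sysfs files `scaling_governor` and `scaling_cur_freq` (in kHz). The Python sketch below reads them; the `read_cpufreq` helper is a hypothetical name, and it assumes the kernel exposes cpufreq for each core, skipping any core where those files are missing.

```python
from pathlib import Path


def read_cpufreq(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {"governor": str, "cur_khz": int}} for cores that expose cpufreq."""
    info = {}
    for cpu_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*")):
        governor = cpu_dir / "cpufreq" / "scaling_governor"
        cur_freq = cpu_dir / "cpufreq" / "scaling_cur_freq"
        if not (governor.is_file() and cur_freq.is_file()):
            continue                     # no cpufreq support exposed for this core
        info[cpu_dir.name] = {
            "governor": governor.read_text().strip(),
            "cur_khz": int(cur_freq.read_text().strip()),
        }
    return info


if __name__ == "__main__":
    for cpu, values in read_cpufreq().items():
        print(cpu, values["governor"], values["cur_khz"], "kHz")
```

Running it a few times under different loads shows cores reporting different frequencies, which matches what i7z displays.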
As I understand it, the OS running on each core makes decisions independently. On hardware that allows each core to run at different frequencies, the decision-making code doesn't need to coordinate with each other. If running a high frequency on one core requires a high voltage chip-wide, the hardware takes care of that. I think the modern implementation of DVFS (dynamic voltage and frequency scaling) is fairly high-level, with the OS just telling the hardware which of N choices it wants, and the onboard power microcontroller taking care of the details of programming oscillators / clock dividers and voltage regulators.
Intel's "Turbo" feature, which opportunistically boosts the frequency above the max sustainable frequency, does the decision making in hardware. Any time the OS requests the highest advertised frequency, the CPU uses turbo when power and cooling allow.
Intel's Skylake takes this a step further: The OS can hand full control over DVFS to the hardware, optionally with constraints. That lets it react from microsecond to microsecond, rather than on a timescale of milliseconds. This does actually allow better performance in bursty workloads, because more power budget is available for turbo when it's useful. A few benchmarks are bursty enough to observe this, like some browser / javascript ones IIRC.
There was a whole talk about Skylake's new power management at IDF2015, check out the slides and/or archived webcast. The old method is described in a lot of detail there, too, to illustrate the difference, so you should really check it out if you want more detail than my summary. (The list of other IDF talks is here, thanks to Agner Fog's blog for the link)
The core frequency is controlled by a given voltage applied to a core's "oscillator".
This voltage can be changed by the Operating System but it can also be changed by the BIOS itself if a high temperature is detected in the CPU.
The question is as stated in the title. I've been wondering about this; can any expert help?
OK, this was going to be a long answer, so long that I may write an article about it instead. Strangely enough, I've been working on experiments that are closely related to your question -- determining performance per watt for a modern processor. As Paul and Sneftel indicated, it's not really possible with any real architecture today. You can probably compute this if you are looking at only the execution of that instruction given a certain silicon technology and a certain ALU design through calculating gate leakage and switching currents, voltages, etc. But that isn't a useful value because there is something always going on (from a HW perspective) in any processor newer than an 8086, and instructions haven't been executed in isolation since a pipeline first came into being.
Today, we have multi-function ALUs, out-of-order execution, multiple pipelines, hyperthreading, branch prediction, memory hierarchies, etc. What does this have to do with the execution of one ADD command? The energy used to execute one ADD command is different from the execution of multiple ADD commands. And if you wrap a program around it, then it gets really complicated.
So let's look at what you can do.
Statistically measure running a given add over and over again. Remember that there are many different types of adds such as integer adds, floating-point, double precision, adds with carries, and even simultaneous adds (SIMD) to name a few. Limits: OSs and other apps are always there, though you may be able to run on bare metal if you know how; varies with different hardware, silicon technologies, architecture, etc; probably not useful because it is so far from reality that it means little; limits of measurement equipment (using interprocessor PMUs, from the wall meters, interposer socket, etc); memory hierarchy; and more
Statistically measuring an integer/floating-point/double -based workload kernel. This is beginning to have some meaning because it means something to the community. Limits: Still not real; still varies with architecture, silicon technology, hardware, etc; measuring equipment limits; etc
Statistically measuring a real application. Limits: same as above but it at least means something to the community; power states come into play during periods of idle; potentially cluster issues come into play.
When I say "Limits", that just means you need to clearly define the constraints of your answer / experiment, not that it isn't useful.
SUMMARY: it is possible to come up with a value for one add but it doesn't really mean anything anymore. A value that means anything is way more complicated but is useful and requires a lot of work to find.
By the way, I do think it is a good and important question -- in part because it is so deceptively simple.
I often have to run calculation-intensive simulations using Matlab. These simulations often take a long time and I expect my computer to use all its resources in order for these simulations to be completed in as little time as possible.
However, when I open the Activity Monitor on my computer, processor usage is never above 55% and there is often about 1GB of unused RAM.
My question is: why is the processor not used to its full potential, and is there a safe and easy way to change this? Indeed, it would be great if I could get my simulations to be completed in half the time they currently take!
It's probably because you have a processor with multiple cores and the code you are executing isn't written to run in multiple threads/processes. Unless you specifically write your code to take advantage of multiple cores, it will only be able to use a single core at a time.
A relatively easy way to enable parallel computing is to use the Parallel Computing Toolbox.
Additionally you might consider reading this: