For a Single Cycle CPU How Much Energy Required For Execution Of ADD Command - cpu-architecture

The question is obvious like specified in the title. I wonder this. Any expert can help?

OK, this is was going to be a long answer, so long that I may write an article about it instead. Strangely enough, I've been working on experiments that are closely related to your question -- determining performance per watt for a modern processor. As Paul and Sneftel indicated, it's not really possible with any real architecture today. You can probably compute this if you are looking at only the execution of that instruction given a certain silicon technology and a certain ALU design through calculating gate leakage and switching currents, voltages, etc. But that isn't a useful value because there is something always going on (from a HW perspective) in any processor newer than an 8086, and instructions haven't been executed in isolation since a pipeline first came into being.
Today, we have multi-function ALUs, out-of-order execution, multiple pipelines, hyperthreading, branch prediction, memory hierarchies, etc. What does this have to do with the execution of one ADD command? The energy used to execute one ADD command is different from the execution of multiple ADD commands. And if you wrap a program around it, then it gets really complicated.
SORT-OF-AN-ANSWER:
So let's look at what you can do.
Statistically measure running a given add over and over again. Remember that there are many different types of adds such as integer adds, floating-point, double precision, adds with carries, and even simultaneous adds (SIMD) to name a few. Limits: OSs and other apps are always there, though you may be able to run on bare metal if you know how; varies with different hardware, silicon technologies, architecture, etc; probably not useful because it is so far from reality that it means little; limits of measurement equipment (using interprocessor PMUs, from the wall meters, interposer socket, etc); memory hierarchy; and more
Statistically measuring an integer/floating-point/double -based workload kernel. This is beginning to have some meaning because it means something to the community. Limits: Still not real; still varies with architecture, silicon technology, hardware, etc; measuring equipment limits; etc
Statistically measuring a real application. Limits: same as above but it at least means something to the community; power states come into play during periods of idle; potentially cluster issues come into play.
When I say "Limits", that just means you need to well define the constraints of your answer / experiment, not that it isn't useful.
SUMMARY: it is possible to come up with a value for one add but it doesn't really mean anything anymore. A value that means anything is way more complicated but is useful and requires a lot of work to find.
By the way, I do think it is a good and important question -- in part because it is so deceptively simple.

Related

Define the minimal configuration for a program

So I have my .exe ready to deploy, and for distribution, I need to know the minimal requirements for my program to run on a machine... and I really don't know how to do that.
Is there a way to know that ? Some kind of benchmark ? Or must I just set things as I think it'll work ?
Maybe should I just buy all existing components until I find the minimal ? :')
Well, thanks for your answers.
Start by seeing the first Windows' version you can deploy on (Windows XP? Vista?).
If your program is cpu or gpu intensive, and has a fixed time loop (eg. game) then you'll have to do benchmarks.
You should look at several old vs new CPUs/GPUs and trying to "guess" based on online specs posted online what the minimal requirement is. For example, if your program can't run on an old cpu, but runs blazingly fast on a new one, try to find the model that -barely- runs it, which will obviously be one somewhere in the middle.
If your program requires other special things, specify them (eg. USB 3.0, controllers supported...).
Otherwise, if your program loads slower but doesn't have runtime issues, the minimum specs should be indicative of a reasonable loading time (a minute seems to be the standard now, sadly).
Additionally, if your program is memory hungry (both hard drive or RAM), you must indicate this.
For hard drive memory, simply state your program size, along with the files included with it.
For RAM, use a profiler - it will tell you how much memory your program is using.
I've completely skipped over the fact that, in some computers, the bottleneck might be the cpu, and it might be the gpu in others. You need to know which is the bottleneck to make your judgement.
To find out is a rather simple process - remove expensive gpu operations (lower texture resolutions, turn off shaders). If the program still runs slowly, then the bottleneck is the cpu.
edit: this is a simplification of the problem and hardware is a little more complicated than this (slower multi-core cpus vs faster one-core cpus vary their performance depending on how many cores a program uses and how / a program may require a gpu to have less memory but more processing power, or the opposite... even heat dissipation can affect component efficiency: your program might run fine for 20 minutes but start to slow down if the cpu isn't cooled down properly), but "minimal hardware requirements" aren't exactly precise so this method is appropriate.
tl;dr of the spoiler:
In short, there are so many factors that affect performance that you can't measure it, so just a rough estimation is good.

Measure the electricity consumed by a browser to render a webpage

Is there a way to calculate the electricity consumed to load and render a webpage (frontend)? I was thinking of a 'test' made with phantomjs for example:
load a web page
scroll to the bottom
And measure how much electricity was needed. I can perhaps extrapolate from CPU cycle. But phantomjs is headless, rendering in real browser is certainly different. Perhaps it's impossible to do real measurements.. but with an index it may be possible to compare websites.
Do you have other suggestions?
It's pretty much impossible to measure this internally in modern processors (anything more recent than 286). By internally, I mean by counting cycles. This is because different parts of the processor consume different levels of energy per cycle depending upon the instruction.
That said, you can make your measurements. Stick a power meter between the wall and the processor. Here's a procedure:
Measure the baseline energy usage, i.e. nothing running except the OS and the browser, and the browser completely static (i.e. not doing anything). You need to make sure that everything is stead state (SS) meaning start your measurements only after several minutes of idle.
Measure the usage doing the operation you want. Again, you want to avoid any start up and stopping work, so make sure you start measuring at least 15 seconds after you start the operation. Stopping isn't an issue since the browser will execute any termination code after you finish your measurement.
Sounds simple, right? Unfortunately, because of the nature of your measurements, there are some gotchas.
Do you recall your physics classes (or EE classes) that talked about signal to noise ratios? Well, a scroll down uses very little energy, so the signal (scrolling) is well in the noise (normal background processes). This means you have to take a LOT of samples to get anything useful.
Your browser startup energy usage, or anything else that uses a decent amount of processing, is much easier to measure (better signal to noise ratio).
Also, make sure you understand the underlying electronics. For example, power is VA (voltage*amperage) where both V and A are in phase. I don't think this will be an issue since I'm pretty sure they are in phase for computers. Also, any decent power meter understands the difference.
I'm guessing you intend to do this for mobile devices. Your measurements will only be roughly the same from processor to processor. This is due to architectural differences from generation to generation, and from manufacturer to manufacturer.
Good luck.

When to use Policy Iteration instead of Value Iteration

I'm currently studying dynamic programming solutions to Markov Decision Processes. I feel like I've got a decent grip on VI and PI and the motivation for PI is pretty clear to me (converging on the correct state utilities seems like unnecessary work when all we need is the correct policy). However, none of my experiments show PI in a favourable light in terms of runtime. It seems to consistently take longer regardless of the size of the state space and discount factor.
This could be due the implementation (I'm using the BURLAP library), or poor experimentation on my part. However, even the trends don't seem to show a benefit. It should be noted that the BURLAP implementation of PI is actually "modified policy iteration" which runs a limited VI variant at each iteration. My question to you is do you know of any situations, theoretical or practical, in which (modified) PI should outperform VI?
Turns out that Policy Iteration, specifically Modified Policy Iteration, can outperform Value Iteration when the discount factor (gamma) is very high.
http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a.pdf

Why predict a branch, instead of simply executing both in parallel?

I believe that when creating CPUs, branch prediction is a major slow down when the wrong branch is chosen. So why do CPU designers choose a branch instead of simply executing both branches, then cutting one off once you know for sure which one was chosen?
I realize that this could only go 2 or 3 branches deep within a short number of instructions or the number of parallel stages would get ridiculously large, so at some point you would still need some branch prediction since you definitely will run across larger branches, but wouldn't a couple stages like this make sense? Seems to me like it would significantly speed things up and be worth a little added complexity.
Even just a single branch deep would almost half the time eaten up by wrong branches, right?
Or maybe it is already somewhat done like this? Branches usually only choose between two choices when you get down to assembly, correct?
You're right in being afraid of exponentially filling the machine, but you underestimate the power of that.
A common rule-of-thumb says you can expect to have ~20% branches on average in your dynamic code. This means one branch in every 5 instructions. Most CPUs today have a deep out-of-order core that fetches and executes hundreds of instructions ahead - take Intels' Haswell for e.g., it has a 192 entries ROB, meaning you can hold at most 4 levels of branches (at that point you'll have 16 "fronts" and 31 "blocks" including a single bifurcating branch each - assuming each block would have 5 instructions you've almost filled your ROB, and another level would exceed it). At that point you would have progressed only to an effective depth of ~20 instructions, rendering any instruction-level parallelism useless.
If you want to diverge on 3 levels of branches, it means you're going ot have 8 parallel contexts, each would have only 24 entries available to run ahead. And even that's only when you ignore overheads for rolling back 7/8 of your work, the need to duplicate all state-saving HW (like registers, which you have dozens of), and the need to split other resources into 8 parts like you did with the ROB. Also, that's not counting memory management which would have to manage complicated versioning, forwarding, coherency, etc.
Forget about power consumption, even if you could support that wasteful parallelism, spreading your resources that thin would literally choke you before you could advance more than a few instructions on each path.
Now, let's examine the more reasonable option of splitting over a single branch - this is beginning to look like Hyperthreading - you split/share your core resources over 2 contexts. This feature has some performance benefits, granted, but only because both context are non-speculative. As it is, I believe the common estimation is around 10-30% over running the 2 contexts one after the other, depending on the workload combination (numbers from a review by AnandTech here) - that's nice if you indeed intended to run both the tasks one after the other, but not when you're about to throw away the results of one of them. Even if you ignore the mode switch overhead here, you're gaining 30% only to lose 50% - no sense in that.
On the other hand, you have the option of predicting the branches (modern predictors today can reach over 95% success rate on average), and paying the penalty of misprediction, which is partially hidden already by the out-of-order engine (some instructions predating the branch may execute after it's cleared, most OOO machines support that). This leaves any deep out-of-order engine free to roam ahead, speculating up to its full potential depth, and being right most of the time. The odds of flusing some of the work here do decrease geometrically (95% after the first branch, ~90% after the second, etc..), but the flush penalty also decreases. It's still far better than a global efficiency of 1/n (for n levels of bifurcation).

How to find the time value of operation to optimize new algorithm design?

My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)
On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach at an estimation for the processor load is OK, though, and is sensible: Make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative time is usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one so thorough for other platforms, except processor / instruction set manuals, but they deal with assembly instructions)
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.
xCode has instruments to measure performance of each function/operation, you can simply use them.