GHz to MIPS? Rough estimate anyone? - cpu-architecture

From the research I have done so far, I have learned that MIPS is highly dependent on the application being run, or the language.
But can anyone give me their best guess for a 2.5 GHz computer in MIPS? Or any other number of GHz?
C++ if that helps.

MIPS stands for "Million Instructions Per Second", but that value becomes difficult to calculate for modern computers. Many processor architectures (such as x86 and x86_64, which make up most desktop and laptop computers) fall into the CISC category of processors. CISC architectures often contain instructions that perform several different tasks at once. One of the consequences of this is that some instructions take more clock cycles than other instructions. So even if you know your clock frequency (in this case 2.5 gigahertz), the number of instructions run per second depends mostly on which instructions a program uses. For this reason, MIPS has largely fallen out of use as a performance metric.
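To make the dependence on instructions per cycle concrete, here is a minimal back-of-the-envelope sketch in C++ (the IPC values are illustrative assumptions, not measurements):
    #include <cstdio>
    
    int main() {
        const double ghz = 2.5;                            // clock frequency in GHz
        const double ipc_values[] = {0.5, 1.0, 2.0, 4.0};  // assumed average instructions per cycle
        for (double ipc : ipc_values) {
            double mips = ghz * 1e9 * ipc / 1e6;           // instructions per second, scaled to millions
            std::printf("assumed IPC %.1f -> %6.0f MIPS\n", ipc, mips);
        }
        return 0;
    }
At 2.5 GHz this prints 1250 MIPS for an IPC of 0.5 and 10,000 MIPS for an IPC of 4, which brackets the range discussed in the answers below.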

For some of my many benchmarks, identified at
http://www.roylongbottom.org.uk/
I produce an assembly code listing from which the actual assembler instructions used can be counted (note that these are not the actual micro-instructions used by RISC processors). The following includes %MIPS/MHz calculations based on these and other MIPS assumptions.
http://www.roylongbottom.org.uk/cpuspeed.htm
The results only apply to Intel CPUs. You will see that MIPS results depend on whether CPU, cache or RAM data is being used. For a modern CPU at 2500 MHz, likely MIPS are between 1250 and 9000 when working from CPU/L1 cache data, but much less when accessing data in RAM. Then there are SSE SIMD integer instructions. Real integer MIPS for simple register-based additions are shown in:
http://www.roylongbottom.org.uk/whatcpu%20results.htm#anchorC2D
where my 2.4 GHz Core 2 CPU is shown to run at up to 17531 MIPS.
Roy
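As a rough illustration of the kind of register-based addition loop such integer MIPS figures come from, here is a hypothetical C++ sketch (not Roy's actual benchmark code); compile with optimization, e.g. g++ -O2:
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    
    int main() {
        const std::uint64_t iterations = 200000000;  // 200 million loop iterations
        std::uint64_t a = 1, b = 2, c = 3, d = 4;    // unsigned so wraparound is well defined
    
        auto start = std::chrono::steady_clock::now();
        for (std::uint64_t i = 0; i < iterations; ++i) {
            a += b;                                  // four register additions per iteration
            b += c;
            c += d;
            d += a;
        }
        auto stop = std::chrono::steady_clock::now();
    
        double seconds = std::chrono::duration<double>(stop - start).count();
        // Count only the four adds per iteration and ignore loop overhead,
        // so this is an order-of-magnitude figure, not a precise one.
        std::printf("~%.0f add-MIPS (checksum %llu)\n",
                    4.0 * iterations / seconds / 1e6,
                    static_cast<unsigned long long>(a + b + c + d));
        return 0;
    }
Printing the checksum keeps the compiler from removing the loop entirely.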

MIPS officially stands for Million Instructions Per Second, but the Hacker's Dictionary defines it as Meaningless Indication of Processor Speed. This is because many companies use the theoretical maximum for marketing, which is never achieved in real applications. E.g. current Intel processors can execute up to 4 instructions per cycle; following this logic, at 2.5 GHz that would be 10,000 MIPS. In real applications, of course, this number is never reached. Another problem, which slavik already mentions, is that instructions do different amounts of useful work. There are even NOPs, which by definition do nothing useful yet still contribute to the MIPS rating.
To correct this, people began using Dhrystone MIPS in the 1980s. Dhrystone is a synthetic benchmark (i.e. it is not based on a useful program) and one Dhrystone MIPS is defined relative to the benchmark performance of a VAX 11/780. This is only slightly less ridiculous than the definition above.
Today, performance is commonly measured by SPEC CPU benchmarks, which are based on real-world programs. If you know these benchmarks and your own applications very well, you can make reasonable predictions of performance without actually running your application on the CPU in question.
The key is to understand that performance will vary widely based on a number of characteristics. E.g. there used to be a program called The Many Faces of Go which essentially hard-codes knowledge about the board game in many conditional if-clauses. The performance of this program is almost entirely determined by the branch predictor. Other programs use huge amounts of memory that does not fit into any cache; the performance of these programs is determined by the bandwidth and/or latency of the main memory. Some applications may depend heavily on the throughput of floating point instructions, while other applications never use any floating point instructions. You get the idea. An accurate prediction is impossible without knowing the application.
Having said all that, an average number would be around 2 instructions per cycle and 5,000 MIPS @ 2.5 GHz. However, real numbers can easily be ten or even a hundred times lower.
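If you want a measured number rather than a guess, one option on Linux is to let perf count retired instructions for your actual program, for example with perf stat ./your_program (where ./your_program is a placeholder). MIPS is then simply instructions / (elapsed seconds × 1,000,000), and the "insn per cycle" figure that perf prints can be compared against the theoretical maximum of 4 mentioned above.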

Related

perf power consumption measure: How does it work?

I noticed that perf list now has the option to measure power consumption. You can use it as follows:
    $ perf stat -e power/energy-cores/ ./a.out
    Performance counter stats for 'system wide':
        8.55 Joules power/energy-cores/
        0.949871058 seconds time elapsed
How accurate is this measurement, and how does perf estimate the power consumption?
The power/energy-cores/ perf counter is based on an MSR register called MSR_PP0_ENERGY_STATUS, which is part of the Intel RAPL interface (Intel seems to call each individual RAPL MSR a RAPL interface). A complicated model based on system activity events is used to estimate (static and dynamic) energy consumption. The MSR register name has PP0 in it, which refers to power plane 0, which is one of the RAPL domains that contains all the cores of the socket including the private caches of the cores. PP0, however, excludes the last-level cache, the interconnect, the memory controller, the graphics processor, and everything else that is in the uncore. It's impossible to measure the accuracy of MSR_PP0_ENERGY_STATUS because there is no other way to estimate the energy consumption of power plane 0 only.
It's possible to measure the accuracy of other RAPL domains, though. These include the Package, DRAM, and PSys domains. For example, the accuracy of the Package domain energy estimation can be measured by comparing against the energy consumption of the whole system (which can be measured using a power meter) and running a workload that keeps the energy consumption of everything outside the package as close to a known constant as possible. The accuracy of MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS has been measured in different ways by different people on many different processors. You can refer to the recent paper entitled RAPL in Action: Experiences in Using RAPL for Power Measurements for more information, which also includes summaries of previous works. The paper covers Sandy Bridge, Ivy Bridge, Haswell, and Skylake. The conclusion is that MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS appear to be accurate on Haswell and Skylake (the implementation changed on Haswell; see An Energy Efficiency Feature Survey of the Intel Haswell Processor). But this is not necessarily true for all kinds of workloads, P-states, and processors, so the accuracy does not just depend on the microarchitecture.
The RAPL interface is discussed in Section 14.9 of the Intel Manual Volume 3. I noticed there are errors in that section. For example, it says client processors don't support the DRAM domain, which is not true: the client Haswell processor I'm using to write this answer supports the DRAM domain. The section is probably outdated and applies only to Sandy Bridge and Ivy Bridge processors. I think it's better to read the datasheet of the processor on which you want to use RAPL.
The power/energy-pkg/ perf counter can be used to measure the energy consumption of the package domain. This is the only domain that is known to be supported on all Intel processors starting from Sandy Bridge.
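For illustration, here is a minimal C++ sketch of reading the underlying counter directly on Linux, assuming root, the msr kernel module (modprobe msr), and an Intel CPU; the MSR addresses (0x606 for MSR_RAPL_POWER_UNIT, 0x639 for MSR_PP0_ENERGY_STATUS) are the documented RAPL ones, but treat this as an illustration of the mechanism rather than a replacement for perf:
    // Read MSR_PP0_ENERGY_STATUS via the Linux msr driver and convert the raw
    // counter to joules using the energy unit from MSR_RAPL_POWER_UNIT.
    #include <cstdint>
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>
    
    static std::uint64_t read_msr(int fd, std::uint32_t reg) {
        std::uint64_t value = 0;
        if (pread(fd, &value, sizeof(value), reg) != sizeof(value))
            std::perror("pread");
        return value;
    }
    
    int main() {
        const std::uint32_t MSR_RAPL_POWER_UNIT   = 0x606;
        const std::uint32_t MSR_PP0_ENERGY_STATUS = 0x639;
    
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { std::perror("open /dev/cpu/0/msr"); return 1; }
    
        // Bits 12:8 of MSR_RAPL_POWER_UNIT give the energy status unit as 1/2^ESU joules.
        std::uint64_t units = read_msr(fd, MSR_RAPL_POWER_UNIT);
        double energy_unit = 1.0 / static_cast<double>(1ULL << ((units >> 8) & 0x1f));
    
        // The energy status counter itself is a 32-bit value that wraps around.
        std::uint64_t raw = read_msr(fd, MSR_PP0_ENERGY_STATUS) & 0xffffffffULL;
        std::printf("PP0 energy counter: %.3f J (unit = %g J)\n", raw * energy_unit, energy_unit);
    
        close(fd);
        return 0;
    }
Sampling the counter twice around a workload and taking the difference gives the energy consumed, which is essentially what perf does for you.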
On x86 systems, these values are based on RAPL (Running Average Power Limit), an interface that provides built-in CPU energy counters. While originally designed by Intel, AMD also provides a compatible interface on Zen systems.
The accuracy depends on the actual microarchitecture. Originally, RAPL was backed by a model with certain biases. On Intel CPUs since the Haswell architecture, it is based on measurements which are quite accurate. As far as I know there is no good understanding of the accuracy on AMD's Zen RAPL implementation.
One important thing you have to consider is the scope of the measurements. On most systems, only the package and DRAM domains are covered1. So if you need to know how much power / energy your entire system consumes, you usually cannot easily answer that with RAPL.
Also note that RAPL is updated every 1 ms, so short workloads will have significant errors from the update rate.
1 - Skylake desktop systems can implement full-system RAPL. Its accuracy depends on the manufacturer.
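As a rough worked example of the update-rate point: if the counter only advances once per millisecond, a measurement window of T milliseconds carries an uncertainty of roughly one update, i.e. a relative error on the order of 1/T. A 10 ms workload can therefore be off by around 10%, while a run of several seconds pushes the quantization error well below 0.1%.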

What is the point of on-chip hardware accelerators, instead of that functionality being added as an instruction to the ISA?

I get that if a specialized operation is known to be common, it makes sense to do it in hardware. But at that point, why not make it a part of the ISA so it can be even faster?
Is there a benefit to making it a co-processor that communicates through shared memory?
This is a bit hand-wavy because I don't actually design hardware, but I think I know enough to say something that's at least plausible.
Adding it to the ISA means it has to be fairly tightly coupled to the pipeline, which doesn't work well for things like integrated GPUs, which have their own specialized hardware and can, for example, filter out which pixels even need to be processed in dedicated logic instead of with software branching.
Even considering less complicated accelerators (e.g. for crypto):
Especially on simpler CPUs without out-of-order exec and large reordering windows, high-latency HW accelerators could stall the pipeline and stop it from getting other work done while waiting for a result.
Intel does tend to add things to the ISA, such as AES and SHA, because mainstream x86 CPUs do have the instruction throughput and vector registers to feed data to execution units that do one round of AES, for example.
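As a small sketch of what that looks like from the software side (assuming a CPU with AES-NI and a compiler flag such as -maes; the round key below is a dummy value and the key schedule is omitted):
    #include <immintrin.h>   // AES-NI and SSE intrinsics (compile with e.g. g++ -O2 -maes)
    #include <cstdint>
    #include <cstdio>
    
    int main() {
        // 16-byte state block and a dummy round key (a real cipher derives its
        // round keys from the secret key with a key schedule, omitted here).
        __m128i state     = _mm_set_epi32(0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100);
        __m128i round_key = _mm_set_epi32(0x33333333, 0x22222222, 0x11111111, 0x00000000);
    
        // aesenc performs ShiftRows, SubBytes, MixColumns and the round-key XOR in one instruction.
        __m128i out = _mm_aesenc_si128(state, round_key);
    
        std::uint8_t bytes[16];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(bytes), out);
        for (int i = 0; i < 16; ++i) std::printf("%02x", bytes[i]);
        std::printf("\n");
        return 0;
    }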
If the accelerator is physically large but usually not needed by multiple cores at once, having groups of cores share one is more natural with some kind of co-processor arrangement to insulate the core from the round-trip latency of going off-core to compute something.
As for GPUs: a GPU has more computational throughput than you can feed through the superscalar pipeline of a normal CPU. The FLOPS of an integrated GPU is typically much greater than that of a single core of a modern Intel CPU, even one with 2x 256-bit FMA units. So you'd need a CPU instruction like "run shader" that runs a GPU program using its own separately-programmable machine code. GPU instruction scheduling is lighter weight than that of even a normal in-order CPU.

MATLAB program simulation with the given processor requirements

I have a system with an Intel(R) Core(TM) i3-5020U CPU @ 2.2 GHz and 4 GB RAM. But in order to compare the performance of my MATLAB program in terms of execution time, I need to execute it on a machine with an Intel(R) Core(TM) i5-3570 CPU @ 3.40 GHz and 16 GB RAM. Is there a way to perform this kind of simulation?
TL:DR: No. Performance differences between Broadwell and IvyBridge depend on lots of complicated details. (See Agner Fog's microarch pdf for the low-level microarchitectural details, and also other stuff in the x86 tag wiki)
It's likely that performance will scale with either clock speed or memory speed within maybe 10%, even between different microarchitectures, but it might not.
Using your own system, you can probably figure out how your code scales with CPU frequency, by forcing it to stay at minimum frequency for a test run. If it's a lot less than perfect scaling, then memory speed is a big factor. (The slower your CPU, the fewer cycles are spent waiting for memory.)
You can't extrapolate IvB i5 3.4GHz performance from BDW 2.2GHz performance without knowing a lot more details about exactly what your code bottlenecks on. It's possible that it bottlenecks on the same simple thing on both CPUs, in which case you could extrapolate. e.g. if it turns out that it bottlenecks on FP multiply latency, then run-time on IvB would be 5/3rds the run time on Broadwell (times the clock frequency ratio), since BDW has 3 cycle FP multiply and add, but SnB/IvB/Haswell have 5 cycle multiply. (FMA is 5 cycles on BDW, if I recall correctly. IvB doesn't support FMA, so if Matlab takes advantage of that on BDW, it's not even running the same machine code).
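To make that hypothetical concrete: run time scales as cycles divided by frequency, so if the code really were bound purely by FP multiply latency, Time(IvB) / Time(BDW) = (5/3) × (2.2 GHz / 3.4 GHz) ≈ 1.08, i.e. the nominally faster 3.4 GHz machine would come out roughly 8% slower in that particular case.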
More likely, it's not that simple and cache / memory performance comes into it, too. Haswell/Broadwell don't have L1 cache-bank conflicts, but SnB/IvB do.
Depending on how you run the workload on the i5 CPU, it might or might not be able to turbo up to higher than its rated 3.4GHz, further confounding any attempt to guess at performance.
It's hard to measure practical efficiency across different computers. That's why theoretical efficiency (Big-O) is usually used instead; see the Wikipedia pages on algorithm efficiency and Big-O notation.
If you have access to both codes (yours and the other person's), you can test them on the same computer with the performance-measurement methods proposed by MathWorks, which are mainly timing functions for real time and CPU time.
Lastly, there are several challenges around benchmarking that might be interesting to consider.

64-bit Advantages for Discrete Event Simulation

As I understand it, Intel 64-bit CPUs offer the ability to address a larger address space (>4 GB), which is useful for a large simulation. Interesting architectural hardware advantages:
16 general purpose registers instead of 8
Additional SSE registers
A no execute (NX) bit to prevent buffer overrun attacks
BACKGROUND
Historically, the simulations have been performed on 32-bit IA (Intel Architecture) systems. I am wondering where (if anywhere) there is an opportunity to reduce simulation times with 64-bit CPUs. I expect that the software would need to be recompiled to take advantage of 64-bit capability. This type of simulation would not benefit from a MAC (multiply and accumulate), nor does it use floating point calculations.
QUESTION
That being said, is there an Intel 64-bit instruction or capability that offers an appreciable advantage over the 32-bit instruction set and would accelerate the simulation (computationally intensive and lengthy 32-bit algorithms)?
If you have experience implementing simulations and have transitioned from 32 to 64 bit CPUs, please state this in your response (relevant experience is important). I look forward to insightful responses from the community
The most immediate computational benefit regarding CPU instructions I can think of would be AVX, although this is only loosely related to x86_64 and is more of a CPU-generational issue.
In our company, we developed multiple, highly complex discrete event simulations, simulating aircraft (including electrics, hydraulics, avionics software and everything related). They are all built with or ported to x86_64. The reasons are mostly memory addressing, allowing for larger caches and a wider choice of algorithms (e.g. data-centric design, concurrency); graphics content also tends to be huge nowadays. However, optimizations regarding x86_64 instructions themselves, such as AVX, are left to compilers. I never saw code written in assembler or using compiler intrinsics to actually refer to specific x86_64 instructions explicitly.
To summarize, based on my experience, x86_64 CPUs allow for certain optimizations, often sacrificing memory consumption in favor of CPU processing:
Wider choice of algorithms, especially regarding concurrency, where data may need to be laid out in a way favoring parallel processing at the cost of occupied memory
Intermediate results or other processing output may be cached more easily in memory to avoid recomputation or to optimize for temporal or state-related coherence
AVX instructions may help compilers to vectorize more code than with MMX/SSE
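As a small illustration of the last point (assuming GCC or Clang and an AVX2-capable CPU), a trivially vectorizable loop like the one below is compiled to 256-bit vector adds with -O3 -mavx2, processing eight 32-bit ints per instruction instead of four with SSE:
    #include <cstddef>
    
    // The compiler can auto-vectorize this loop; with -O3 -mavx2 it emits 256-bit
    // vector additions (vpaddd on ymm registers) instead of 128-bit SSE ones.
    void add_arrays(const int* a, const int* b, int* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }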

What is responsible for changing core's load and frequency in multicore processor

Having looked for a description of multicore design, I keep finding several diagrams, but they all look broadly similar.
I know from looking at i7z command output that different cores can run at different frequencies.
This would suggest that the decisions about which core will be given a new process, and about changing the frequency of the core itself, are made either by the operating system or by the control block of the core itself.
My question is: what controls the frequency of each individual core? Is the job of associating a READY process with a specific core placed upon the operating system, or is it done by something within the processor?
Scheduling processes/threads to cores is purely up to the OS. The hardware has no understanding of tasks waiting to run. Maintaining the OS's list of processes that are runnable vs. waiting for I/O is completely a software thing.
Migrating a thread from one core to another is done by kernel code on the original core storing the architectural state to memory, then OS code on the new core restoring that saved state and resuming user-space execution.
Traditionally, frequency and voltage scaling decisions are made by the OS. Take Linux as an example: The decision-making code is called a governor (and also this arch wiki link came up high on google). It looks at things like how often processes have used their entire time slice on the current core. If the governor decides the CPU should run at a different speed, it programs some control registers to implement the change. As I understand it, the hardware takes care of choosing the right voltage to support the requested frequency.
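As a minimal sketch of the user-visible side of that Linux machinery (assuming the cpufreq sysfs files are present), the active governor and the frequency currently chosen for core 0 can simply be read from sysfs:
    #include <fstream>
    #include <iostream>
    #include <string>
    
    // Read a single line from a sysfs file; returns an empty string if the file does not exist.
    static std::string read_sysfs(const std::string& path) {
        std::ifstream f(path);
        std::string value;
        std::getline(f, value);
        return value;
    }
    
    int main() {
        const std::string base = "/sys/devices/system/cpu/cpu0/cpufreq/";
        std::cout << "governor:      " << read_sysfs(base + "scaling_governor") << "\n"
                  << "current (kHz): " << read_sysfs(base + "scaling_cur_freq") << "\n"
                  << "min (kHz):     " << read_sysfs(base + "scaling_min_freq") << "\n"
                  << "max (kHz):     " << read_sysfs(base + "scaling_max_freq") << "\n";
        return 0;
    }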
As I understand it, the OS running on each core makes decisions independently. On hardware that allows each core to run at a different frequency, the decision-making code for the different cores doesn't need to coordinate. If running a high frequency on one core requires a high voltage chip-wide, the hardware takes care of that. I think the modern implementation of DVFS (dynamic voltage and frequency scaling) is fairly high-level, with the OS just telling the hardware which of N choices it wants, and the onboard power microcontroller taking care of the details of programming oscillators / clock dividers and voltage regulators.
Intel's "Turbo" feature, which opportunistically boosts the frequency above the max sustainable frequency, does the decision making in hardware. Any time the OS requests the highest advertised frequency, the CPU uses turbo when power and cooling allow.
Intel's Skylake takes this a step further: The OS can hand full control over DVFS to the hardware, optionally with constraints. That lets it react from microsecond to microsecond, rather than on a timescale of milliseconds. This does actually allow better performance in bursty workloads, because more power budget is available for turbo when it's useful. A few benchmarks are bursty enough to observe this, like some browser / javascript ones IIRC.
There was a whole talk about Skylake's new power management at IDF2015, check out the slides and/or archived webcast. The old method is described in a lot of detail there, too, to illustrate the difference, so you should really check it out if you want more detail than my summary. (The list of other IDF talks is here, thanks to Agner Fog's blog for the link)
The core frequency is controlled by a given voltage applied to a core's "oscillator".
This voltage can be changed by the operating system, but it can also be changed by the BIOS itself if a high temperature is detected in the CPU.