What is responsible for changing core's load and frequency in multicore processor - operating-system

Having looked for a description of the multicore design i keep finding several diagrams, but all of them look somewhat like this:
I know from looking at i7z command output that different cores can run at different frequencies.
This would suggest that the decisions regarding which core will be given a new process and for changing the frequency of the core itself are done either by the operating system or by the control block of the core itself.
My question is: What controls the frequencies of each individual core? Is the job of associating a READY process with the specific core placed upon the operating system or is it done by something within the processor.

Scheduling processes/threads to cores is purely up to the OS. The hardware has no understanding of tasks waiting to run. Maintaining the OS's list of processes that are runnable vs. waiting for I/O is completely a software thing.
Migrating a thread from one core to another is done by kernel code on the original core storing the architectural state to memory, then OS code on the new core restoring that saved state and resuming user-space execution.
Traditionally, frequency and voltage scaling decisions are made by the OS. Take Linux as an example: The decision-making code is called a governor (and also this arch wiki link came up high on google). It looks at things like how often processes have used their entire time slice on the current core. If the governor decides the CPU should run at a different speed, it programs some control registers to implement the change. As I understand it, the hardware takes care of choosing the right voltage to support the requested frequency.
As I understand it, the OS running on each core makes decisions independently. On hardware that allows each core to run at different frequencies, the decision-making code doesn't need to coordinate with each other. If running a high frequency on one core requires a high voltage chip-wide, the hardware takes care of that. I think the modern implementation of DVFS (dynamic voltage and frequency scaling) is fairly high-level, with the OS just telling the hardware which of N choices it wants, and the onboard power microcontroller taking care of the details of programming oscillators / clock dividers and voltage regulators.
Intel's "Turbo" feature, which opportunistically boosts the frequency above the max sustainable frequency, does the decision making in hardware. Any time the OS requests the highest advertised frequency, the CPU uses turbo when power and cooling allow.
Intel's Skylake takes this a step further: The OS can hand full control over DVFS to the hardware, optionally with constraints. That lets it react from microsecond to microsecond, rather than on a timescale of milliseconds. This does actually allow better performance in bursty workloads, because more power budget is available for turbo when it's useful. A few benchmarks are bursty enough to observe this, like some browser / javascript ones IIRC.
There was a whole talk about Skylake's new power management at IDF2015, check out the slides and/or archived webcast. The old method is described in a lot of detail there, too, to illustrate the difference, so you should really check it out if you want more detail than my summary. (The list of other IDF talks is here, thanks to Agner Fog's blog for the link)

The core frequency is controlled by a given voltage applied to a core's "oscillator".
This voltage can be changed by the Operating System but it can also be changed by the BIOS itself if a high temperature is detected in the CPU.

Related

perf power consumption measure: How does it work?

I noticed that perf list now has the option to measure power consumption. You can use it as follows:
$ perf stat -e power/energy-cores/ ./a.out
Performance counter stats for 'system wide':
8.55 Joules power/energy-cores/
0.949871058 seconds time elapsed
How accurate is this measurement, and how does perf estimate the power consumption?
The power/energy-cores/ perf counter is based on an MSR register called MSR_PP0_ENERGY_STATUS, which is part of the Intel RAPL interface (Intel seems to call each individual RAPL MSR a RAPL interface). A complicated model based on system activity events is used to estimate (static and dynamic) energy consumption. The MSR register name has PP0 in it, which refers to power plane 0, which is one of the RAPL domains that contains all the cores of the socket including the private caches of the cores. PP0, however, excludes the last-level cache, the interconnect, the memory controller, the graphics processor, and everything else that is in the uncore. It's impossible to measure the accuracy of MSR_PP0_ENERGY_STATUS because there is no other way to estimate the energy consumption of power plane 0 only.
It's possible to measure the accuracy of other RAPL domains though. These include the Package, DRAM, and PSys domains. For example, the accuracy of the Package domain energy estimation can be measured by comparing against the energy consumption of the whole system (which can be measured using a power meter) and running a workload that keeps the energy consumption of everything outside the package a known constant as much as possible. The accuracy of MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS have been measured in different ways by different people on many different processors. You can refer to the recent paper entitled RAPL in Action: Experiences in Using RAPL for Power Measurements for more information, which also includes summaries of previous works. The paper covers Sandy Bridge, Ivy Bridge, Haswell, and Skylake. The conclusion is that MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS appear to be accurate on Haswell and Skylake (the implementation has changed on Haswell, see : An Energy Efficiency Feature Survey of the Intel Haswell Processor). But this is not necessarily true on all kinds of workloads, P states, and processors. So the accuracy does not just depend on the microarchitecture.
The RAPL interface is discussed in Section 14.9 of the Intel Manual Volume 3. I noticed there are errors in the section. For example, it says client processors don't support the DRAM domain, which is not true. The client Haswell processor I'm using to write this answer supports the DRAM domain. The section is probably outdated and applies only Sandy Bridge and Ivy Bridge processors. I think it's better to read the datasheet of the processor on which you want to use RAPL.
The power/energy-pkg/ perf counter can be used to measure energy consumption of the package domain. This is the only domain that is known be supported on all Intel processors starting from Sandy Bridge.
On x86 systems, these values are based on RAPL (Running Average Power Limit) - an interface that provides built in CPU energy counters. While originally designed by Intel, AMD also provides a compatible interface on Zen systems.
The accuracy depends on the actual microarchitecture. Originally, RAPL was backed by a model with certain biases. On Intel CPUs since the Haswell architecture, it is based on measurements which are quite accurate. As far as I know there is no good understanding of the accuracy on AMD's Zen RAPL implementation.
One important thing you have to consider is the scope of the measurements. On most systems, only package and DRAM is covered1. So if you need to know how much power / energy your entire system consumes - you usually cannot easily answer that with RAPL.
Also note that RAPL is updated every 1 ms, so short workloads will have significant errors from the update rate.
1 - Skylake Desktop systems can implement a full-system RAPL. It's accuracy depends on the manufacturer.

What is the point of on-chip hardware accelerators, instead of that functionality being added as an instruction to the ISA?

I get that if a specialized operation is known to be common, it makes sense to do it in hardware. But at that point, why not make it a part of the ISA so it can be even faster?
Is there a benefit to making it a co-processor that communicates through shared memory?
This is a bit hand-wavy because I don't actually design hardware, but I think I know enough to say something that's at least plausible.
Adding it to the ISA means it has to be fairly tightly coupled to the pipeline, which doesn't fit well for things like integrated GPUs that have some specialized hardware and can filter out which pixels even need to be processed using dedicated hardware instead of software branching.
Even considering less complicated accelerators (e.g. for crypto):
Especially on simpler CPUs without out-of-order exec and large reordering windows, high-latency HW accelerators could stall the pipeline and stop it from getting other work done while waiting for a result.
Intel does tend to add things to the ISA, such as AES and SHA, because mainstream x86 CPUs do have the instruction throughput and vector registers to feed data to execution units that do one round of AES, for example.
If the accelerator is physically large but usually not needed by multiple cores at once, having groups of cores share one is more natural with some kind of co-processor arrangement to insulate the core from the round-trip latency of going off-core to compute something.
Also for GPUs, a GPU has more computational throughput than you can fit down the superscalar pipeline of a normal CPU. The FLOPS of an integrated GPU is typically much greater than a single core of a modern Intel CPU, even with 2x 256-bit FMA units. So you'd need to have a CPU instruction like "run shader" that runs a GPU program using its own separately-programmable machine code. GPU instruction scheduling is lighter weight than even a normal in-order CPU.

Measure the electricity consumed by a browser to render a webpage

Is there a way to calculate the electricity consumed to load and render a webpage (frontend)? I was thinking of a 'test' made with phantomjs for example:
load a web page
scroll to the bottom
And measure how much electricity was needed. I can perhaps extrapolate from CPU cycle. But phantomjs is headless, rendering in real browser is certainly different. Perhaps it's impossible to do real measurements.. but with an index it may be possible to compare websites.
Do you have other suggestions?
It's pretty much impossible to measure this internally in modern processors (anything more recent than 286). By internally, I mean by counting cycles. This is because different parts of the processor consume different levels of energy per cycle depending upon the instruction.
That said, you can make your measurements. Stick a power meter between the wall and the processor. Here's a procedure:
Measure the baseline energy usage, i.e. nothing running except the OS and the browser, and the browser completely static (i.e. not doing anything). You need to make sure that everything is stead state (SS) meaning start your measurements only after several minutes of idle.
Measure the usage doing the operation you want. Again, you want to avoid any start up and stopping work, so make sure you start measuring at least 15 seconds after you start the operation. Stopping isn't an issue since the browser will execute any termination code after you finish your measurement.
Sounds simple, right? Unfortunately, because of the nature of your measurements, there are some gotchas.
Do you recall your physics classes (or EE classes) that talked about signal to noise ratios? Well, a scroll down uses very little energy, so the signal (scrolling) is well in the noise (normal background processes). This means you have to take a LOT of samples to get anything useful.
Your browser startup energy usage, or anything else that uses a decent amount of processing, is much easier to measure (better signal to noise ratio).
Also, make sure you understand the underlying electronics. For example, power is VA (voltage*amperage) where both V and A are in phase. I don't think this will be an issue since I'm pretty sure they are in phase for computers. Also, any decent power meter understands the difference.
I'm guessing you intend to do this for mobile devices. Your measurements will only be roughly the same from processor to processor. This is due to architectural differences from generation to generation, and from manufacturer to manufacturer.
Good luck.

Ghz to MIPS? Rough estimate anyone?

From the research I have done so far I learned that there the MIPS is highly dependent upon the application being run, or the language.
But can anyone give me their best guess for a 2.5 Ghz computer in MIPS? Or any other number of Ghz?
C++ if that helps.
MIPS stands for "Million Instructions Per Second", but that value becomes difficult to calculate for modern computers. Many processor architectures (such as x86 and x86_64, which make up most desktop and laptop computers) fall into the CISC category of processors. CISC architectures often contain instructions that perform several different tasks at once. One of the consequences of this is that some instructions take more clock cycles than other instructions. So even if you know your clock frequency (in this case 2.5 gigahertz), the number of instructions run per second depends mostly on which instructions a program uses. For this reason, MIPS has largely fallen out of use as a performance metric.
For some of my many benchmarks, identified in
http://www.roylongbottom.org.uk/
I produce an assembly code listing from which actual assembler instructions used can be calculated (Note that these are not actual micro instructions used by the RISC processors). The following includes %MIPS/MHz calculations based on these and other MIPS assumptions.
http://www.roylongbottom.org.uk/cpuspeed.htm
The results only apply for Intel CPUs. You will see that MIPS results depend on whether CPU, cache or RAM data is being used. For a modern CPU at 2500 MHz, likely MIPS are between 1250 and 9000 using CPU/L1 cache but much less accessing data in RAM. Then there are SSE SIMD integer instructions. Real integer MIPS for simple register based additions are in:
http://www.roylongbottom.org.uk/whatcpu%20results.htm#anchorC2D
Where my 2.4 GHz Core 2 CPU is shown to run at up to 17531 MIPS.
Roy
MIPS officially stands for Million Instructions Per Second but the Hacker's Dictionary defines it as Meaningless Indication of Processor Speed. This is because many companies use the theoretical maximum for marketing which is never achieved in real applications. E.g. current Intel processors can execute up to 4 instructions per cycle. Following this logic at 2.5 GHz it achieves 10,000 MIPS. In real applications, of course, this number is never achieved. Another problem, which slavik already mentions, is that instructions do different amounts of useful work. There are even NOPs, which–by definition–do nothing useful yet contribute to the MIPS rating.
To correct this people began using Dhrystone MIPS in the 1980s. Dhrystone is a synthetical benchmark (i.e. it is not based on a useful program) and one Dhrystone MIPS is defined relative to the benchmark performance of a VAX 11/780. This is only slightly less ridiculous than the definition above.
Today, performance is commonly measured by SPEC CPU benchmarks, which are based on real world programs. If you know these benchmarks and your own applications very well, you can make resonable predictions of performance without actually running your application on the CPU in question.
They key is to understand that performance will vary widely based on a number of characteristics. E.g. there used to be a program called The Many Faces of Go which essentially hard codes knowledge about the Board Game in many conditional if-clauses. The performance of this program is almost entirely determined by the branch predictor. Other programs use hughe amounts of memory that does not fit into any cache. The performance of these programs is determined by the bandwidth and/or latency of the main memory. Some applications may depend heavily on the throughput of floating point instructions while other applications never use any floating point instructions. You get the idea. An accurate prediction is impossible without knowing the application.
Having said all that, an average number would be around 2 instructions per cycle and 5,000 MIPS # 2.5 GHz. However, real numbers can be easily ten or even a hundred times lower.

profiling, How to avoid FPU (hardware) in VS pro 2008 or 2012RC and use emulated FPU codes

I need to optimise some codes for Cortex-M3 processor which doesn't have FP unit. I'm completely new to domain of optimisation.anyways,I use VS 2012 Release Candidate for native compiling of codes on my pc(Intel Core i5, windows 7 as os)and then porting them to Cortex_M3.I tried to write my codes in a way that it uses as little as possible the floating point arithmetics.but I still have a few. so i know that when i embedd it in Cortex_M3, it will take advantage of emulated FPU codes instead (Software FPU). Since i'm not able to do profiling for cortex_m3, i did it on my PC using VS2012 (Instrumentation method) to verify which functions take more time and have to be more optimized.
I think that profiling results on my PC can be proportional to that of COrtex_M3 if i don't use FP unit of my PC.
Is there a keyword or way in Visual Studio (2008 pro. or 2012 RC) which allows me to skip the (hardware) FP unit?
your insights are very appreciated
Your PC is soooooo different to a Cortex M3 that optimisations performed there are unlikely to be of any relevance. Some of the differences:
PC can issue more than one instruction per cycle
PC has some billion of those cycles per second vs some tens of millions
PC likely has more cache than your M3 has RAM
As you observe - the floating point unit
The M3 is an embedded processor - if you can't profile in the traditional way either get a better toolset, so that you can, or do it by hand, by using the hardware timers in the device to time your functions. Or toggle some port pins and hang an oscilloscope off it - that's proper embedded :)
EDIT:
You can profile without an OS - higher-end embedded toolchains can instrument the code, run it and pull the results back for post-processing
There are other hardware timers than the watchdog. At the simplest level, write some functions to read the value before you perform some task, read the value afterwards, subtract the result and print it out. More complex schemes can also be done, logging many iterations and keeping track of statistics etc.
If you have a few port pins, just set one before the function(s) you want to profile, clear it when it completes.
With a 4 channel scope you can see the execution times (and when they happen relative to each other, which can be useful if one interrupts another) of 4 sections of code at a time. If you have more, get a logic analyser and you can do loads of them!
You can also see the jitter or variation in execution time which can be instructive. Try it on the libc trig functions as the angle varies, you'll see that at some angles the sin/cos functions (for example) take way longer to run than at other angles. This can be a significant problem in a real-time system.